Do You Support Regex (Regular Expressions)?

THIS ARTICLE WILL HELP YOU:

Understand How Convert Handles Regular Expressions

Convert Experiments supports regular expressions (also known as regex) in Split URL testing and in segment creation. Below is a short summary of how to use them. You can also take a look at our blog article for regex examples and tools to generate your own patterns.

Regular expressions are a pattern matching standard for string parsing and replacement. They are used in a wide range of platforms and programming environments. Originally missing in Visual Basic, regular expressions are now available for most VB and VBA versions.

Regular expressions are a way to match text with patterns. They are a powerful way to find and replace strings that take a defined format. For example, regular expressions can be used to parse dates, URLs, email addresses, log files, configuration files, command line switches or programming scripts.

Since regular expressions are language independent, this article stays as language independent as possible. Note, however, that not all regex implementations are the same. The text below is based on Perl 5.0 syntax, which is also the format that RegExpr for VB/VBA uses. Some implementations may not handle all expressions the same way.

Regex syntax

In its simplest form, a regular expression is a string of symbols that match literally.

Regex   Input       Matched part
abc     abcabcabc   abc (the first occurrence)
234     12345       234

That's not very impressive yet. But you can see that regexes match the first case found, once, anywhere in the input string.
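As a minimal sketch of this behavior, the snippet below uses Python's standard re module. Python is chosen here purely for illustration and is not necessarily the engine your platform uses; the helper name first_match is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return (matched_text, start_index) for the first match, or None."""
    m = re.search(pattern, text)  # search scans the whole input, left to right
    return (m.group(0), m.start()) if m else None

print(first_match("abc", "abcabcabc"))  # ('abc', 0): only the first occurrence
print(first_match("234", "12345"))      # ('234', 1): found in the middle
print(first_match("abc", "xyz"))        # None: no match anywhere
```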

Quantifiers

So what if you want to match several characters? You need to use a quantifier. The most important quantifiers are *, ? and +. They may look familiar from wildcards such as those in the DOS dir command, but they do not work the same way.


* matches zero or more of what comes before it. 
? matches zero or one of what comes before it.
+ matches one or more of what comes before it.

Regex   Matches (matched part in parentheses)
23*4    1245 (24), 12345 (234), 123345 (2334)
23?4    1245 (24), 12345 (234)
23+4    12345 (234), 123345 (2334)
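The table above can be reproduced with a short script. This is an illustrative sketch using Python's re module (an assumption; other engines treat these three quantifiers the same way), and matched_parts is a hypothetical helper name.

```python
import re

INPUTS = ["1245", "12345", "123345"]

def matched_parts(pattern, texts):
    """Map each input to the text the pattern matched, or None if no match."""
    result = {}
    for text in texts:
        m = re.search(pattern, text)
        result[text] = m.group(0) if m else None
    return result

for pattern in ["23*4", "23?4", "23+4"]:
    print(pattern, matched_parts(pattern, INPUTS))
# 23*4 {'1245': '24', '12345': '234', '123345': '2334'}
# 23?4 {'1245': '24', '12345': '234', '123345': None}
# 23+4 {'1245': None, '12345': '234', '123345': '2334'}
```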


By default, regexes are greedy. They take as many characters as possible. In the next example, you can see that the regex matches as many 2's as there are.

Regex   Match (matched part in parentheses)
12*     122223 (12222)


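There is also lazy matching (also called non-greedy or stingy matching), which matches as few characters as possible. In Perl-style syntax it is written by adding ? after a quantifier, as in *? or +?. There are also more quantifiers than those mentioned above, such as {n,m} for a specific number of repetitions.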
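The difference between greedy and lazy matching is easiest to see side by side. The sketch below assumes Python's re module and the Perl-style ? suffix for lazy quantifiers; the helper name matched is hypothetical.

```python
import re

def matched(pattern, text):
    """Return the text matched by the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Greedy: take as many characters as possible.
print(matched("2+", "122223"))     # 2222
print(matched("1.*3", "1a3b3"))    # 1a3b3

# Lazy: take as few characters as possible.
print(matched("2+?", "122223"))    # 2
print(matched("1.*?3", "1a3b3"))   # 1a3
```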

Special characters

A lot of special characters are available for regex building. Here are some of the more usual ones.

. The dot matches any single character.
\n Matches a newline character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character or an underscore.
\W Matches any character that is not alphanumeric or an underscore.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, \. matches a dot, and \\ matches a backslash.
^ Match at the beginning of the input string.
$ Match at the end of the input string.


Here are some likely uses for the special characters.

Regex     Matches (matched part in parentheses where it is not the whole input)
1.3       123, 1z3, 133
1.*3      13, 123, 1zdfkj3
\d\d      01, 02, 99
\w+@\w+   a@a, email@company.com (email@company)
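A quick way to check what a special character really matches is to print the matched text. This sketch assumes Python's re module; matched_text is a hypothetical helper, and the input "room 99" is made up for illustration.

```python
import re

def matched_text(pattern, text):
    """Return the text matched by the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(matched_text(r"1.3", "1z3"))                    # 1z3: the dot matches any character
print(matched_text(r"1\.3", "1z3"))                   # None: \. matches only a literal dot
print(matched_text(r"\d\d", "room 99"))               # 99
print(matched_text(r"\w+@\w+", "email@company.com"))  # email@company: the dot is not a \w character
```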


^ and $ are important to regex. ^ indicates the start of the string and $ designates the end of the string. Without them, a regex matches anywhere in the input. With ^ and $ you can make sure to match only a full string, the beginning of the input, or the end of the input.

Regex     Matches            Does not match
^1.*3$    13, 123, 1zdfkj3   x13, 123x, x1zdfkj3x
^\d\d     01abc              a01abc
\d\d$     xyz01              xyz01a
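The effect of the anchors can be verified with a few checks. This is a sketch in Python's re module (an assumption, used for illustration); the helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# Anchored at both ends: the whole string must fit the pattern.
print(found(r"^1.*3$", "1zdfkj3"), found(r"^1.*3$", "x13"), found(r"^1.*3$", "123x"))
# True False False

# Anchored at the start only.
print(found(r"^\d\d", "01abc"), found(r"^\d\d", "a01abc"))  # True False

# Anchored at the end only.
print(found(r"\d\d$", "xyz01"), found(r"\d\d$", "xyz01a"))  # True False

# Without anchors the same digits are found anywhere.
print(found(r"\d\d", "a01abc"))  # True
```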

 

Character classes

You can group characters by putting them between square brackets. This way, any character in the class will match one character in the input.

[abc] Match any of a, b, and c.
[a-z] Match any character between a and z. (ASCII order)
[^abc] A caret ^ at the beginning of the square bracket indicates "not". In this case, match anything other than a, b, or c.
[+*?.] Most special characters have no meaning inside the square brackets. This expression matches literally any of +, *, ? or the dot.


Here are some sample uses.

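Regex               Matches                Does not match
[^ab]               c, d, z                ab
^[1-9][0-9]*$       Any positive integer   Zero, negative or decimal numbers
[0-9]*[,.]?[0-9]+   .1, 1, 1.2, 100,000    12.

In the last row, 12. fails only when the regex must cover the entire input. An unanchored search would still find the 12 at the start.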
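The following sketch checks the character class examples in Python's re module (an assumption, used for illustration). It uses re.fullmatch where the whole input must fit the pattern and re.search where a match anywhere is enough; the helper names are hypothetical.

```python
import re

def whole(pattern, text):
    """True if the pattern covers the entire text."""
    return re.fullmatch(pattern, text) is not None

def anywhere(pattern, text):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# [^ab] needs at least one character other than a or b.
print(anywhere(r"[^ab]", "cdz"), anywhere(r"[^ab]", "ab"))  # True False

# Positive integers only.
POSITIVE_INT = r"^[1-9][0-9]*$"
print([anywhere(POSITIVE_INT, s) for s in ["42", "0", "-7", "3.14"]])
# [True, False, False, False]

# Simple numbers with an optional comma or dot.
NUMBER = r"[0-9]*[,.]?[0-9]+"
print([whole(NUMBER, s) for s in [".1", "1", "1.2", "100,000", "12."]])
# [True, True, True, True, False]
print(anywhere(NUMBER, "12."))  # True: unanchored, the leading 12 still matches
```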

Grouping and alternatives

It's often necessary to group things together with parentheses ( and ).

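Regex      Matches                Does not match
(ab)+      ab, abab, ababab       aabb
(aa|bb)+   aa, bbaa, aabbaaaa     abab

Here a match means that the regex covers the whole string. An unanchored search for (ab)+ would still find ab inside aabb.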


Notice the | operator. This is the Or operator that takes any of the alternatives.

With parentheses, you can also define subexpressions to remember after the match has happened. In the examples below, the text matched by the part of the regex between ( and ) is stored.

Regex Matches Stores
a(\d+)a a12a 12
(\d+)\.(\d+) 1.2 1 and 2


In these examples, what is matched by (\d+) gets stored. The regex engine will allow you to retrieve the stored value by a successive call. The implementation of the call varies. In RegExpr for VB/VBA, you call RegExprResult(1) to get the first stored value, RegExprResult(2) to get the second one, and so on. This way you can retrieve fields for further processing.
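As an illustration of retrieving stored values, here is a sketch in Python's re module, where group(n) plays the role that RegExprResult(n) plays in RegExpr for VB/VBA. Python is an assumption here; the call in your own environment will differ.

```python
import re

# One stored subexpression.
m = re.search(r"a(\d+)a", "a12a")
print(m.group(0))  # a12a: the whole match
print(m.group(1))  # 12: the first stored value

# Two stored subexpressions.
m2 = re.search(r"(\d+)\.(\d+)", "1.2")
print(m2.group(1), m2.group(2))  # 1 2
print(m2.groups())               # ('1', '2'): all stored values at once
```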

Case sensitivity

So are regexes case sensitive? They can be either. It depends on how you write the regex call in your programming language, typically through a flag or option that switches case-insensitive matching on. Refer to the documentation of your programming language or regex implementation on how to write the calls.
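For example, in Python's re module (used here only as an illustration of one implementation), matching is case sensitive by default and becomes case insensitive with the re.IGNORECASE flag or the inline (?i) modifier. The helper name contains and the sample path are hypothetical.

```python
import re

def contains(pattern, text, flags=0):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text, flags) is not None

print(contains("checkout", "/Checkout"))                 # False: case sensitive by default
print(contains("checkout", "/Checkout", re.IGNORECASE))  # True: flag passed in the call
print(contains("(?i)checkout", "/Checkout"))             # True: flag written inside the regex
```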

Advanced syntax

The above is in no way a complete description of regexes. There are more ways to write them, more special characters, and more quantifiers available. What's available also depends on the implementation. Some regex engines don't implement all of the possibilities, which makes them less suitable for some purposes. In case you're interested in learning a more complete set of regexes, see the help file of RegExpr for VB/VBA. It's available for free download.

Regex examples

Here are a few practical examples of regular expressions. They are provided for learning purposes. In real applications, you should carefully design your regexes to match the exact use.

Email matching

It's often necessary to check if a string is an email address or not. Here's one way to do it.

^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$


Explanation

^[A-Za-z0-9_\.-]+ Match one or more acceptable characters at the start of the string.
@ Match the @ sign.
[A-Za-z0-9_\.-]+ Match the domain name, which may include dots.
[A-Za-z0-9_][A-Za-z0-9_]$ Match two acceptable characters that are not a dot or a hyphen. This ensures that the address does not end with a dot, so endings such as .xx, .xxx, .xxxx etc. are accepted. Note that the regex does not require the domain to contain a dot at all.


This example works for most cases but is not written based on any standard. It may accept non-working email addresses and reject working ones. Fine-tuning is required.
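To see both the strengths and the gaps of this pattern, it helps to run it against a few inputs. The sketch below assumes Python's re module; the sample addresses are made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$")

def looks_like_email(text):
    """True if text fits the article's email pattern."""
    return EMAIL_RE.search(text) is not None

print(looks_like_email("email@company.com"))  # True
print(looks_like_email("user@company."))      # False: ends with a dot
print(looks_like_email("no-at-sign.com"))     # False: no @ sign
print(looks_like_email("a@bcd"))              # True: accepted although the domain has no dot
```

The last line shows why fine-tuning is needed: the pattern accepts a domain without any dot.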

Parsing dates

Date strings are difficult to parse because there are so many variations. You can't always trust VB's own date conversion functions as the date may come in an unexpected or unsupported format. Let's assume we have a date string in the following format: 2002-Nov-14.

^\d\d\d\d-[A-Z][a-z][a-z]-\d\d$


Explanation

^\d\d\d\d Match four digits that make up the year.
- Match the separator dash.
[A-Z][a-z][a-z] Match a 3-letter month name. The first letter is in upper case.
- Match the separator dash.
\d\d$ Match two digits that make up the day.


If a match is found, you can be sure that the input string is formatted like a date. However, this regex is not able to verify that it's a real date. For example, it could as well be 5400-Qui-32. This doesn't look like an acceptable date to most applications. If you want to prepare yourself for the stranger dates, you'll have to write a more limiting expression:

^20\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(0[1-9]|[1-2][0-9]|3[01])$


Explanation

^20\d\d Match four digits that make up the year. The year must be between 2000 and 2099. No other dates please!
- Match the separator dash.
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) Match the month abbreviation in English. Now you don't accept the date in any other language.
- Match the separator dash.
(0[1-9]|[1-2][0-9]|3[01])$ Match two digits that make up the day. This accepts numbers from 01 to 09, 10 to 29 and 30 to 31. What if the user gives 2003-Feb-31? There are limitations to what regexes can do. If you want to validate the string further, you need to use other techniques than regexes.
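One such technique is to let the regex check the format and then hand the captured fields to a calendar-aware date type. The sketch below assumes Python's re and datetime modules; the function name parse_date is hypothetical, and the extra parentheses around the year are added so that it can be captured.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# The article's stricter pattern, with the year captured as well.
DATE_RE = re.compile(
    r"^(20\d\d)-(" + "|".join(MONTHS) + r")-(0[1-9]|[1-2][0-9]|3[01])$"
)

def parse_date(text):
    """Return a date if text is well formed and a real calendar date, else None."""
    m = DATE_RE.search(text)
    if m is None:
        return None  # wrong format
    year, month_name, day = m.groups()
    try:
        # date() raises ValueError for impossible dates such as Feb 31.
        return date(int(year), MONTHS.index(month_name) + 1, int(day))
    except ValueError:
        return None

print(parse_date("2002-Nov-14"))  # 2002-11-14
print(parse_date("2003-Feb-31"))  # None: passes the regex, fails the calendar check
print(parse_date("5400-Qui-32"))  # None: rejected by the regex
```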

Web logs

Web server logs come in several formats. This is a typical line in a log file.

144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] "GET /resources.html HTTP/1.1" 200 3458 "http://www.aivosto.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


The line contains several fields that differ from each other and together conform to a complex format. A human-readable way to define the various fields is here:

host - - [date] "GET URL HTTP/1.1" status size "ref" "agent"


As you can see, there are fields such as host (visitor's Internet address), date and time (enclosed in square brackets), an HTTP request with file to retrieve (enclosed in quotation marks), numeric status code, numeric size of file, referer field (enclosed in quotation marks), and agent (browser) name (enclosed in quotation marks).

To programmatically parse the line, you need to split it into fields, then look at each field. This is a sample regex that will split the fields.

^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?


Explanation

^(\S*) Match any number of non-space characters at the start of the line.
- - Match the two dashes. They are actually empty fields that might have content in another log file.
\[(.*) .....\] Match the date inside square brackets. The date consists of a datetime string, a space, and a 5-character time zone indication. To actually use the date you'd need to write a more detailed regex to separate the year, month, day, hour, minute, and second.
\"....? (\S*) .*\" Match the HTTP request inside quotation marks. First there is a 3 to 4-character verb, such as GET, POST or HEAD. (\S*) matches the actual file that is being retrieved. At the end, .* matches HTTP/1.1 or whatever protocol was used to retrieve the file.
(\d*) Match a numeric status code.
([-0-9]*) Match a numeric file size, or - if no number is present.
(\"([^"]+)\")? Match the "ref" field. It's anything enclosed in quotation marks.
  In this example, we've left "agent" unmatched. That does no harm because $ is not used to match the end-of-line. We can leave "agent" unmatched if we're not interested in the field.


This example has been taken from a web log file parser script. To use it in your own code, you have to fine-tune it to suit your log file format. The regex assumes that lines come only in the presented format. If, say, a field is missing or the file contains garbage lines, the regex may fail. This results in an unparsed line.
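As a rough sketch of how the regex above could be used, the snippet below applies it to the sample log line with Python's re module (an assumption; the original parser was not written in Python). The function name parse_line and the dictionary keys are hypothetical.

```python
import re

# The article's regex, unchanged.
LOG_RE = re.compile(
    r'^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?'
)

def parse_line(line):
    """Split one log line into named fields, or return None if it does not fit."""
    m = LOG_RE.search(line)
    if m is None:
        return None  # unparsed line: missing field or garbage
    return {
        "host": m.group(1),
        "date": m.group(2),
        "url": m.group(3),
        "status": m.group(4),
        "size": m.group(5),
        "referer": m.group(7),  # group 6 is the same value including the quotes
    }

SAMPLE = ('144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] '
          '"GET /resources.html HTTP/1.1" 200 3458 '
          '"http://www.aivosto.com/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

fields = parse_line(SAMPLE)
print(fields["host"], fields["url"], fields["status"], fields["size"])
print(parse_line("garbage line"))  # None
```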

Source: http://www.aivosto.com/vbtips/regex.html

Multiple URLs

Some tests need the experience to run on multiple URLs. You can use regex to construct a URL pattern that covers all of these URLs in a single "Matches Exactly Regex" condition.

Let's assume you want the experience to run on the following URLs:

https://www.abc.com/ex/checkout

https://www.abc.com/ed/checkout

https://www.abc.com/ac/checkout

The regex to match these URLs would be: https://www.abc.com/([a-zA-Z0-9]+)/checkout 

Explanation

https://www.abc.com This matches the domain (common to all URLs). Note that an unescaped . matches any character; write \. if only a literal dot should match.
/ Match the / forward slash.
([a-zA-Z0-9]+) Match any combination of letters or numbers, i.e. ex, ac, ed, etc.
/ Match the / forward slash.
checkout Match the part that comes at the end of the URL and is common to all the URLs in this example.
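Before saving a condition, it can be useful to test the pattern against the URLs you expect to include and exclude. The sketch below uses Python's re.fullmatch as a stand-in for an exact match; this is an assumption and may not reflect precisely how the "Matches Exactly Regex" condition is evaluated. STRICT_PATTERN, the helper name, and the extra test URLs are hypothetical.

```python
import re

PATTERN = r"https://www.abc.com/([a-zA-Z0-9]+)/checkout"
# Variant with escaped dots, so that only literal dots are accepted.
STRICT_PATTERN = r"https://www\.abc\.com/([a-zA-Z0-9]+)/checkout"

def matches_exactly(pattern, url):
    """True if the pattern covers the entire URL."""
    return re.fullmatch(pattern, url) is not None

targets = [
    "https://www.abc.com/ex/checkout",
    "https://www.abc.com/ed/checkout",
    "https://www.abc.com/ac/checkout",
]
print([matches_exactly(PATTERN, u) for u in targets])  # [True, True, True]

print(matches_exactly(PATTERN, "https://www.abc.com/checkout"))           # False: no middle segment
print(matches_exactly(PATTERN, "https://www.abc.com/ex/checkout/step2"))  # False: extra path at the end

# The unescaped dot also accepts other characters in its place.
print(matches_exactly(PATTERN, "https://wwwXabc.com/ex/checkout"))         # True
print(matches_exactly(STRICT_PATTERN, "https://wwwXabc.com/ex/checkout"))  # False
```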