How to Use Regular Expressions in Host Integrator


A regular expression (regex or regexp) is a special text string used to describe a search pattern, according to certain syntax rules. For example, l[0-9]+ matches "l" followed by one or more digits.

Use regular expressions judiciously and only when necessary. If you require even greater complexity than regular expressions can support, consider using event handlers instead. Using regular expressions, or event handlers, indiscriminately can result in significant performance overhead.

Host Integrator regular expressions are based on Perl syntax. Host Integrator supports regular expressions for:

For an introduction to regular expressions, see the Perl programming documentation: Perl regular expressions quick start.

Regular Expression Syntax in Host Integrator

This is the complete syntax supported in Host Integrator. If you do not find a character listed in these tables, then it is not part of the regular expression syntax supported by Host Integrator.

Special characters

Certain characters are reserved for special use. If you want to use any of these characters as a literal in a regular expression, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.

\/ literal /
\\ literal \
\. literal .
\* literal *
\+ literal +
\? literal ?
\| literal |
\( literal (
\) literal )
\[ literal [
\] literal ]
\- The - must be escaped inside brackets: [a-z0-9_.\-\?!]
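The effect of escaping can be sketched with Python's re module, which follows Perl-style syntax for these constructs. Python is an assumption made for illustration only; it is not the Host Integrator regex engine, and details may differ there.

```python
import re

text = "1+1=2"

# Unescaped: "+" is a quantifier, so 1+1=2 means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", text)

# Escaped: "\+" is a literal plus sign, so the pattern matches the text as written.
escaped = re.search(r"1\+1=2", text)

# The unescaped pattern matches a different string than intended.
unintended = re.search(r"1+1=2", "11=2")

# Inside brackets the hyphen is escaped so it is not read as a range operator.
bracket = re.fullmatch(r"[a-z0-9_.\-\?!]+", "what-now?")

print(unescaped)               # None
print(escaped.group())         # 1+1=2
print(unintended.group())      # 11=2
print(bracket is not None)     # True
```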
Character Description Example
// Used to search a string for a match. "Hello World" =~ /World/;
In this statement, World is a regex and the enclosing // tells Perl to search a string for a match. The operator =~ associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. In this case, World matches the second word in "Hello World", so the expression is true.
\ (backslash) Escape character used to represent characters that would otherwise be a part of a regular expression. "\." = the period character.
[abc] Match any character listed within the square brackets. [abc] matches a, b or c
\d, \w, and \s Shorthand character classes matching digits 0-9, word characters (letters, digits, and the underscore), and white space, respectively. Can be used inside and outside character classes. [\d\s] matches a character that is a digit or whitespace
\D,\W, and \S Negated versions of the above. Should be used only outside character classes. \D matches a character that is not a digit
\b Word boundary. Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W). Use to perform a "whole words only" search using a regular expression in the form of \bword\b.

\b also matches at the start or end of the string if the first or last characters in the string are word characters.

\b4\b matches 4 that is not part of a larger number.
\B Non-word boundary. \B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters. \B.\B matches b in abc
. (period) Match any single character "." matches x or any other character.
x (regular character) Match an instance of character "x". x matches x
[^x] Match any character except for character "x". [^a-d] matches any character except a, b, c, or d
^ (caret) Match the beginning of a string. Matches a position rather than a character. ^. matches a in abc\ndef. Also matches d in "multi-line" mode.
$ (dollar) Match the end of a string. Matches a position rather than a character. Also matches before the very last line break if the string ends with a line break. .$ matches f in abc\ndef. Also matches c in "multi-line" mode.
| (pipe) Or. Match either the part on the left side, or the part on the right side. Can be strung together into a series of options.

The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.

abc|def|xyz matches abc, def or xyz

abc(def|xyz) matches abcdef or abcxyz

(abc) (parentheses) Used to group sequences of characters or expressions. (Larry|Moe|Curly) Howard matches Larry Howard, Moe Howard, or Curly Howard
\1, $1, $$ \1 Refers to first grouping, used in the expression
$1 Refers to first grouping, used in the replacement string
$$ Literal “$” used in the replacement string.
/(.+)((\r?\n|\r)\1)+\b/ig,“$1” Removes duplicate lines from a list. The (.+) grabs a line of text and the parentheses save it for a reference. The (\r?\n|\r) grabs the line separator, either \r\n, \n, or \r. Next, \1 references the first line, so ((\r?\n|\r)\1)+ matches one or more subsequent lines that match the first line. Notice that in JavaScript, a reference within the expression is \1 while a reference in the replacement string is $1. The \b prevents “street” and “streets” from being seen as the same word.
{ } (curly braces) Used to define numeric quantifiers. a{3} matches aaa
{N,} Match must occur at least "N" times Z{1,} matches when "Z" occurs at least once
{N,M} Match must occur at least "N" times, but no more than "M" times a{2,4} matches aa, aaa or aaaa
? (question mark) Makes the preceding item optional or once only. The optional item is included in the match if possible. abc? matches ab or abc
* (star) Match on zero or more of the preceding match. Repeats the previous item zero or more times. As many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all. "go*gle" matches ggle, gogle, google, gooogle, and so on.
+ (plus) Match on 1 or more of the preceding match. Repeats the previous item once or more. As many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once. "go+gle" matches gogle, google, gooogle, and so on (but not ggle).
i Case insensitive search /expression/i
g Global replacement. Replaces all matches. /expression/g
q(?=u) Matches q only before u. Does not match the u. This is positive lookahead. The u is not part of the overall regex match. The lookahead matches at each position in the string before a u. q(?=u) matches the "q" in question, but not in Iraq.
q(?!u) Matches q except before u. This is negative lookahead. q(?!u) matches "q" in Iraq but not in question.
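
Several of the table entries can be tried out in a short sketch. It uses Python's re module as a stand-in for the Perl-style syntax; this is an assumption for illustration, not the Host Integrator engine, and the sample strings are hypothetical.

```python
import re

# Grouping and alternation: the group limits what "|" alternates.
stooges = re.compile(r"(Larry|Moe|Curly) Howard")
grouped = [bool(stooges.fullmatch(s)) for s in ("Moe Howard", "Shemp Howard")]

# \b gives a "whole words only" match: the 4 inside 42 or 14 is skipped.
whole_fours = re.findall(r"\b4\b", "4 42 14 4")

# ^ and $ match positions; re.MULTILINE stands in for "multi-line" mode.
text = "abc\ndef"
string_start = re.findall(r"^.", text)
line_starts = re.findall(r"^.", text, re.MULTILINE)
line_ends = re.findall(r".$", text, re.MULTILINE)

# Quantifiers: * allows zero repeats, + requires at least one.
words = ["ggle", "gogle", "google"]
star = [w for w in words if re.fullmatch(r"go*gle", w)]
plus = [w for w in words if re.fullmatch(r"go+gle", w)]

# Lookahead tests what follows the q without including it in the match.
q_before_u = [bool(re.search(r"q(?=u)", w)) for w in ("question", "Iraq")]
q_not_before_u = [bool(re.search(r"q(?!u)", w)) for w in ("question", "Iraq")]

# Backreference: \1 inside the pattern. Python's replacement string also
# uses \1 where the table's replacement syntax uses $1.
deduped = re.sub(r"(.+)((\r?\n|\r)\1)+\b", r"\1",
                 "apple\napple\nbanana", flags=re.IGNORECASE)

print(grouped)                    # [True, False]
print(whole_fours)                # ['4', '4']
print(string_start, line_starts)  # ['a'] ['a', 'd']
print(line_ends)                  # ['c', 'f']
print(star, plus)
print(q_before_u, q_not_before_u)
print(repr(deduped))              # 'apple\nbanana'
```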

For information about additional pattern matching operators for conditions and filters, see Condition Edit/Filter String Edit.

Examples of Regular Expressions

Matching examples

Matches when an error message is displayed on the status line.
ERROR [0-9]{1,4}: .*
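
As a rough check of how this pattern behaves, the sketch below runs it against a few hypothetical status lines using Python's re module. Python and the sample messages are assumptions for illustration; Host Integrator evaluates the pattern with its own engine.

```python
import re

# The pattern from the example: "ERROR", a 1 to 4 digit code, a colon, then any text.
ERROR_PATTERN = re.compile(r"ERROR [0-9]{1,4}: .*")

# Hypothetical status-line contents.
status_lines = [
    "ERROR 27: Invalid account number",
    "ERROR 12345: Code has too many digits",
    "Ready",
]

results = [bool(ERROR_PATTERN.search(line)) for line in status_lines]
print(results)  # [True, False, False]
```

The second line fails because {1,4} allows at most four digits before the colon.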

Match 3 consecutive instances of a string.
"/(John){3}/" (Matches JohnJohnJohn)

Match any of several first names, followed by a common last name.
"(Homer|Marge|Bart|Lisa|Maggie) Simpson" (Matches any member of the Simpson family)

Condition matching "Page N of M" when N = M.
PageStatus =~ s/Page ([0-9]+) of [0-9]+/$1/ = PageStatus =~ s/Page [0-9]+ of ([0-9]+)/$1/
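
The logic of this condition can be sketched in Python: each substitution reduces the text to one captured number, and the two results are compared. The function name is_last_page is hypothetical, and Python's re.sub (with \1 in place of $1) is an assumed stand-in for the s/// substitution.

```python
import re

def is_last_page(page_status: str) -> bool:
    """Return True when N equals M in a 'Page N of M' string."""
    # Left side of the condition: keep only N.
    current = re.sub(r"Page ([0-9]+) of [0-9]+", r"\1", page_status)
    # Right side of the condition: keep only M.
    total = re.sub(r"Page [0-9]+ of ([0-9]+)", r"\1", page_status)
    return current == total

print(is_last_page("Page 3 of 3"))   # True
print(is_last_page("Page 2 of 5"))   # False
```

In this Python sketch, a string that does not contain the "Page N of M" text is left unchanged by both substitutions and therefore compares equal, so the input format should be checked separately if that case can occur.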

Recordset condition to match records where myfield starts with "P".
myrecordset.myfield =~ m/^P.*$/

Recordset condition where the field is not numeric.
myrecordset.myrecordsetfield =~ /[^0-9]/
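
The sketch below shows how anchoring and negated character classes change which records are selected. It filters hypothetical records with Python's re module, which is an assumption for illustration rather than the way Host Integrator evaluates recordset conditions.

```python
import re

# Hypothetical records; the field names are illustrative only.
records = [
    {"name": "Parker", "amount": "120"},
    {"name": "McPhee", "amount": "N/A"},
    {"name": "Lopez", "amount": "12a"},
]

# Anchored with ^: only values that begin with "P".
starts_with_p = [r["name"] for r in records if re.search(r"^P", r["name"])]

# Unanchored: any value that contains an uppercase "P" anywhere.
contains_p = [r["name"] for r in records if re.search(r"P.*$", r["name"])]

# [^0-9] matches when at least one non-digit character is present.
not_numeric = [r["amount"] for r in records if re.search(r"[^0-9]", r["amount"])]

# [0-9]+ without anchors matches when a digit appears anywhere.
has_digit = [r["amount"] for r in records if re.search(r"[0-9]+", r["amount"])]

print(starts_with_p)  # ['Parker']
print(contains_p)     # ['Parker', 'McPhee']
print(not_numeric)    # ['N/A', '12a']
print(has_digit)      # ['120', '12a']
```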

Read and write substitution examples

For substitution examples, see Read or Write Substitutions (Attribute or Field). You can also walk through a step-by-step exercise using the CCSDemo sample model, Substituting a Regular Expression for a Recordset Field.

Additional Resources

Regular expressions can be complex. A number of resources are available on the Internet to help you understand regular expressions.

 

Related Topics
Substituting a Regular Expression for a Recordset Field
Read or Write Substitutions (Attribute or Field)
Condition Edit or Filter String Edit