A regular expression is a formula for matching strings that
follow some pattern. Many people are afraid to use them because
they can look confusing and complicated. However, with a little
practice, it's pretty easy to write the expressions needed for the
advanced web filters in Check&Get.
We have prepared some easy
examples that can help you understand how regular expressions
are used and start making your own web filters.
This document describes the formal regular expression
syntax used in Check&Get.
Table of Contents
- Regular Expression Syntax
- Order of Precedence
- Character Matching
- Bracket Expressions
- Quantifiers and Associated Meanings
- Anchors
- Alternation and Grouping
Regular Expression Syntax
A regular expression is a pattern of text that consists of
ordinary characters (for example, letters a through z) and special
characters, known as metacharacters. The pattern describes
one or more strings to match when searching a body of text. The
regular expression serves as a template for matching a character
pattern to the string being searched.
Here are some examples of regular expressions you might
encounter:
| Regular Expression | Matches |
| ^\s*$ | Matches a blank line. |
| \d{2}-\d{5} | Validates an ID number consisting of 2 digits, a hyphen, and another 5 digits. |
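The two expressions above can be tried out with a minimal JavaScript sketch. It assumes that JavaScript's built-in RegExp behaves like the Check&Get engine for these constructs; the variable names and sample strings are hypothetical. Note that the ID pattern as written is not anchored, so it also matches inside a longer string; the anchored variant is an illustrative addition, not part of the original table.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const blankLine = /^\s*$/;             // only whitespace between start and end
const idNumber = /\d{2}-\d{5}/;        // pattern exactly as listed in the table
const strictIdNumber = /^\d{2}-\d{5}$/; // hypothetical anchored variant

console.log(blankLine.test("   \t"));              // true
console.log(blankLine.test("  x "));               // false
console.log(idNumber.test("12-34567"));            // true
console.log(idNumber.test("id 12-345678"));        // true (matches a substring)
console.log(strictIdNumber.test("id 12-345678"));  // false
```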
The following table contains the complete list of metacharacters
and their behavior in the context of regular expressions:
| Character | Description |
| \ | Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(". |
| ^ | Matches the position at the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position following '\n' or '\r'. |
| $ | Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'. |
| * | Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}. |
| + | Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
| ? | Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}. |
| {n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food". |
| {n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
| {n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers. |
| ? | When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's. |
| . | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'. |
| (pattern) | Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parentheses characters ( ), use '\(' or '\)'. |
| (?:pattern) | Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. |
| (?=pattern) | Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead. |
| (?!pattern) | Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead. |
| x|y | Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". |
| [xyz] | A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain". |
| [^xyz] | A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain". |
| [a-z] | A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'. |
| [^a-z] | A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'. |
| \b | Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". |
| \B | Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never". |
| \cx | Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character. |
| \d | Matches a digit character. Equivalent to [0-9]. |
| \D | Matches a nondigit character. Equivalent to [^0-9]. |
| \f | Matches a form-feed character. Equivalent to \x0c and \cL. |
| \n | Matches a newline character. Equivalent to \x0a and \cJ. |
| \r | Matches a carriage return character. Equivalent to \x0d and \cM. |
| \s | Matches any whitespace character including space, tab, form-feed, etc. Equivalent to [ \f\n\r\t\v]. |
| \S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
| \t | Matches a tab character. Equivalent to \x09 and \cI. |
| \v | Matches a vertical tab character. Equivalent to \x0b and \cK. |
| \w | Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'. |
| \W | Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'. |
| \xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". Allows ASCII codes to be used in regular expressions. |
| \num | Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters. |
| \n | Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7). |
| \nm | Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exists, \nm matches octal escape value nm when n and m are octal digits (0-7). |
| \nml | Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7). |
| \un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). |
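A few of the less obvious table entries (lookaheads, backreferences, and the non-greedy '?') can be checked with a short JavaScript sketch. This assumes JavaScript's RegExp as a stand-in for the Check&Get engine; the constant names and the sample word "letter" are hypothetical. Because the example pattern contains a space after "Windows", the matched text includes that trailing space.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const positive = /Windows (?=95|98|NT|2000)/; // positive lookahead
const negative = /Windows (?!95|98|NT|2000)/; // negative lookahead
const doubled = /(.)\1/;                       // backreference to group 1

console.log(positive.test("Windows 2000"));  // true
console.log(positive.test("Windows 3.1"));   // false
console.log(negative.test("Windows 3.1"));   // true
console.log(negative.test("Windows 2000"));  // false

// The lookahead text ("2000") is checked but is not part of the match.
console.log(JSON.stringify("Windows 2000".match(positive)[0])); // "Windows "

console.log("letter".match(doubled)[0]);  // tt
console.log("oooo".match(/o+?/)[0]);      // o    (non-greedy)
console.log("oooo".match(/o+/)[0]);       // oooo (greedy)
```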
Order of Precedence
From highest to lowest, the order of precedence of the regular
expression operators is as follows:
| Operator(s) | Description |
| \ | Escape |
| (), (?:), (?=), [] | Parentheses and Brackets |
| *, +, ?, {n}, {n,}, {n,m} | Quantifiers |
| ^, $, \anymetacharacter | Anchors and Sequences |
| | | Alternation |
Characters have higher precedence than the alternation operator,
which allows 'm|food' to match "m" or "food". To match "mood" or
"food", use parentheses to create a subexpression, which results in
'(m|f)ood'.
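The effect of precedence on alternation can be seen in a small JavaScript sketch, assuming JavaScript's RegExp follows the same precedence rules as the Check&Get engine. The constant names are hypothetical.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const loose = /m|food/;     // alternation binds last: "m" OR "food"
const grouped = /(m|f)ood/; // parentheses limit the alternation to one letter

console.log("mood".match(loose)[0]);   // m
console.log("food".match(loose)[0]);   // food
console.log("mood".match(grouped)[0]); // mood
console.log("food".match(grouped)[0]); // food
```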
Character Matching
The period (.) matches any single printing or non-printing
character in a string, except a newline character (\n). The
following regular expression matches 'aac', 'abc', 'acc', 'adc',
and so on, as well as 'a1c', 'a2c', 'a-c', and 'a#c':
| a.c |
If you are trying to match a string in which a period (.) is part
of the input string, precede the period in the regular expression
with a backslash (\) character. To illustrate, the following
regular expression matches 'filename.ext':
| filename\.ext |
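As a rough illustration of both cases, the sketch below tests the patterns a.c and filename\.ext in JavaScript, on the assumption that Check&Get's engine treats the period the same way. The constant names and the counterexample strings are hypothetical.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const anyMiddle = /a.c/;            // "." matches any one character except a newline
const literalDot = /filename\.ext/; // "\." matches only a literal period

const samples = ["aac", "abc", "a1c", "a-c", "a#c"];
console.log(samples.every((s) => anyMiddle.test(s))); // true
console.log(anyMiddle.test("ac"));     // false (no middle character)
console.log(anyMiddle.test("a\nc"));   // false (newline is excluded)

console.log(literalDot.test("filename.ext"));      // true
console.log(literalDot.test("filenameXext"));      // false
console.log(/filename.ext/.test("filenameXext"));  // true (unescaped period)
```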
Bracket Expressions
You can create a list of matching characters by placing one or
more individual characters within square brackets ([ and ]). When
characters are enclosed in brackets, the list is called a
bracket expression. Within brackets, as anywhere else,
ordinary characters represent themselves, that is, they match an
occurrence of themselves in the input text. Most special characters
lose their meaning when they occur inside a bracket expression.
Here are some exceptions:
- The ']' character ends a list if it's not the first item. To
match the ']' character in a list, place it first, immediately
following the opening '['.
- The '\' character continues to be the escape character. To
match the '\' character, use '\\'.
Characters enclosed in a bracket expression match only a single
character for the position in the regular expression where the
bracket expression appears. The following regular
expression matches 'Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter
4', and 'Chapter 5':
| Chapter [12345] |
If you want to express the matching characters using a range
instead of the characters themselves, you can separate the
beginning and ending characters in the range using the hyphen (-)
character. The character value of the individual characters
determines their relative order within a range. The following
regular expression contains a range expression that is equivalent
to the bracketed list shown above:
| Chapter [1-5] |
When a range is specified in this manner, both the starting and
ending values are included in the range.
If you want to include the hyphen character in your bracket
expression, you must do one of the following:
- Escape it with a backslash: [\-]
- Put the hyphen at the beginning or the end of the bracketed
list: [-a-z] or [a-z-]
You can also find all the characters not in the list or range by
placing the caret (^) character at the beginning of the list. If
the caret character appears in any other position within the list,
it matches itself, that is, it has no special meaning. The
following regular expression matches chapter headings with numbers
greater than 5:
| Chapter [^12345] |
OR
| Chapter [^1-5] |
A typical use of a bracket expression is to specify matches of
any upper- or lowercase alphabetic characters or any digits. The
following regular expression specifies such a match:
| [A-Za-z0-9] |
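The bracket expressions discussed in this section can be compared side by side in JavaScript. This is a sketch under the assumption that JavaScript's RegExp matches the Check&Get engine's behavior; the constant names are hypothetical. One point worth noticing is that a negated list such as [^1-5] accepts any character outside the range, not only the digits above 5.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const listForm = /Chapter [12345]/;  // explicit list
const rangeForm = /Chapter [1-5]/;   // equivalent range
const negated = /Chapter [^1-5]/;    // anything except 1 through 5
const alnum = /[A-Za-z0-9]/;         // any letter or digit
const hyphenOrDigit = /[\-0-9]/;     // escaped hyphen is a literal

console.log(listForm.test("Chapter 3"));   // true
console.log(rangeForm.test("Chapter 5"));  // true (range ends are included)
console.log(rangeForm.test("Chapter 6"));  // false
console.log(negated.test("Chapter 7"));    // true
console.log(negated.test("Chapter 3"));    // false
console.log(negated.test("Chapter X"));    // true (not limited to digits)
console.log(alnum.test("_"));              // false
console.log(hyphenOrDigit.test("-"));      // true
```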
Quantifiers and Associated Meanings
Sometimes, you do not know how many characters there are to
match. In order to accommodate that kind of uncertainty, regular
expressions support the concept of quantifiers. These quantifiers
let you specify how many times a given component of your regular
expression must occur for your match to be true.
| Character | Description |
| * | Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}. |
| + | Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
| ? | Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}. |
| {n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food". |
| {n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
| {n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers. |
With a large input document, chapter numbers could easily exceed
nine, so you need a way to handle two or three digit chapter
numbers. Quantifiers give you that capability. The following
regular expression matches chapter headings with any number of
digits:
| Chapter [1-9][0-9]* |
If you know that your chapter numbers are limited to only 99
chapters, you can use the following regular expression to specify
at least one, but not more than two, digits:
| Chapter [0-9]{1,2} |
The disadvantage of the expression shown above is that if there
is a chapter number greater than 99, it will still only match the
first two digits. Another disadvantage is that somebody could
create a Chapter 0 and it would match. A better expression for
matching only two digits is the following:
| Chapter [1-9][0-9]{0,1} |
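The three chapter-number variants described above can be compared with a brief JavaScript sketch. It assumes JavaScript's RegExp as a stand-in for the Check&Get engine and assumes the patterns are Chapter [1-9][0-9]*, Chapter [0-9]{1,2}, and Chapter [1-9][0-9]{0,1}; the constant names are hypothetical.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const anyDigits = /Chapter [1-9][0-9]*/;         // any number of digits
const oneOrTwo = /Chapter [0-9]{1,2}/;           // one or two digits, 0 allowed
const noLeadingZero = /Chapter [1-9][0-9]{0,1}/; // one or two digits, no Chapter 0

console.log("Chapter 123".match(anyDigits)[0]);    // Chapter 123
console.log("Chapter 123".match(oneOrTwo)[0]);     // Chapter 12 (first two digits only)
console.log(oneOrTwo.test("Chapter 0"));           // true
console.log(noLeadingZero.test("Chapter 0"));      // false
console.log("Chapter 42".match(noLeadingZero)[0]); // Chapter 42
```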
The '*', '+', and '?' quantifiers are all referred to
as greedy, that is, they match as
much text as possible. Sometimes that is not what
you want to happen, and you need a minimal match instead.
Say, for example, you are searching an HTML document for an
occurrence of a chapter title enclosed in an H1 tag. That text
appears in your document as:
<H1>Chapter 1 – Introduction to Regular
Expressions</H1>
The following expression matches everything from the opening
less than symbol (<) to the greater than symbol (>) at the
end of the closing H1 tag:
| <.*> |
If all you really wanted to match was the opening H1 tag, the
following non-greedy expression matches only
<H1>:
| <.*?> |
By placing the '?' after a '*', '+', or '?' quantifier, the
expression is transformed from a greedy to a
non-greedy, or minimal, match.
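The difference between the greedy and non-greedy forms is easy to observe on the H1 example. The sketch below assumes the two expressions are <.*> and <.*?> and uses JavaScript's RegExp as a stand-in for the Check&Get engine; the variable name is hypothetical.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: ".*" runs to the last ">" in the string.
console.log(heading.match(/<.*>/)[0] === heading); // true

// Non-greedy: ".*?" stops at the first ">" it can.
console.log(heading.match(/<.*?>/)[0]); // <H1>
```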
Anchors
So far, the examples you've seen have been concerned only with
finding chapter headings wherever they occur. Any occurrence of the
string 'Chapter' followed by a space, followed by a number, could
be an actual chapter heading, or it could also be a cross-reference
to another chapter. Since true chapter headings always appear at
the beginning of a line, you'll need to devise a way to find only
the headings and not find the cross-references.
Anchors provide that capability. Anchors allow you to fix a regular
expression to either the beginning or end of a line. They also
allow you to create regular expressions that occur either within a
word or at the beginning or end of a word. The following table
contains the list of regular expression anchors and their
meanings:
| Character | Description |
| ^ | Matches the position at the beginning of the input string. |
| $ | Matches the position at the end of the input string. |
| \b | Matches a word boundary, that is, the position between a word and a space. |
| \B | Matches a nonword boundary. |
To match text at the beginning of a line of text, use the '^'
character at the beginning of the regular expression. Don't confuse
this use of the '^' with the use within a bracket expression.
They're definitely not the same.
To match text at the end of a line of text, use the '$'
character at the end of the regular expression.
To use anchors when searching for chapter headings, the
following regular expression matches a chapter heading with up to
two following digits that occurs at the beginning of a line:
| ^Chapter [1-9][0-9]{0,1} |
Not only does a true chapter heading occur at the beginning of a
line, it's also the only thing on the line, so it must be at
the end of the line as well. The following expression ensures that
the match you've specified only matches chapters and not
cross-references. It does so by creating a regular expression that
matches only at the beginning and end of a line of text:
| ^Chapter [1-9][0-9]{0,1}$ |
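To see the anchored expression separate headings from cross-references, here is a minimal JavaScript sketch. It assumes JavaScript's RegExp as a stand-in for the Check&Get engine; the sample text and names are hypothetical. The m flag is JavaScript's counterpart of the Multiline property mentioned earlier, so that ^ and $ apply to each line; whether Check&Get applies a filter line by line is not stated in this document.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const text = "Chapter 1\nSee Chapter 2 for details.\nChapter 12";

// g = find all matches, m = let ^ and $ match at line breaks.
const headingOnly = /^Chapter [1-9][0-9]{0,1}$/gm;
console.log(text.match(headingOnly)); // [ 'Chapter 1', 'Chapter 12' ]

// Without the m flag, ^ and $ refer to the whole input, so nothing matches.
console.log(/^Chapter [1-9][0-9]{0,1}$/.test(text)); // false
```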
Matching word boundaries is a little different but adds a very
important capability to regular expressions. A word boundary is the
position between a word and a space. A non-word boundary is any
other position. The following expression matches the first three
characters of the word 'Chapter' because they appear following a
word boundary:
| \bCha |
The position of the '\b' operator is critical here. If it's
positioned at the beginning of a string to be matched, it looks for
the match at the beginning of the word; if it's positioned at the
end of the string, it looks for the match at the end of the word.
For example, the following expression matches 'ter' in the word
'Chapter' because it appears before a word boundary:
| ter\b |
The following expression matches 'apt' as it occurs in 'Chapter',
but not as it occurs in 'aptitude':
| \Bapt |
That's because 'apt' occurs on a non-word boundary in the word
'Chapter' but on a word boundary in the word 'aptitude'. For the
non-word boundary operator, position isn't important because the
match isn't relative to the beginning or end of a word.
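The word-boundary cases above can be verified with a short JavaScript sketch, assuming the expressions are \bCha, ter\b, and \Bapt and that JavaScript's RegExp behaves like the Check&Get engine for \b and \B. The constant names are hypothetical.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const startOfWord = /\bCha/; // boundary before "Cha"
const endOfWord = /ter\b/;   // boundary after "ter"
const insideWord = /\Bapt/;  // "apt" not at the start of a word

console.log("Chapter".match(startOfWord)[0]); // Cha
console.log("Chapter".match(endOfWord)[0]);   // ter
console.log(insideWord.test("Chapter"));      // true
console.log(insideWord.test("aptitude"));     // false
console.log(/\bapt/.test("aptitude"));        // true
```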
Alternation and Grouping
Alternation uses the '|' character to allow a choice
between two or more alternatives. You can expand the chapter heading
regular expression to cover more than just
chapter headings. However, it's not as straightforward as you might
think. When alternation is used, the largest possible expression on
either side of the '|' character is matched. You might think that
the following expression matches either 'Chapter' or 'Section'
followed by one or two digits occurring at the beginning and ending
of a line:
| ^Chapter|Section [1-9][0-9]{0,1}$ |
Unfortunately, the regular expression
shown above matches either the word 'Chapter' at the beginning of a
line, or 'Section' and whatever numbers follow it at the end of
the line. If the input string is 'Chapter 22', the expression shown
above only matches the word 'Chapter'. If the input string is
'Section 22', the expression matches 'Section 22'. That is not
the intent here, so the alternation has to be limited to the two
words.
You can use parentheses to limit the scope of the alternation,
that is, make sure that it applies only to the two words, 'Chapter'
and 'Section'. However, parentheses are tricky as well, because
they are also used to create subexpressions, something that's
covered later in the section on subexpressions. By taking the
regular expression shown above and adding parentheses in the
appropriate places, you can make the regular expression match
either 'Chapter 1' or 'Section 3'.
The following regular expression uses parentheses to group
'Chapter' and 'Section' so the expression works properly:
| ^(Chapter|Section) [1-9][0-9]{0,1}$ |
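The ungrouped and grouped expressions can be compared directly in JavaScript. This is a sketch that assumes JavaScript's RegExp stands in for the Check&Get engine; the constant names and the 'See Section 22' sample are hypothetical. The last pattern uses the non-capturing form (?:...) from the syntax table, which groups the alternatives without storing a submatch.

```javascript
// Assumption: JavaScript's RegExp stands in for the Check&Get engine.
const ungrouped = /^Chapter|Section [1-9][0-9]{0,1}$/;
const grouped = /^(Chapter|Section) [1-9][0-9]{0,1}$/;
const nonCapturing = /^(?:Chapter|Section) [1-9][0-9]{0,1}$/;

console.log("Chapter 22".match(ungrouped)[0]); // Chapter
console.log("Section 22".match(ungrouped)[0]); // Section 22
console.log(ungrouped.test("See Section 22")); // true (the ^ applies only to "Chapter")

console.log("Chapter 22".match(grouped)[0]);   // Chapter 22
console.log("Section 3".match(grouped)[1]);    // Section (captured submatch)
console.log(grouped.test("See Section 22"));   // false

console.log("Section 3".match(nonCapturing).length); // 1 (no submatch stored)
```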