|
A regular expression is a
formula for matching strings that follow some pattern. Many people
avoid them because they can look confusing and
complicated. However, with a little practice, it is fairly easy to
write the expressions needed for advanced web filters in Check&Get.
Here are some simple examples to help you
understand how regular expressions are used and start making your own
web filters.
Ignoring Date/Time on a Web Page:
|
Date/Time Format
|
Regular Expression
|
Explanation
|
| April 2, 2006 |
\w+ \d{1,2}, \d{4} |
|
\w+
|
matches any word (January, May, etc.) |
| \d{1,2} |
matches one or two digits (01, 22) |
| , |
matches comma |
| \d{4} |
matches four digits |
|
| January 28, 2006, 09:07:19 AM |
\w+ \d{1,2}, 200\d, \d\d:\d\d:\d\d (AM|PM) |
|
\w+
|
matches any word (April, May, etc.) |
| \d{1,2} |
matches one or two digits (1 or 33) |
| , |
matches comma (,) |
| 200\d |
matches 200 followed by any digit (2006, 2000, 2008) |
| \d\d: |
matches two digits and : (23:, 12:, 01:) |
| (AM|PM) |
matches the word AM or PM |
|
| 17-Jan-2006 |
\d\d-\w{3}-\d{4} |
|
\d\d
|
matches any two digits (01, 22, etc.) |
| - |
matches the "-" character |
| \w{3} |
matches three word characters (Jan, Feb, ZZZ, etc.) |
| - |
matches the "-" character |
| \d{4} |
matches any four digits |
|
| 2006/06/06 12:38:51 |
\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} |
| \d{4} |
matches any four digits |
| / |
matches the "/" character |
|
\d{2}
|
matches any two digits (01, 22, etc.) |
| : |
matches the ":" character |
|
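The date and time expressions above can be tried outside Check&Get before they are used in a filter. The following is a minimal sketch in Python, assuming that its "re" module treats these constructs the same way as the Check&Get engine (an assumption, not something stated on this page). The mask helper, the "<ignored>" placeholder, and the sample page text are hypothetical and only illustrate how two versions of a page become identical once the date is ignored.

```python
import re

# (sample text, pattern) pairs copied from the table above.
EXAMPLES = [
    ("April 2, 2006", r"\w+ \d{1,2}, \d{4}"),
    ("January 28, 2006, 09:07:19 AM",
     r"\w+ \d{1,2}, 200\d, \d\d:\d\d:\d\d (AM|PM)"),
    ("17-Jan-2006", r"\d\d-\w{3}-\d{4}"),
    ("2006/06/06 12:38:51", r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"),
]


def mask(text, pattern, placeholder="<ignored>"):
    """Replace every match of pattern so it no longer affects a comparison.

    Hypothetical helper for illustration; not part of Check&Get.
    """
    return re.sub(pattern, placeholder, text)


for sample, pattern in EXAMPLES:
    # fullmatch succeeds only if the pattern covers the entire sample.
    assert re.fullmatch(pattern, sample), (sample, pattern)

# Two versions of a made-up page that differ only in the date.
old_page = "Last updated: April 2, 2006"
new_page = "Last updated: May 14, 2006"

date_pattern = EXAMPLES[0][1]
same = mask(old_page, date_pattern) == mask(new_page, date_pattern)
print(same)
```

Note that \w+ accepts any word, not only month names, so a pattern like this also matches text such as "Room 12, 2006". That is usually acceptable for a filter whose only job is to hide changing dates.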
Ignoring Counters:
|
Counter Format
|
Regular Expression
|
Explanation
|
| 12432 Visitors |
\d+ Visitors |
|
\d+
|
matches one or more digits |
| Visitors |
matches the word "Visitors" |
|
| Activity: 78.2% |
Activity: \d+\.\d+% |
|
Activity:
|
matches the text "Activity:" |
| \d+ |
matches one or more digits (1 or 33) |
| \. |
matches a literal dot character (.) |
| \d+ |
matches one or more digits (1 or 33) |
| % |
matches the percent character (%) |
|
| (c) 1999-2006 ActiveURLs. All Rights Reserved. |
\(c\) \d{4}-\d{4} |
|
\(c\)
|
matches the text "(c)" |
| \d{4}- |
matches four digits and the "-" character |
| \d{4} |
matches four digits |
| |
This filter ignores changes in the
copyright year range (1999-2005, 1998-2006, etc.) |
|
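The counter expressions above can be checked in the same way. This is a small sketch in Python, assuming its "re" module behaves like the Check&Get engine for these constructs (an assumption). The ignore_counters helper and the two sample page texts are made up for illustration. The final check shows why the dot is escaped: \. accepts only a literal dot, while an unescaped dot would accept any character.

```python
import re

# Patterns copied from the table above.
COUNTER_PATTERNS = [
    r"\d+ Visitors",
    r"Activity: \d+\.\d+%",
    r"\(c\) \d{4}-\d{4}",
]


def ignore_counters(text):
    """Remove every counter match from text (hypothetical helper)."""
    for pattern in COUNTER_PATTERNS:
        text = re.sub(pattern, "", text)
    return text


# Two made-up page versions that differ only in their counters.
before = ("12432 Visitors\n"
          "Activity: 78.2%\n"
          "(c) 1999-2006 ActiveURLs. All Rights Reserved.")
after = ("12507 Visitors\n"
         "Activity: 80.15%\n"
         "(c) 1999-2007 ActiveURLs. All Rights Reserved.")

unchanged = ignore_counters(before) == ignore_counters(after)
print(unchanged)

# The escaped dot rejects "78x2%"; an unescaped dot would accept it.
escaped = re.search(r"Activity: \d+\.\d+%", "Activity: 78x2%")
unescaped = re.search(r"Activity: \d+.\d+%", "Activity: 78x2%")
print(escaped is None, unescaped is not None)
```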
Ignoring Advertisements:
The following example shows how to ignore a typical text
advertisement, like this one:
|
SPONSORED LINKS
Get a $200,000 Loan for
$770/month
Fill out 1 form, and regardless of credit receive up to 4 loan
offers in minutes from our certified lenders. When Banks Compete,
You Win!
Mortgage solutions that fit your
needs.
Comparing rates of fixed rate loans and adjustable rate mortgages?
We'll match the interest rate quoted by any competing lender for
products with the same terms AND we'll beat the other lender's fees
...
Buy a Link Now »
|
We need to ignore the text between the words "SPONSORED
LINKS" and "Buy a Link Now".
The following regular expression does this job:
|
Regular Expression
|
Explanation
|
| SPONSORED LINKS[\w\W]+?Buy a Link Now |
|
SPONSORED LINKS
|
matches the words "SPONSORED LINKS" |
| [\w\W]+ |
matches one or more characters of any kind, including
line feeds |
| ? |
"?" makes the expression [\w\W]+ non-greedy
(it matches as little text as possible) |
| Buy a Link Now |
matches the words "Buy a Link Now" |
|
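The effect of the "?" after [\w\W]+ is easiest to see on a page that contains two advertisement blocks. The sketch below uses Python's "re" module, assuming it handles non-greedy matching the same way as Check&Get (an assumption). The sample page text and the GREEDY variant are made up for comparison and do not appear in the original filter.

```python
import re

# Expression from the table above, plus a greedy variant for comparison.
LAZY = r"SPONSORED LINKS[\w\W]+?Buy a Link Now"
GREEDY = r"SPONSORED LINKS[\w\W]+Buy a Link Now"

# Made-up page with two advertisement blocks around real content.
page = (
    "Headline A\n"
    "SPONSORED LINKS\nFirst ad text\nBuy a Link Now\n"
    "Headline B\n"
    "SPONSORED LINKS\nSecond ad text\nBuy a Link Now\n"
    "Footer\n"
)

# Non-greedy: each block ends at the nearest "Buy a Link Now".
lazy_result = re.sub(LAZY, "", page)

# Greedy: one match runs from the first "SPONSORED LINKS" to the last
# "Buy a Link Now", so the content between the two blocks is lost too.
greedy_result = re.sub(GREEDY, "", page)

print("Headline B" in lazy_result)
print("Headline B" in greedy_result)
```

With the non-greedy form, "Headline B" survives because each advertisement is removed separately. With the greedy form it is swallowed together with both advertisements.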
|
Note: Check&Get provides an easy-to-use
"Select and
Ignore" feature for ignoring advertisements like the one shown in
this example. You do not need to create such
web filters manually; Check&Get will do this job for
you automatically. Nevertheless, you can use the methods
described above to create highly customizable web filters that
can handle nearly any filtering task. |