Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern.
REs are similar to (but more powerful than) the "wildcards" used in the command-line interfaces found in operating systems such as Unix and MS-DOS. REs are used by sophisticated search engines, as well as by many Unix-based languages and tools (e.g., awk, grep, lex, perl, and sed).
Examples
Pattern                                          Meaning
compan(y|ies)                                    Search for company, companies
(peter|paul)                                     Search for peter, paul
bug*                                             Matches bu followed by any number of g (bu, bug, bugg). An unanchored search therefore finds bug, bugs, bugfix, but also but
[Bb]ag                                           Search for Bag, bag
b[aiueo]g                                        Second letter is a vowel. Matches bag, bug, big
b.g                                              Second character is any single character. Also matches b&g
[a-zA-Z]                                         Matches any one letter (not a digit or a symbol)
[^0-9a-zA-Z]                                     Matches any one character that is not a digit or a letter
[A-Z][A-Z]*                                      Matches one or more uppercase letters
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]  US social security number, e.g. 123-45-6789
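The example patterns can be tried out in most regex engines. The following is a minimal sketch in Python's `re` module, whose syntax agrees with the examples here but is not identical to grep's in every detail. The helper `whole_matches` and the sample words are made up for illustration.

```python
import re

def whole_matches(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(whole_matches(r"compan(y|ies)", ["company", "companies", "compan"]))
# ['company', 'companies']

# '*' applies only to the preceding 'g', so 'bugs' is not matched in full.
print(whole_matches(r"bug*", ["bu", "bug", "bugg", "bugs"]))
# ['bu', 'bug', 'bugg']

print(whole_matches(r"b[aiueo]g", ["bag", "big", "bug", "b&g"]))
# ['bag', 'big', 'bug']

# '.' stands for exactly one character, so 'bg' does not match.
print(whole_matches(r"b.g", ["bag", "b&g", "bg"]))
# ['bag', 'b&g']

ssn = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
print(whole_matches(ssn, ["123-45-6789", "123-456-789"]))
# ['123-45-6789']

# An unanchored search only needs the pattern to occur somewhere in the text.
print(bool(re.search(r"bug*", "bugfix")), bool(re.search(r"bug*", "butter")))
# True True
```

The last line shows why `bug*` is a loose search term: it finds every text containing `bu`.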
Here is stuff for our UNIX freaks:
(copied from 'man grep')
\c        A backslash (\) followed by any special character is a
          one-character regular expression that matches the special
          character itself. The special characters are:
          +  `.', `*', `[', and `\' (period, asterisk, left square
             bracket, and backslash, respectively), which are always
             special, except when they appear within square
             brackets ([]).
          +  `^' (caret or circumflex), which is special at the
             beginning of an entire regular expression, or when it
             immediately follows the left of a pair of square
             brackets ([]).
          +  `$' (currency symbol), which is special at the end of
             an entire regular expression.
.         A `.' (period) is a one-character regular expression
          that matches any character except NEWLINE.
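As a rough illustration of backslash escaping, here is a short sketch using Python's `re` module rather than grep itself. The sample strings are invented, and Python treats a few more characters as special than the list above does, so this only demonstrates the general idea.

```python
import re

# Unescaped, '.' matches any character, so the pattern also accepts '5x00'.
print(bool(re.fullmatch(r"5.00", "5x00")))     # True

# Escaped with a backslash, '.' matches only a literal period.
print(bool(re.fullmatch(r"5\.00", "5x00")))    # False
print(bool(re.fullmatch(r"5\.00", "5.00")))    # True

# A literal asterisk is escaped the same way.
print(re.findall(r"a\*b", "a*b aab ab"))       # ['a*b']

# A literal backslash is written as two backslashes in the pattern.
print(len(re.findall(r"\\", r"C:\temp")))      # 1

# An escaped '$' matches a dollar sign instead of the end of the line.
print(re.findall(r"\$[0-9]", "cost $5"))       # ['$5']
```

When a whole search term should be taken literally, Python's `re.escape` adds the backslashes automatically.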
[string]
          A non-empty string of characters enclosed in square
          brackets is a one-character regular expression that
          matches any one character in that string. If, however,
          the first character of the string is a `^' (a circumflex
          or caret), the one-character regular expression matches
          any character except NEWLINE and the remaining characters
          in the string. The `^' has this special meaning only if
          it occurs first in the string. The `-' (minus) may be
          used to indicate a range of consecutive ASCII characters;
          for example, [0-9] is equivalent to [0123456789]. The `-'
          loses this special meaning if it occurs first (after an
          initial `^', if any) or last in the string. The `]'
          (right square bracket) does not terminate such a string
          when it is the first character within it (after an
          initial `^', if any); that is, []a-f] matches either `]'
          (a right square bracket) or one of the letters a through
          f inclusive. The four characters `.', `*', `[', and `\'
          stand for themselves within such a string of characters.
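The bracket rules can be checked one character at a time. The sketch below uses Python's `re` module as a stand-in for grep, with a hypothetical helper `chars_matched`. Python differs in two respects: a negated class such as `[^0-9]` also matches NEWLINE, and a backslash stays special inside brackets, so that case is left out.

```python
import re

def chars_matched(char_class, candidates):
    """Return the candidate characters accepted by the bracket expression."""
    return [c for c in candidates if re.fullmatch(char_class, c)]

# A range, and the same range negated with a leading '^'.
print(chars_matched(r"[0-9]", "a5-"))      # ['5']
print(chars_matched(r"[^0-9]", "a5-"))     # ['a', '-']

# ']' placed first is an ordinary member of the set.
print(chars_matched(r"[]a-f]", "]ag"))     # [']', 'a']

# '-' placed last is literal instead of marking a range.
print(chars_matched(r"[a-f-]", "a-g"))     # ['a', '-']

# '.' and '*' stand for themselves inside brackets.
print(chars_matched(r"[.*]", ".*x"))       # ['.', '*']

# '^' is only special when it comes first.
print(chars_matched(r"[a^]", "a^b"))       # ['a', '^']
```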
The following rules may be used to construct regular expressions:
*         A one-character regular expression followed by `*' (an
          asterisk) is a regular expression that matches zero or
          more occurrences of the one-character regular expression.
          If there is any choice, the longest leftmost string that
          permits a match is chosen.
^         A circumflex or caret (^) at the beginning of an entire
          regular expression constrains that regular expression
          to match an initial segment of a line.
$         A currency symbol ($) at the end of an entire regular
          expression constrains that regular expression to match
          a final segment of a line.
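The "longest leftmost" rule and the two anchors can be seen in a small experiment. This is a sketch in Python's `re` module, not grep, and the sample lines are made up. For simple patterns like these, Python's greedy `*` gives the same result as the rule described above.

```python
import re

# Among matches starting at the leftmost possible position, the longest wins.
m = re.search(r"ab*", "xabbbbc abb")
print(m.group(), m.start())                # abbbb 1

# '*' also allows zero occurrences of the preceding character.
print(re.search(r"ab*", "xac").group())    # a

lines = ["error: disk full", "no error", "error"]

# '^' ties the match to the start of the line.
print([s for s in lines if re.search(r"^error", s)])
# ['error: disk full', 'error']

# '$' ties the match to the end of the line.
print([s for s in lines if re.search(r"error$", s)])
# ['no error', 'error']

# Both together require the whole line to match.
print([s for s in lines if re.search(r"^error$", s)])
# ['error']
```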
*         A regular expression (not just a one-character regular
          expression) followed by `*' (an asterisk) is a regular
          expression that matches zero or more occurrences of the
          regular expression. If there is any choice, the longest
          leftmost string that permits a match is chosen.
+         A regular expression followed by `+' (a plus sign) is a
          regular expression that matches one or more occurrences
          of the regular expression. If there is any choice, the
          longest leftmost string that permits a match is chosen.
?         A regular expression followed by `?' (a question mark) is
          a regular expression that matches zero or one occurrences
          of the regular expression. If there is any choice, the
          longest leftmost string that permits a match is chosen.
|         Alternation: two regular expressions separated by `|' or
          NEWLINE match either a match for the first or a match
          for the second.
()        A regular expression enclosed in parentheses matches a
          match for the regular expression.
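These operators can be tried out as follows. The sketch uses Python's `re` module, where `+`, `?`, `|` and `()` are available by default; with POSIX tools they belong to the extended syntax (for example `grep -E` or `egrep`). The helper `full` and the word lists are illustrative only.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# '*' after a group: zero or more repetitions of the whole group.
print(full(r"(ab)*", ["", "ab", "abab", "aba"]))          # ['', 'ab', 'abab']

# '+' requires at least one repetition.
print(full(r"(ab)+", ["", "ab", "abab"]))                 # ['ab', 'abab']

# '?' makes the preceding expression optional.
print(full(r"colou?r", ["color", "colour", "colouur"]))   # ['color', 'colour']

# '|' matches either alternative.
print(full(r"cat|dog", ["cat", "dog", "catdog"]))         # ['cat', 'dog']

# Parentheses limit the alternation to part of the pattern.
print(full(r"gr(a|e)y", ["gray", "grey", "gry"]))         # ['gray', 'grey']
```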
The order of precedence of operators at the same parenthesis
level is `[ ]' (character classes), then `*' `+' `?'
(closures), then concatenation, then `|' (alternation) and
NEWLINE.
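The effect of this precedence order shows up when the same pattern is written with and without parentheses. The following sketch again relies on Python's `re` module, which follows the same order for these operators; the helper `full` and the test words are assumptions for illustration.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Closure binds tighter than concatenation: 'ab*' means a(b*), not (ab)*.
print(full(r"ab*", ["a", "abbb", "abab"]))             # ['a', 'abbb']
print(full(r"(ab)*", ["a", "abbb", "abab"]))           # ['abab']

# Concatenation binds tighter than alternation: 'ab|cd' means (ab)|(cd).
print(full(r"ab|cd", ["ab", "cd", "abd", "acd"]))      # ['ab', 'cd']
print(full(r"a(b|c)d", ["ab", "cd", "abd", "acd"]))    # ['abd', 'acd']

# A character class is a single unit, so '*' repeats the whole class.
print(full(r"[ab]*", ["", "abba", "abc"]))             # ['', 'abba']
```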