Regular Expressions


A regular expression is a way of specifying a pattern so that some strings match the pattern and some strings do not. Parts of the matching pattern can be marked for use in operations such as substitution. This is a powerful tool for processing text, especially when producing text-based reports. Many UNIX utilities use a form of regular expressions as a pattern matching mechanism (for example, egrep) and Perl has adopted this concept, almost as its own.
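As a minimal sketch of matching and substitution, the fragment below tests a string against a pattern, marks part of the pattern with parentheses, and reuses the marked part in a substitution. The sample line and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical report line, invented for this example.
my $line = "total: 42 widgets";

# m// tests whether the string matches; the parentheses mark the digits.
my $count;
if ($line =~ m/(\d+) widgets/) {
    $count = $1;    # $1 holds the text matched inside the parentheses
}

# s/// replaces the matched text; $1 reuses the marked part.
(my $report = $line) =~ s/(\d+) widgets/$1 items/;

print "$count\n";     # 42
print "$report\n";    # total: 42 items
```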

Like arithmetic expressions, regular expressions are made up of a sequence of legal symbols linked with legal operators. This section gathers all of these operators and symbols into one table for easy reference. If you are new to regular expressions, you may find the description in "Perl Overview" informative.

Table 11 lists the components of Perl's regular expressions.

Table 11  Regular Expression Meta-Characters, Meta-Brackets, and Meta-Sequences

Meta-Character
Description
^
This meta-character (the caret) matches the beginning of a string or, if the /m option is used, the beginning of a line. It is one of two pattern anchors; the other anchor is the $.
.
This meta-character will match any single character except for the newline unless the /s option is specified. If the /s option is specified, then the newline will also be matched.
$
This meta-character matches the end of a string (or the position just before a newline at the end of the string) or, if the /m option is used, the end of a line. It is one of two pattern anchors; the other anchor is the ^.
|
This meta-character (called alternation) lets you specify two or more alternatives, any one of which can cause the match to succeed. For instance, m/a|b/ succeeds if the $_ variable contains either the "a" or the "b" character.
*
This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true; thus, .* matches any number of characters.
+
This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true.
?
This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When it follows the *, +, ?, or {n,m} meta-characters and brackets, it means that the regular expression should be non-greedy and match the smallest possible string.
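The meta-characters above can be sketched in a few one-line matches. This is an illustrative fragment rather than a complete program; the sample strings and variable names are invented, and the trailing comments show the value each variable receives.

```perl
use strict;
use warnings;

# Sample strings below are invented for illustration.
my $text = "first line\nsecond line\n";

# ^ anchors at the start of the string; with /m it also anchors at each line.
my @plain = ($text =~ m/^(\w+)/g);     # ("first")
my @multi = ($text =~ m/^(\w+)/mg);    # ("first", "second")

# . skips the newline unless /s is given.
my $dot_plain  = ("a\nb" =~ m/a.b/)  ? 1 : 0;    # 0
my $dot_single = ("a\nb" =~ m/a.b/s) ? 1 : 0;    # 1

# | offers alternatives; ? makes the preceding item optional.
my $alt      = ("cab"   =~ m/a|b/)     ? 1 : 0;  # 1
my $optional = ("color" =~ m/colou?r/) ? 1 : 0;  # 1

# + is greedy by default; a trailing ? makes it match as little as possible.
my $tag = "<b>bold</b>";
my ($greedy) = $tag =~ m/<(.+)>/;      # "b>bold</b"
my ($lazy)   = $tag =~ m/<(.+?)>/;     # "b"
```

The greedy and non-greedy captures differ because .+ first consumes as much as it can and then backs off only far enough to let the closing > match, whereas .+? stops at the first > it can reach.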
Meta-Brackets
Description
()
The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory. See the "Special Variables" chapter for more details.
(?...)
If a question mark immediately follows the left parenthesis, it indicates that an extended mode component is being specified (new to Perl 5).
(?#comment)
Extension: comment is any text.
(?:regx)
Extension: regx is any regular expression, but parentheses are not saved as a backreference.
(?=regx)
Extension: a zero-width positive lookahead. The regx must match at this point for the overall match to succeed, but the text it matches is not consumed and is not returned as part of the match.
(?!regx)
Extension: a zero-width negative lookahead (that is, the negated form of (?=regx)). The overall match succeeds only if regx does not match at this point.
(?options)
Extension: applies the specified options to the pattern, bypassing the need for the options to be specified in the normal way. Valid options are: i (case insensitive), m (treat as multiple lines), s (treat as single line), x (allow whitespace and comments).
{n,m}
The braces let you specify how many times the "thing" immediately to the left should be matched. {n} means that it should be matched exactly n times. {n,} means it must be matched at least n times. {n,m} means that it must be matched at least n times but not more than m times. Write the quantifier without spaces, because older versions of Perl treat a form such as {n, m} as literal text.
[]
The square brackets let you create a character class. For instance, m/[abc]/ will evaluate to true if any of "a", "b", or "c" is contained in $_. For single characters, the square brackets are a more readable alternative to the alternation meta-character.
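The meta-brackets above can be illustrated with a short sketch. The sample strings and variable names are invented for this example, and the trailing comments show the value each variable receives.

```perl
use strict;
use warnings;

# () remembers what it matched; (?:...) groups without remembering.
my @parts = ("2024-01-15" =~ m/(\d{4})-(?:\d{2})-(\d{2})/);    # ("2024", "15")

# (?=...) requires "px" to follow but leaves it out of the match.
my ($before_px) = "width 100px" =~ m/(\d+)(?=px)/;             # "100"

# (?!...) succeeds only where the following text does NOT match.
my ($not_px) = "10px 20em" =~ m/(\d+)(?!px|\d)/;               # "20"

# (?i) turns on case-insensitive matching from inside the pattern.
my $inline = ("PERL" =~ m/(?i)perl/) ? 1 : 0;                  # 1

# {n}, {n,}, and {n,m} bound the number of repetitions.
my $exact    = ("aaa"   =~ m/^a{3}$/)   ? 1 : 0;               # 1
my $at_least = ("aaaaa" =~ m/^a{2,}$/)  ? 1 : 0;               # 1
my $ranged   = ("aaaaa" =~ m/^a{2,4}$/) ? 1 : 0;               # 0

# [abc] matches any one of the listed characters.
my $class = ("xbz" =~ m/[abc]/) ? 1 : 0;                       # 1
```

The negative lookahead also excludes \d so that the pattern cannot succeed by backing off to a shorter run of digits inside "10px".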
Meta-Sequences
Description
\
This meta-character "escapes" the character that follows. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use \$ to avoid Perl's variable interpolation. Use \\ to specify the backslash character in your pattern.
\nnn
Any octal byte (where nnn represents the octal number; this allows any character to be specified by its octal number).
\a
The alarm character (this is a special character that, when printed, produces a warning bell sound).
\A
This meta-sequence represents the beginning of the string. Its meaning is not affected by the /m option.
\b
This meta-sequence represents the backspace character inside a character class; otherwise it represents a word boundary. A word boundary is the spot between word (\w) and non-word (\W) characters. Perl treats the positions just before the start and just after the end of the string as if they held imaginary characters that match \W.
\B
Match a non-word boundary.
\cn
Any control character (where n is the character, for example, \cY for Ctrl+Y).
\d
Match a single digit character.
\D
Match a single non-digit character.
\e
The escape character.
\E
Terminate the \L or \U sequence.
\f
The form feed character.
\G
Match only where the previous m//g left off.
\l
Change the next character to lowercase.
\L
Change the following characters to lowercase until a \E sequence is encountered.
\n
The newline character.
\Q
Quote regular expression meta-characters so that they are matched literally, until the \E sequence is encountered.
\r
The carriage return character.
\s
Match a single whitespace character.
\S
Match a single non-whitespace character.
\t
The tab character.
\u
Change the next character to uppercase.
\U
Change the following characters to uppercase until a \E sequence is encountered.
\v
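The vertical tab character. Note that in Perl 5.10 and later, \v inside a pattern instead matches any vertical whitespace character, which includes the newline as well as the vertical tab.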
\w
Match a single word character. Word characters are the alphanumeric and underscore characters.
\W
Match a single non-word character.
\xnn
Any hexadecimal byte.
\Z
This meta-sequence represents the end of the string, or the position just before a newline at the end of the string. Its meaning is not affected by the /m option.
\$
The dollar character.
\@
The at sign character.
\%
The percent character.
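A few of the meta-sequences above can be sketched as follows. The sample strings and variable names are invented for this example, and the trailing comments show the value each variable receives.

```perl
use strict;
use warnings;

# \b is a word boundary, so "cat" matches only as a whole word.
my $whole = ("a cat sat"   =~ m/\bcat\b/) ? 1 : 0;        # 1
my $inner = ("concatenate" =~ m/\bcat\b/) ? 1 : 0;        # 0

# \w, \s, and \d match word, whitespace, and digit characters.
my ($key, $num) = "port  8080" =~ m/(\w+)\s+(\d+)/;       # ("port", "8080")

# \A is not affected by /m: "two" starts a line, not the string.
my $anchored = ("one\ntwo" =~ m/\Atwo/m) ? 1 : 0;         # 0

# \Q...\E quotes meta-characters, so the dot matches only a literal dot.
my $literal = "a.b";
my $loose   = ("axb" =~ m/^$literal$/)      ? 1 : 0;      # 1
my $quoted  = ("axb" =~ m/^\Q$literal\E$/)  ? 1 : 0;      # 0

# \u and \L...\E change case in the replacement part of s///.
(my $name = "perl REGEX") =~ s/(\w+) (\w+)/\u$1 \L$2\E/;  # "Perl regex"

# \G continues from where the previous m//g stopped.
my $data = "123abc";
my $rest;
if ($data =~ m/\d+/g) {               # leaves the position just after "123"
    if ($data =~ m/\G([a-z]+)/gc) {   # must match exactly at that position
        $rest = $1;                   # "abc"
    }
}

# \$ and \@ match the literal characters without interpolation.
my ($price) = 'cost: $5' =~ m/\$(\d+)/;                   # "5"
my $has_at  = ('user@example.com' =~ m/\w+\@\w+/) ? 1 : 0;  # 1
```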