Perl Pattern Matching: Regular Expressions
May 10, 1999
Although Part 1 of this series surveyed fundamentals of the
Perl language, one such fundamental remains -- pattern
matching. Pattern matching is an integral, but somewhat
complex, aspect of Perl programming; in fact, pattern matching
is the very reason many programmers reach for Perl in
many programming tasks. Nearly every real-world Perl program
you encounter or create will use pattern matching in some
form or another, and so we should get comfortable with
the issue here and now. Like Perl itself, though, pattern
matching is complex enough to write a whole book about
(in fact, there is just such a book), and you do not need
to master every nuance of pattern matching to leverage
its basic power.
Rudimentary pattern matching is probably familiar to
many readers; in MS-DOS or
Unix command lines, for example,
you might request a directory listing for "*.txt"
-- which, of course, means "all filenames which
end with .txt". Similarly, "abc*" would
mean all filenames which begin with "abc". The
concept behind pattern matching, then, is that you construct
a template which is then applied against some data
-- in the above examples the template was "*.txt"
or "abc*" and the data was each filename
in the directory. If a template matches a piece of data,
some action is taken (the full filename is displayed),
while if the template does not match a different action
may be taken (the filename is not displayed). In this example,
the pattern matching syntax is the set of symbols
used to construct the pattern -- here, the asterisk simply
means "zero or more of any character".
Perl provides a powerful but complex method of creating
pattern matching templates, and the syntax used is known
as regular expression syntax. Regular expression
syntax is inherited from the Unix world, and it is not
limited to Perl. Typically, you may see regular expressions
referred to as "regexps", simply because
it's shorter.
The two most common uses for pattern matching in Perl are
conditional matches and substitutions.
In the first, we test to see whether a match against
a piece of data is true (matches) or false (does not match);
in the second, we use the constructed template to perform
a "search and replace" on a matched pattern
within a piece of data.
Perl Pattern Matching: Conditional Matches using Regular Expressions
The general, commonly used syntax for a basic conditional
match is:
$dataVariable =~ /template/ ;
Let's break this down: $dataVariable represents the
piece of data you are matching against; for instance, imagine
that you are checking to see if the user's e-mail address
contains an @ symbol (an extremely rudimentary form of
address validation). The user's e-mail address, then, would
be contained in the variable $dataVariable. Next,
the binding operator =~ tells Perl to apply the pattern on its
right to the variable on its left.
The slashes (/) are used to enclose the regular expression
syntax.
Now, we must construct a template to fit between the slashes.
The template depends entirely on our goal -- what
are we trying to match? It's best to think of this goal
in terms of rules. We're checking that there is
an @ symbol within $dataVariable -- very simply, this could
be stated by the rule "one or more of any characters
followed by an @ symbol followed by one or more of any
characters". You can immediately imagine that this
rule could be quite a bit more specific to perform a
truly sophisticated address validation (for instance, there
should be only one @ symbol), but let's start with this
very simple match.
First, we must consult with the
basic regular expression syntax table
(you may want to print this table). Now, let's translate
the rule we've devised into the proper symbolic representation:
$dataVariable =~ /.+\@.+/ ;
Yes, this is where Perl turns ugly. According to the
regexp syntax table, the single dot followed by the plus
sign (.+) means "any character one or more times".
Next, the @ symbol is escaped with a backslash (\)
simply to ensure that Perl does not misinterpret the
symbol; an unescaped @ can be read as the start of an array
variable to interpolate into the pattern. It is safest, and
sometimes obligatory, to escape symbols inside of a regular
expression when you want to match them literally (for
instance, to match a slash you would have to use \/ --
an escaped slash!). Lastly, the pattern ends with another
sequence of "any character one or more times."
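As a minimal sketch of this first match in action, the snippet below runs the pattern against a few strings. The sub name has_at_sign and the sample addresses are invented for illustration and do not come from the original example.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the data has at least one character,
# then a literal @, then at least one more character.
sub has_at_sign {
    my ($dataVariable) = @_;
    return ( $dataVariable =~ /.+\@.+/ ) ? 1 : 0;
}

# Made-up sample data.
for my $address ('user@example.org', 'no-at-sign.example.org',
                 '@example.org', 'user@') {
    printf "%-24s %s\n", $address,
        has_at_sign($address) ? 'match' : 'no match';
}
```

Only the first string matches: the second has no @ at all, and the last two have nothing on one side of the @, so one of the two ".+" pieces has nothing to match.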
Let's take this match one step further, and define a
stricter pattern. Imagine that this e-mail address we are
testing must end in ".org" or ".net".
Revised rule: "data begins with one or more
alphanumeric characters followed by an @ symbol followed
by one or more alphanumeric characters followed by the
literal sequence ".org" or ".net" followed by the end of
the data string". Whew! I can smell a complex regexp
coming on ... turning back to the syntax table, we see
that there are symbols which refer to classes of data. For
instance, the \w symbol includes all characters
from a-z, A-Z, and 0-9, plus the underscore (_). Also note
the use of parentheses to group a set of alternatives.
$dataVariable =~ /^\w+\@\w+(\.org|\.net)$/i ;
Regular expressions quickly grow beastly and hairy, and
require quite a bit of mental attention. Caffeine helps,
too, although this is not likely to become the theme of
Coke's next ad campaign. The above pattern begins with
a caret symbol (^) which represents the start of the
data (known as an anchor symbol because it matches
a boundary rather than a literal character). Next, the
word class symbol requires one or more alphanumeric
characters, followed by an @ symbol, followed by one or more
alphanumeric characters. Next, parentheses are used
to group together a logical set of either .org
or .net; also notice that the dot has been escaped
because we are looking for a literal dot character. The
dollar sign ($) is another anchor symbol which represents
the end of the data. Finally, notice the "i"
following the closing slash of the pattern. This "i"
is a pattern modifier -- it tells Perl, in this case, to
perform the match case-insensitively. Thus, an otherwise
valid address ending in ".ORG" or ".Org"
or ".org", etc., would be successful. If
the "i" modifier is omitted, only the lowercase endings
".org" and ".net" would match, since Perl is normally
case-sensitive.
The result of this pattern matching operator is simply a
value of true or false; thus, one typically encloses
the pattern matching expression inside of a conditional
statement, such as an if statement:
if ( $dataVariable =~ /^\w+\@\w+(\.org|\.net)$/i )
{ ...statements if match is true... }
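To see the stricter pattern inside a conditional, here is a small sketch that uses the parenthesized alternatives (\.org|\.net). The sub name is_org_or_net and the sample addresses are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the address fits the stricter rule.
sub is_org_or_net {
    my ($dataVariable) = @_;
    if ( $dataVariable =~ /^\w+\@\w+(\.org|\.net)$/i ) {
        return 1;    # statements if match is true
    }
    return 0;        # statements if match is false
}

# Made-up sample addresses.
for my $address ('webmaster@example.org', 'Jane_Doe@Example.NET',
                 'user@example.com', 'first.last@example.org') {
    printf "%-24s %s\n", $address,
        is_org_or_net($address) ? 'accepted' : 'rejected';
}
```

The last address is rejected because the dot in "first.last" is not a \w character, a reminder that this rule is much stricter than real-world e-mail addresses.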
Perl Pattern Matching: Substitutions using Regular Expressions
A substitution is basically like the "search and
replace" function found in text editors and word
processors, leveraged on the power of regular expression
matching syntax. Rather than simply determine whether
a match is true or false, we use substitutions to modify
portions of the original piece of data according to some
template.
To begin with a simple example, we'd like to analyze a
piece of data which is user-submitted information about
their pet, and search-and-replace any instance of the word
"cat" with "feline".
$pet =~ s/\bcat\b/feline/ig ;
The substitution construction is similar to the match
construction we just saw, but extended a bit. First, notice
that an "s" precedes the first slash. This tells
Perl that we're performing a substitution rather than
a match. Following the first slash we construct our
matching pattern -- in other words, the "search for"
part of the template.
We're looking for the word "cat", but only as a
full word -- in other words we don't want to match
"catamaran", "catharsis",
"staccato", and so on. Thus, the word boundary
anchor (\b) symbols surround "cat" in our match,
obliging Perl to find a word boundary on either side of
"cat" (a non-word character such as a space or other symbol,
or the start or end of the data). The next
slash serves to define the end of the pattern, just as we
saw in a conditional match. After this second slash, though,
we include the characters which will replace the matched pattern
-- in other words, the "replace with" part of the
template.
Following the replacement data (the literal string
"feline" in this case, but you could also use Perl
variables here) are two regexp modifiers: the
familiar "i", for a case-insensitive substitution, and
"g", which forces a global substitution.
Without the "g" modifier, the substitution
would only replace the first occurrence of
"cat". Using the global modifier essentially lets us
replace all instances of "cat" within the piece of
data. Note: the substitution operation applies
the changes directly to the variable on its left-hand side
($pet in this example). Depending
on your needs, you may want to apply the substitution
to a copy of the data variable rather than the original.
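The following sketch shows the effect of the "g" modifier and one common way to substitute on a copy. The sample sentence and the variable names $global and $first_only are invented for illustration.

```perl
use strict;
use warnings;

# Made-up sample data.
my $pet = 'My cat chases the other Cat, not the catamaran.';

# Copy $pet into a new variable, then run the substitution on the copy,
# so the original string is left untouched.
(my $global     = $pet) =~ s/\bcat\b/feline/ig;   # every whole-word "cat"
(my $first_only = $pet) =~ s/\bcat\b/feline/i;    # only the first one

print "$pet\n";
print "$global\n";
print "$first_only\n";
```

Here "catamaran" is left alone in both results because no word boundary follows its first three letters.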
In the realm of web interaction, substitutions are often used
in processing user input. Imagine that we have
a search form which accepts multiple keywords, and each
keyword must be separated by a non-word character; typically,
a comma. However, to provide a safety net when interpreting
the user's input, we might want to substitute a comma
for any non-word and non-space character, in case they
accidentally used periods or semicolons or some other symbol
to delimit keywords.
$search =~ s/[^\w ]/,/g ;
This substitution matches the pattern described as "not
containing a word class character or a space".
The caret symbol inside the class specification (the
square brackets) tells Perl to exclude the characters
listed inside the class. In other words, we are substituting
the comma for any character which is not either
a word class character (a-z, A-Z, 0-9, and underscore) or
a space. If we ran this substitution on the string "black
cat,dog*mouse/frog" the resulting string, following
substitution, would be "black cat,dog,mouse,frog".
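Here is a short sketch of the keyword clean-up, followed by a split on the commas. Inside square brackets a | is an ordinary character rather than "or", so this sketch lists only \w and the space in the class. The variable names $cleaned and @keywords are assumptions added for the example.

```perl
use strict;
use warnings;

my $search = 'black cat,dog*mouse/frog';

# Replace anything that is neither a word character nor a space with a comma.
(my $cleaned = $search) =~ s/[^\w ]/,/g;

# Hypothetical next step: break the cleaned string into individual keywords.
my @keywords = split /,/, $cleaned;

print "$cleaned\n";
print scalar(@keywords), " keywords\n";
```

The space survives, so "black cat" stays together as a single keyword.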
Another common use for substitutions is to remove certain
characters from a piece of data; again, typically
user input. We might want to remove all newline characters
from a string of user input:
$userinput =~ s/\n//g ;
Simply, this substitution looks for any newline characters
(\n) and replaces them with nothing (an empty replacement),
meaning they are removed.
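One caveat worth sketching: \n matches only the newline character, while input arriving from some systems ends lines with a carriage return plus a newline (\r\n). The variant with the class [\r\n] is an addition for illustration, not part of the original example, and the sample input is made up.

```perl
use strict;
use warnings;

# Made-up input with one \r\n line ending and one plain \n line ending.
my $userinput = "first line\r\nsecond line\nthird line";

(my $no_newlines = $userinput) =~ s/\n//g;      # removes \n only; the \r stays
(my $no_breaks   = $userinput) =~ s/[\r\n]//g;  # removes both \r and \n

printf "original: %d characters\n", length $userinput;
printf "after s/\\n//g: %d characters\n", length $no_newlines;
printf "after s/[\\r\\n]//g: %d characters\n", length $no_breaks;
```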
As you can see, constructing regular expression templates,
whether for conditional matches or substitutions, is
very powerful. But we've only touched on a small part of
regexp syntax (albeit the most common). Building and
debugging regular expressions is a craft to some, a puzzle
to others, and a great source of heartache and headache
to many. If you have the curiosity and stamina to learn
more about regular expressions and pattern matching in
Perl, check out both the
Regular Expression man page and the
Regular Expression FAQ.
Three oft-used substitutions worth clipping into your
regexp toolbox lead us out of this chapter.
Strip HTML tags from a string:
$string =~ s/<([^>]|\n)*>//g ;
Strip leading spaces from a string:
$string =~ s/^\s+// ;
Strip trailing spaces from a string:
$string =~ s/\s+$// ;
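The three substitutions can be chained, as in this rough sketch. The sub name tidy and the sample markup are invented for illustration. Note that the tag pattern is a simple approximation: it treats everything from a < to the next > as a tag, so markup that contains a > inside an attribute value would not be stripped cleanly.

```perl
use strict;
use warnings;

# Hypothetical helper combining the three substitutions.
sub tidy {
    my ($string) = @_;
    $string =~ s/<([^>]|\n)*>//g;   # strip HTML tags
    $string =~ s/^\s+//;            # strip leading whitespace
    $string =~ s/\s+$//;            # strip trailing whitespace
    return $string;
}

# Made-up sample data.
my $raw = "  <b>Hello</b>, <a href=\"index.html\">world</a>!  \n";
print tidy($raw), "\n";
```

Because \s matches tabs and newlines as well as spaces, the last two substitutions remove any kind of leading or trailing whitespace.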
The Perl You Need to Know