Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Perl Pattern Matching: Regular Expressions

May 10, 1999

Although Part 1 of this series surveyed fundamentals of the Perl language, one such fundamental remains -- pattern matching. Pattern matching is an integral, but somewhat complex, aspect of Perl programming; in fact, pattern matching is the very basis of inspiration for using Perl in many programming tasks. Nearly every real-world Perl program you encounter or create will use pattern matching in some form or another, and so we should get comfortable with the issue here and now. Like Perl itself, though, pattern matching is complex enough to write a whole book about (in fact, there is just such a book), and you do not need to master every nuance of pattern matching to leverage its basic power.

Rudimentary pattern matching is probably familiar to many readers; in MS-DOS or Unix command lines, for example, you might request a directory listing for "*.txt" -- which, of course, means "all filenames which end with .txt". Similarly, "abc*" would mean all filenames which begin with "abc". The concept behind pattern matching, then, is that you construct a template which is then applied against some data -- in the above examples the template was "*.txt" or "abc*" and the data was each filename in the directory. If a template matches a piece of data, some action is taken (the full filename is displayed), while if the template does not match a different action may be taken (the filename is not displayed). In this example, the pattern matching syntax is the set of symbols used to construct the pattern -- here, the asterisk simply means "zero or more of any character".

Perl provides a powerful but complex method of creating pattern matching templates, and the syntax used is known as regular expression syntax. Regular expression syntax is inherited from the UNIX world, and it is not limited to Perl. Typically, you may see regular expressions referred to as "regexps", simply because it's shorter.
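
To connect the two ideas, here is a minimal sketch of how the two shell-style templates above might be written as Perl regular expressions. The filenames are invented for illustration, and the ^ and $ symbols used here are anchors that are explained later in this article.

```perl
use strict;
use warnings;

# Hypothetical filenames, invented for this illustration.
my @files = ('notes.txt', 'abc.doc', 'abcdef.txt', 'readme');

# Shell "*.txt" roughly corresponds to "ends with a literal .txt".
my @txt = grep { /\.txt$/ } @files;

# Shell "abc*" roughly corresponds to "begins with abc".
my @abc = grep { /^abc/ } @files;

print "txt: @txt\n";
print "abc: @abc\n";
```

Note that the dot must be escaped in the regular expression, because an unescaped dot means "any character" rather than a literal period.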

The two most common uses for pattern matching in Perl are conditional matches and substitutions. In the first, we test to see whether a match against a piece of data is true (matches) or false (does not match); in the second, we use the constructed template to perform a "search and replace" on a matched pattern within a piece of data.

Perl Pattern Matching: Conditional Matches using Regular Expressions

The general, commonly used syntax for a basic conditional match is:

$dataVariable =~ /template/ ;

Let's break this down: $dataVariable represents the piece of data you are matching against; for instance, imagine that you are checking to see if the user's e-mail address contains an @ symbol (an extremely rudimentary form of address validation). The user's e-mail address, then, would be contained in the variable $dataVariable. Next, the binding operator =~ tells Perl to apply the pattern that follows to that variable. The slashes (/) enclose the regular expression itself.

Now, we must construct a template to fit between the slashes. The template depends entirely on our goal -- what are we trying to match? It's best to think of this goal in terms of rules. We're checking that there is an @ symbol within $dataVariable -- very simply, this could be stated by the rule "one or more of any character, followed by an @ symbol, followed by one or more of any character". You can immediately imagine that this rule could be quite a bit more specific to perform a truly sophisticated address validation (for instance, there should be only one @ symbol), but let's start with this very simple match.

First, we must consult with the basic regular expression syntax table (you may want to print this table). Now, let's translate the rule we've devised into the proper symbolic representation:

$dataVariable =~ /.+\@.+/ ;

Yes, this is where Perl turns ugly. According to the regexp syntax table, the single dot followed by the plus sign (.+) means "any character one or more times". Next, the @ symbol is escaped with a backslash (\) so that Perl does not mistake it for the start of an array variable name and try to interpolate that array into the pattern. It is safest, and sometimes obligatory, to escape symbols inside a regular expression when you want to match them literally (for instance, to match a slash you would have to use \/ -- an escaped slash!). Lastly, the pattern ends with another sequence of "any character one or more times".
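
As a rough sketch, the simple match can be tried against a few strings to see what it accepts. The sample addresses and the %has_at hash are assumptions made up for this example.

```perl
use strict;
use warnings;

# Sample inputs invented for this illustration.
my @candidates = ('pat@example.org', 'no-at-sign.example.org', '@example.org');

my %has_at;
for my $dataVariable (@candidates) {
    # .+ requires at least one character on each side of the @ symbol.
    $has_at{$dataVariable} = ( $dataVariable =~ /.+\@.+/ ) ? 1 : 0;
    printf "%-24s %s\n", $dataVariable, $has_at{$dataVariable} ? 'match' : 'no match';
}
```

The last sample fails because nothing precedes its @ symbol, so the leading .+ has no character to consume.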

Let's take this match one step further, and define a stricter pattern. Imagine that this e-mail address we are testing must end in ".org" or ".net". Revised rule: "data begins with one or more alphanumeric characters followed by an @ symbol followed by one or more alphanumeric characters followed by the literal sequence '.net' or '.org' followed by the end of the data string". Whew! I can smell a complex regexp coming on ... turning back to the syntax table, we see that there are symbols which refer to classes of data. For instance, the \w symbol includes all characters from a-z and A-Z and 0-9 and the underscore (_). Also note the use of parentheses to group a set of alternatives, which are separated by the vertical bar (|).

$dataVariable =~ /^\w+\@\w+(\.org|\.net)$/i ;

Regular expressions quickly grow beastly and hairy, and require quite a bit of mental attention. Caffeine helps, too, although this is not likely to become the theme of Coke's next ad campaign. The above pattern begins with a caret symbol (^) which represents the start of the data (known as an anchor symbol because it matches a boundary rather than a literal character). Next, the word class symbol with a plus sign (\w+) requires one or more alphanumeric characters, followed by an @ symbol, followed by one or more alphanumeric characters. Next, parentheses are used to group together a logical set of either .org or .net; also notice that the dot has been escaped because we are looking for a literal dot character. The dollar sign ($) is another anchor symbol which represents the end of the data. Finally, notice the "i" following the closing slash of the pattern. This "i" is a pattern modifier -- it tells Perl, in this case, to perform the match case-insensitively. Thus, an otherwise valid address ending in ".ORG" or ".Org" or ".org", etc., would be successful. If the "i" modifier is omitted, only the lowercase ".org" or ".net" would match, since Perl is normally case-sensitive.

The result of this pattern matching operator is simply a value of true or false; thus, one typically encloses the pattern matching expression inside of a conditional statement, such as an if statement:

if ( $dataVariable =~ /^\w+\@\w+(\.org|\.net)$/i )
 {
   # ...statements to run if the match is true...
 }
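
A minimal sketch of the stricter match in use, with parentheses grouping the two alternatives, might look like the following. The helper name looks_valid and the sample addresses are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the stricter one from the text.
sub looks_valid {
    my ($address) = @_;
    return $address =~ /^\w+\@\w+(\.org|\.net)$/i ? 1 : 0;
}

# Sample addresses invented for this illustration.
for my $address ('pat@example.org', 'PAT@EXAMPLE.NET',
                 'pat@example.com', 'pat.smith@example.org') {
    printf "%-24s %s\n", $address, looks_valid($address) ? 'accepted' : 'rejected';
}
```

The last sample is rejected because \w does not include the dot, which shows how quickly a "strict" pattern can turn away legitimate addresses. Real address validation needs a considerably more permissive rule than this one.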

Perl Pattern Matching: Substitutions using Regular Expressions

A substitution is basically like the "search and replace" function found in text editors and word processors, leveraged on the power of regular expression matching syntax. Rather than simply determine whether a match is true or false, we use substitutions to modify portions of the original piece of data according to some template.

To begin with a simple example, we'd like to analyze a piece of data which is user-submitted information about their pet, and search-and-replace any instance of the word "cat" with "feline".

$pet =~ s/\bcat\b/feline/ig ;

The substitution construction is similar to the match construction we just saw, but extended a bit. First, notice that an "s" precedes the first slash. This tells Perl that we're performing a substitution rather than a match. Following the first slash we construct our matching pattern -- in other words, the "search for" part of the template.

We're looking for the word "cat", but only as a full word -- in other words we don't want to match "catamaran", "catharsis", "staccato", and so on. Thus, the word boundary anchor (\b) symbols surround "cat" in our match, obliging Perl to find a word boundary on either side of "cat": either a non-word character (such as a space or other symbol) or the start or end of the data. The next slash serves to define the end of the pattern, just as we saw in a conditional match. After this second slash, though, we include the characters which will replace the matched pattern -- in other words, the "replace with" part of the template.

Following the replacement data (the literal string "feline" in this case, but you could also use Perl variables here) are two regexp modifiers: the familiar "i", for a case-insensitive substitution, and "g", which forces a global substitution. Without the "g" modifier, the substitution would only replace the first occurrence of "cat". Using the global modifier lets us replace all instances of "cat" within the piece of data. Note: the substitution operation applies the changes directly to the variable on the left-hand side of the =~ operator (here, $pet). Depending on your needs, you may want to apply the substitution to a copy of the data variable rather than the original.
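
One common way to work on a copy, sketched here under the assumption that the original text must be preserved, is to assign and substitute in a single statement. The sample sentence and the names $all and $first are invented for this example.

```perl
use strict;
use warnings;

# Sample text invented for this illustration.
my $pet = 'My cat chased another Cat onto the catamaran.';

# Copy first, then substitute on the copy, so $pet keeps its original value.
( my $all = $pet ) =~ s/\bcat\b/feline/ig;

# Without the "g" modifier, only the first match is replaced.
( my $first = $pet ) =~ s/\bcat\b/feline/i;

print "$all\n";
print "$first\n";
print "$pet\n";
```

The parentheses matter: they make the assignment happen first, so that the substitution is bound to the new variable rather than to $pet.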

In the realm of web interaction, substitutions are often used in processing user input. Imagine that we have a search form which accepts multiple keywords, and each keyword must be separated by a non-word character; typically, a comma. However, to provide a safety net when interpreting the user's input, we might want to substitute a comma for any non-word and non-space character, in case they accidentally used periods or semicolons or some other symbol to delimit keywords.

$search =~ s/[^\w ]/,/g ;

This substitution matches the pattern described as "any single character that is not a word class character or a space". The caret symbol inside the class specification (the square brackets) tells Perl to exclude the characters listed inside the class. In other words, we are substituting a comma for any character which is not either a word class character (a-z, A-Z, 0-9, and underscore) or a space. Note that no vertical bar is needed between the items in a class; inside square brackets, a | is just a literal character. If we ran this substitution on the string "black cat,dog*mouse/frog" the resulting string, following substitution, would be "black cat,dog,mouse,frog".
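
As a rough sketch, the cleanup can be followed by a split on the comma to obtain the individual keywords. The split step and the @keywords array are assumptions added for illustration. The class here lists only \w and a space, because inside square brackets a | would be a literal pipe character rather than "or".

```perl
use strict;
use warnings;

my $search = 'black cat,dog*mouse/frog';   # the sample string from the text

# Replace anything that is neither a word character nor a space with a comma.
$search =~ s/[^\w ]/,/g;

# Assumption for illustration: the cleaned string is then split into keywords.
my @keywords = split /,/, $search;

print "$search\n";
print scalar(@keywords), " keywords\n";
```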

Another common use for substitutions is to remove certain characters from a piece of data; again, typically user input. We might want to remove all newline characters from a string of user input:

$userinput =~ s/\n//g ;

Simply, this substitution looks for any newline characters (\n) and replaces them with nothing (an empty replacement), meaning they are removed.
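
Input submitted from a web form often ends its lines with a carriage return plus a newline (\r\n), so a pattern that removes only \n can leave stray \r characters behind. The following sketch, using an invented sample string and the made-up names $no_newlines and $no_breaks, compares the two approaches.

```perl
use strict;
use warnings;

# Sample input invented for this illustration, mixing \r\n and \n line endings.
my $userinput = "line one\r\nline two\nline three";

# Removes \n only; the \r from the first line ending remains.
( my $no_newlines = $userinput ) =~ s/\n//g;

# A character class removes both carriage returns and newlines.
( my $no_breaks = $userinput ) =~ s/[\r\n]//g;

print length($userinput), " ", length($no_newlines), " ", length($no_breaks), "\n";
```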

As you can see, constructing regular expression templates whether for conditional matches or substitutions is very powerful. But, we've only touched on a small amount of regexp syntax (albeit the most common). Building and debugging regular expressions is a craft to some, a puzzle to others, and a great source of heartache and headache to many. If you have the curiosity and stamina to learn more about regular expressions and pattern matching in Perl, check out both the Regular Expression man page and the Regular Expression FAQ.

Three oft-used substitutions worth clipping into your regexp toolbox lead us out of this chapter.

Strip HTML tags from a string
$string =~ s/<([^>]|\n)*>//g ;
Strip leading spaces from a string
$string =~ s/^\s+// ;
Strip trailing spaces from a string
$string =~ s/\s+$// ;
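
As a closing sketch, the three substitutions can be combined into one small routine. The name clean_text and the sample markup are hypothetical. The tag-stripping pattern is deliberately simple and can be fooled by a > inside an attribute value or an HTML comment, so treat it as a convenience rather than a full HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper that applies the three substitutions in sequence.
sub clean_text {
    my ($string) = @_;
    $string =~ s/<([^>]|\n)*>//g;   # strip HTML tags
    $string =~ s/^\s+//;            # strip leading whitespace
    $string =~ s/\s+$//;            # strip trailing whitespace
    return $string;
}

# Sample markup invented for this illustration.
my $cleaned = clean_text("  <b>Hello</b>, <a href=\"x.html\">world</a>!\n ");
print "[$cleaned]\n";
```

Because the routine works on its own copy of the argument, the caller's original string is left untouched.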
