    Perl Developer's Dictionary (Part 2 of 2)
     

    Date Published: 2002-08-28

    Anchors, Grouping, and Backreferences (See Part 1)

    Modification Characters

    Usage

    \Q \E \L \l \U \u

    Description

    The modification characters used in string literals (in an interpolated context) are available in regular expressions as well. See the entry on modification characters for a list.

    Understand that these "metacharacters" aren't really metacharacters at all. They work because the regular expression match operators interpolate the pattern when perl first examines it. Just as \L and \U are only effective in double-quoted strings, they're only effective in a regular expression during that first interpolation pass, before the regex engine ever sees the pattern.

    $foo='\U';
    if (m/${foo}blah/) {  } # Won't look for BLAH, but "Ublah"
    if (m/\Ublah/) {  }     # Will look for BLAH
    if (m/(a)\U\1/) { }     # Won't look for aA as you might hope

    Most useful among these in regular expressions is the \Q modifier. The \Q modifier is used to quote any metacharacters that follow. When accepting something that will be used in a pattern match from an untrusted source, it is vitally important that you not put the pattern into the regular expression directly. Take this small sample:

    # A CGI form is a _VERY_ untrustworthy source of info.
    
    use CGI qw(:form :standard);
    print header();
    $pat=param("SEARCH");
    # ...sometime later...
    if (/$pat/) {
    }

    The trouble with this is that handing $pat to the regular expression engine opens up your system to running code that's determined solely by the user. If the user is malicious, he can:

    • Specify a regular expression that will never terminate (endless backtracking).

    • Specify a regular expression that will use indeterminate amounts of memory.

    • Specify a regular expression that can run perl code with a (?{}) pattern.

    The third one is probably the most dangerous, so perl refuses to run (?{}) code in an interpolated pattern unless a use re 'eval' pragma is in effect or the interpolated value was itself compiled with the qr operator.

    The \Q modifier will cause perl to treat the contents of the pattern literally until an \E is encountered.
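
    As a rough illustration of that literal treatment, here is a minimal sketch. The input value 'a.c' is a made-up stand-in for untrusted data, and quotemeta is shown as the function form of \Q.

```perl
use strict;
use warnings;

# Hypothetical user input that happens to contain a regex metacharacter.
my $pat = 'a.c';

# Interpolated as-is, the "." is a wildcard, so "abc" matches.
my $unquoted = ('abc' =~ /$pat/) ? 1 : 0;

# Wrapped in \Q...\E, the "." must match a literal dot.
my $quoted_abc = ('abc' =~ /\Q$pat\E/) ? 1 : 0;
my $quoted_dot = ('a.c' =~ /\Q$pat\E/) ? 1 : 0;

# quotemeta() is the function form of \Q.
my $escaped = quotemeta($pat);

print "unquoted=$unquoted quoted_abc=$quoted_abc quoted_dot=$quoted_dot escaped=$escaped\n";
# prints: unquoted=1 quoted_abc=0 quoted_dot=1 escaped=a\.c
```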

    Example Listing 3.4

    # A re-working of the inline sample above a little
    #   more safe.*  A form parameter "SEARCH" is used to
    #   check the GECOS (often "name") field.
    # *[Of course, software is only completely "safe" when
    #   it's not being used. --cap]
    
    use CGI qw(:form :standard);
    
    print header(-type => 'text/plain');
    $pat=param("SEARCH");
    
    push(@ARGV, "/etc/passwd");
    while(<>) {
       ($name)=(split(/:/, $_))[4];
       if ($name=~/\Q$pat\E/) {
           print "Yup, ${pat}'s in there somewhere!\n";
       }
    }

    See Also

    Modification characters in this book


    Anchors, Grouping, and Backreferences

    Grouping

    Usage

    (pattern)
    (?:pattern)
    (?=pattern)
    (?!pattern)
    (?<=pattern)
    (?<!pattern)
    (?#text)
    internal modifiers: i, m, s, x

    Description

    Parentheses in regular expressions are used for grouping subpatterns within the larger pattern. This can be done to provide:

    • Limited action to quantifiers: /\bba(na)+na\b/ # "But I don't know when to stop"

    • Limited scope to alternation: /my favorite stooge is (Moe|Curly|Larry)\./

    • Limited range for modifiers: /The dog made a ((?i)spot)\./

    • Introduction for assertions: /Jimmy (?=Buffet|the Greek)/

    • Capturing for backreferences: /\b([a-z]+)\s+\1\b/

    The ability to capture subpatterns for backreferences is covered in the entry on backreferences. Some of the examples in this section assume prior knowledge of backreferences.

    Simple parentheses (pattern) and the (?:pattern) form allow you to group a subpattern of a regular expression. Once grouped, quantifiers can be applied against just that portion of the regular expression:

    m/\w+:            # Match the first field (it's required)
      (?:[^:]*:){3}   # Match (and discard) the next three fields
      ([^:]*)        # Match (and capture) the next field
    /x;

    Also, alternation can be limited so that when an alternation symbol is seen, exactly what's being alternated against can be determined:

    m/oats|peas|beans$/;  # oats, peas or beans (but beans at the end)
    m/(oats|peas|beans)$/;# Any of oats, peas or beans only at the end

    Internal modifiers can have their scope limited (in fact, internal modifiers can only be specified with parentheses). So in the following:

    m/Tony\s(?i:the)\sTiger/;

    the phrase will be matched only if the capitalization is just as it appears; however, the word "the" is matched without regard to case. (This could have been accomplished with [Tt][Hh][Ee] as well.)

    The difference between () and (?:) is that the (pattern) form captures the subpattern it matched, whereas the (?:pattern) form does not: it provides grouping without the capturing side effect. This makes a difference if you're using backreferences. See the backreferences entry.
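
    The effect on register numbering can be sketched as follows. The passwd-style record is invented for the example; only the two grouping forms come from the text above.

```perl
use strict;
use warnings;

# A made-up passwd-style record; the fifth field is the GECOS field.
my $line = 'root:x:0:0:Super User:/root:/bin/sh';

# Non-capturing group: the repeated fields are skipped, so GECOS lands in $1.
my ($gecos) = $line =~ /^\w+:(?:[^:]*:){3}([^:]*)/;

# Capturing group: the repeated group takes register 1 (it holds the text of
# its last repetition), which pushes GECOS out to register 2.
my ($last_skipped, $gecos2) = $line =~ /^\w+:([^:]*:){3}([^:]*)/;

print "$gecos | $last_skipped | $gecos2\n";
# prints: Super User | 0: | Super User
```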

    The constructs (?=pattern), (?!pattern), (?<=pattern), and (?<!pattern) are all used to "look around" the current match to see what either precedes or follows it. They are zero-width assertions, meaning that the subpattern contained within is only used to look ahead or look behind the current point of the match to see whether something is true or not.

    Pattern

    Name

    (?=pattern)

    Positive lookahead. Is only true if pattern is seen after the current point of the match. So /Abraham\s(?=Simpson|Lincoln)/ matches only if Abraham is followed by Lincoln or Simpson. The benefit is that the last name is not absorbed by the match. See the later examples.

    (?!pattern)

    Negative lookahead. True only if pattern is not seen after the current point of the match. So if /^(?:\d{1,3}\.){3}\d{1,3}$/ matches an IP address (and some bad ones too, such as 999.888.777.666), /^(?!(?:0+\.){3}0+)(?:\d{1,3}\.){3}\d{1,3}$/x matches those same IP addresses, but disallows 0.0.0.0.

    (?<=pattern)

    Positive lookbehind. This asserts that pattern was seen before the current point in the match. /(?<=bar)foo/ matches only if foo was directly preceded by bar. There is a restriction on this subpattern: it must be fixed-width, so /(?<=bar.*)foo/ isn't allowed.

    (?<!pattern)

    Negative lookbehind. True only if pattern was not seen before the current point in the match. /(?<!bar)foo/ is true only if foo was not directly preceded by bar. Like positive lookbehind, the subpattern must be fixed-width.
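
    A small sketch of the four assertions in use. The sample strings and variable names are made up for illustration; the lookahead patterns are the ones from the table above.

```perl
use strict;
use warnings;

# Positive lookahead: the last name must follow, but it is not consumed.
my ($first) = 'Abraham Lincoln' =~ /(Abraham\s)(?=Simpson|Lincoln)/;

# Negative lookahead: the loose IP pattern from the table, minus 0.0.0.0.
my $ip      = qr/^(?!(?:0+\.){3}0+)(?:\d{1,3}\.){3}\d{1,3}$/;
my $ok      = ('192.168.0.1' =~ $ip) ? 1 : 0;
my $zeroes  = ('0.0.0.0'     =~ $ip) ? 1 : 0;

# Lookbehind: digits directly after a "$", and digits that are not.
my $s            = 'price: $42, cost: 17';
my @after_dollar = $s =~ /(?<=\$)(\d+)/g;
my @no_dollar    = $s =~ /(?<![\$\d])(\d+)/g;

print "[$first] $ok $zeroes @after_dollar @no_dollar\n";
# prints: [Abraham ] 1 0 42 17
```

    Note the \d inside the negative lookbehind: without it, the 2 of 42 would qualify, because the character before it is a digit rather than a dollar sign.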

    The (?#text) construct is used to place comments in the body of a regular expression. For example, if the expression is long and convoluted, you might say:

    /\D\d{5}(-\d{4})?(?# ZIP+4 optional)\D/

    Because perl needs the ) to know when to terminate the comment, you cannot include a literal ) in the comment itself.

    A cleaner way to include comments within a regular expression is to use the /x modifier to the expression.

    The internal modifiers are modifiers (such as /i, /s, /x) that are applied to only a portion of the regular expression. They are specified with the non-capturing parentheses mechanism, either by inserting the modifiers after the ? and before the colon, or by placing them alone after the ? inside their own parentheses:

    (?modifiers:pattern) (?modifiers)

    To add a modifier to just a portion of the expression, wrap that portion in the first form:

    if (/Linus Torvalds wrote L(?i:inux)/) { }

    This match is case sensitive except the letter-sequence inux, which can be uppercase, lowercase or a mix. A modifier can be removed by preceding it with a dash:

    (?modifiers_to_add - modifiers_to_remove:pattern)

    For example,

    if (/(?-i:Linus) wrote Linux/i) { }

    The preceding match is not case sensitive, except the portion matching Linus.
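
    Both forms can be checked with a short sketch. The two patterns are the ones shown above; the test strings and variable names are made up.

```perl
use strict;
use warnings;

# Case sensitive except for "inux".
my $partial = qr/Linus Torvalds wrote L(?i:inux)/;

# Case insensitive except for "Linus".
my $except  = qr/(?-i:Linus) wrote Linux/i;

my @results = (
    ('Linus Torvalds wrote LINUX' =~ $partial) ? 1 : 0,  # "inux" may be any case
    ('linus torvalds wrote Linux' =~ $partial) ? 1 : 0,  # the rest may not
    ('Linus WROTE LINUX'          =~ $except)  ? 1 : 0,  # everything but "Linus" is loose
    ('LINUS wrote Linux'          =~ $except)  ? 1 : 0,  # "Linus" must be exact
);

print "@results\n";
# prints: 1 0 1 0
```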

    Alternation

    Usage

    pat|pat

    Description

    The | metacharacter is used to make the regular expression engine choose between two potential matches; this is called alternation. The | should be placed between potential choices within the pattern:

    /cat|dogfish/

    This would match either cat or dogfish. The alternation extends outward from the | to the end of the innermost enclosing parentheses or to another alternation symbol.

    /(cat|dog)fish/;       # Either "catfish" or "dogfish"
    /(cat|dog|sword)fish/;  # catfish, dogfish or swordfish

    The alternation extends outward to include any anchors or zero-width assertions that are within the enclosed scope:

    s/^\s+|\s+$//g;  # Remove leading/trailing whitespace

    An empty alternative can be specified, which allows you to choose between a few choices or nothing at all:

    /(cat|dog|sword|)fish/;  # catfish, dogfish or swordfish or just fish

    Perl's regexp engine will process the alternations left-to-right and select the first one that matches. Thus, if you have an alternation that is the prefix of a following alternation, or an empty alternation, it should be placed at the end:

    /paper|paperbacks|paperweight/;   # The last two will never match
    /(paperbacks|paperweight|paper)/; # Better!
    /paper(backs|weight)?/;           # Even better still!
    
    /(|bugle|bugs|bugaboo)/;        # The empty choice will always match
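
    A minimal sketch of the left-to-right rule, capturing what each ordering actually picks up from the same input (the variable names are invented for the example):

```perl
use strict;
use warnings;

my $word = 'paperweight';

# The prefix alternative comes first, so it wins.
my ($prefix_first) = $word =~ /(paper|paperbacks|paperweight)/;

# Longer alternatives first: the full word is found.
my ($longest_first) = $word =~ /(paperbacks|paperweight|paper)/;

# Factored form: a common stem plus an optional suffix.
my ($factored) = $word =~ /(paper(?:backs|weight)?)/;

print "$prefix_first $longest_first $factored\n";
# prints: paper paperweight paperweight
```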

    Alternation isn't always the best choice for determining whether a list of things will match. Because of the way that Perl's regex engine works, a list of alternations such as the following:

    /than|that|thaw|them|then|they|thin|this|thud|thug|thus/

    will run much slower than if the match is re-written as follows:

    /th(?:an|at|aw|em|en|ey|in|is|ud|ug|us)/

    The regex engine can't scan through the alternations and notice the obvious: the program is trying to match four-letter words that begin with th—it's not that smart (yet). By giving it a hint, that a literal th will need to match before the alternations need to be searched, the speedup time is tremendous. In this case, it is nearly 25 times faster for a large volume of text.

    So avoid alternation for simple cases similar to:

    m/\b\w(a|e|i|o|u)\w\b/;  # 3 letter words, vowel in the middle

    when a character class ([aeiou]) or another construct would work better.


    See Also

    character classes in this book


    Capturing and Backreferences

    Usage

    ()
    \1 \2 \3 \n
    $1 $2 $3 $n

    Description

    The parentheses in regular expressions, in addition to grouping and the other functions mentioned in the grouping entry, also have a side effect: patterns matched within parentheses are stored and can be used later in the expression, or later in the program outside of the expression. This storage of matched patterns is called capturing, and references to the captured values are called backreferences.

    Each set of capturing parentheses encountered takes the portion of the target string matched by the pattern and stores it in a register. The registers are numbered 1, 2, 3, and so on up to the number of capturing parentheses in the entire pattern.

    During the match, any captured values are available by referring to the proper register with \register. This allows you to refer to something previously matched later in the pattern:

    /(\w+)\s\1/;  # Look for repeated words, separated by a space.

    In the preceding example, (\w+) captures word characters into the first capture register, and \1 looks for whatever word was stored there after the whitespace character.

    After the match has completed (or during the substitution phase with the s/// operator), the captured values will appear in the variables named $1, $2, $3, and so on up to the number of parentheses captured in the match.

    if ( s/(\w+)\s\1/$1/ ) {  # Remove repeated words, separated by a space.
       print "Removed duplicate word $1\n";
    }

    In this example, the backreference \1 is used to find the repeated word as shown previously. During the substitution, $1 is used to put back just one instance of the repeated word. After the match, $1 is still set to the captured value during the match.

    Some notes about the variables $1, $2, and so on are as follows:

    • They're dynamically scoped. So, given the following code:

      $_="She loves you yeah yeah yeah";
      {
         if ( s/(\w+)\s\1/$1/ ) {
             $match=1;
         }
      }
      print "Removed a $1" if $match;

      Because the match occurred within a block of its own (the bare block), $1's value isn't valid outside of that block. Treat them as though they had been declared with local.

    • They're only set if the match succeeds. If the match fails, the values in them are left over from some earlier match and shouldn't be relied on. A very common programming mistake is to assume that the match succeeded and then proceed using $1 and company with whatever values they happen to have:

      @addr=(q{From: Bill Murray <bmurray@ttsd.k12.or.us>},
            q{From: Clinton Pierce <clintp@geeksalad.org>},
            q{From: Chris Doyle him@bootlegtoys.com},
            q{From: Shelley.Robertson@samspublishing.com},);
      for(@addr) {
         m/From: (\w+ \w+) <?([\w@.]+)>?/;
         print "You got mail from $1\n";
      }

      In this example, because the last bit of data isn't as well-formed as the others, the match actually fails, but the program goes blindly on using $1.

    • You cannot use $1, $2, $3, and so on in the left-hand portion of the substitution operator. Notice this attempt:

      s/(\w+)\s$1/$1/;  # WRONG

      The $1 is scanned as a regular variable name when the regular expression is first parsed. It will have the old value of $1 (if any) from a previous match.

    • Multiple sets of parentheses will cause the capture registers to be used in the order encountered. If the parentheses nest, each opening ( assigns the next register.

      $name="James T. Kirk";
      if ($name=~m/^((\w+)\s(\w+\.?)?\s(\w+))$/) {
         print "First: $2\n";  # First name
         print "Middle: $3\n"; # Middle name/initial
         print "Last: $4\n";   # Last name
         print "Whole: $1\n";  # Whole name
      }
      }
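
    The first two notes above can be sketched in one short script. This is an illustration only: the "outer" match and the variable names are invented, and the address list is the one used in the note on failed matches.

```perl
use strict;
use warnings;

# 1. Dynamic scope: a match inside a block does not leak its $1 outward.
'outer' =~ /(out)/;                  # sets $1 in the enclosing scope
my $inside;
{
    my $text = 'She loves you yeah yeah yeah';
    $inside = $1 if $text =~ /(\w+)\s\1/;   # $1 is "yeah" in here
}
my $after_block = $1;                # the enclosing scope's $1 again

# 2. Only use $1 when the match that sets it has succeeded.
my @addr = (
    'From: Bill Murray <bmurray@ttsd.k12.or.us>',
    'From: Clinton Pierce <clintp@geeksalad.org>',
    'From: Chris Doyle him@bootlegtoys.com',
    'From: Shelley.Robertson@samspublishing.com',
);
my (@names, @skipped);
for my $line (@addr) {
    if ($line =~ /From: (\w+ \w+) <?([\w\@.]+)>?/) {
        push @names, $1;             # safe: the match succeeded
    }
    else {
        push @skipped, $line;        # no stale $1 is reported
    }
}

print "inside=$inside after=$after_block\n";
print join(', ', @names), "\n";
print scalar(@skipped), " line(s) skipped\n";
# prints:
#   inside=yeah after=out
#   Bill Murray, Clinton Pierce, Chris Doyle
#   1 line(s) skipped
```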

    Example Listing 3.5

    # Read a file in the format
    #       key=value
    #       key2=value2
    #   and assign the data to %conf appropriately
    # ** This is done with a clever code trick in the
    #    match operator entry.  See TIMTOWTDI in action!
    
    open(CONFIG, "config") || die "Can't open config: $!";
    while(<CONFIG>) {
       if (m/^([^=]+)=(.*)$/) {  # Look for FOO=BAR
           $conf{$1}=$2;
       }
    }

    See Also

    local, dynamic scope, match operator, Regular Expression Special Variables, and Character shorthand in this book


    Line Anchors

    Usage

    \A ^ \z \Z $

    Description

    Anchors are used within regular expression patterns to describe a location. Sometimes the location is relative to something else (\b) or the location can be absolute (\A). Because they don't match an actual character but make an assertion about the state of the match, they also are called zero-width assertions.

    The first anchor (appropriately) is ^, which causes the match to happen at the beginning of the string. So,

    if (m/^whales/) { }

    will only be true if whales occurs at the beginning of $_. If whales occurs anywhere else in $_, the match won't succeed.

    Next is the $ metacharacter that only matches at the end of a string:

    if (m/Stimpy$/) { }

    This pattern will only match if Stimpy occurs at the end of the string. These two metacharacters can be combined for interesting effects:

    if (/^$/) {  }   # Matches empty lines
    # Here, the middle "doesn't matter", but the beginning and
    #   endings that must match are well-defined.
    if (/^In the beginning.*Amen$/) {}
    if (m/^/) { }    # Will always match

    When you think you understand $ and ^, read on.

    The first few anchors describe the beginning and ending of a string. These are complicated by the fact that "end of a string" can often mean "end of a logical line" or "end of the storage unit," depending on who you ask. The /m modifier on a regular expression match (or substitution) can change which meaning you want. The same goes for "beginning of a string."

    From now on in this entry, I'll refer to a logical line and a string. A string is the entire storage unit. A logical line begins at the beginning of the string, or just after a newline character, and extends to the next newline character or to the end of the string. Take, for example, the following string of characters in $t:

    $t=q{That whim on the way
    And again I took the day off
    To roam the river's edge};

    The string contains two newline characters: one following the word way and one following off. Three logical lines are in the one string.

    The ^ metacharacter will match at the beginning of the string, unless /m is used as a modifier on the match. In that case, ^ can match at the beginning of any logical line in the string.

    The $ metacharacter will match at the end of the string, unless /m is used as a modifier on the match. If that is the case, $ can match at the end of any logical line in the string.

    So observe the following matches against $t from the preceding:

    if ($t=~/way$/) { }  # False!  Without /m way isn't at the EOL
    if ($t=~/way$/m) { } # True!  With /m way is at the End Of Line
    if ($t=~/^That/) { } # Always true!
    if ($t=~/^And/) { }  # False!  Without /m, And isn't at the beginning
    if ($t=~/^And/m) { } # True!  With /m, And is at the beginning of line
    
    while($t=~/(\w+)$/g) {  # Prints only "edge", because
       print "$1";    #  without /m, there is only one "end of line"
    }
    
    while($t=~/(\w+)$/gm) { # Prints way, off and edge
       print "$1";     #   because each represents an "end of line"
    }                       #   with /m

    The \A metacharacter matches the beginning of the string always, and without regard to the /m modifier being used on the match. So in the sample string $t, the expression $t=~/\A\w+/m will only match the word That. The \z metacharacter similarly will always match at the end of the string, regardless of whether /m is in effect.

    The \Z metacharacter is similar to \z, with one difference. \z anchors only at the very end of the string, behind (to the right of) the final newline character, if any. \Z also anchors just in front of a newline character that ends the string, as well as at the very end of the string.
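
    The difference only shows up when the string ends in a newline. A minimal sketch, using a made-up one-line string:

```perl
use strict;
use warnings;

my $s = "Stimpy\n";   # note the trailing newline

my @results = (
    ($s =~ /Stimpy$/)    ? 1 : 0,  # $ tolerates one trailing newline
    ($s =~ /Stimpy\Z/)   ? 1 : 0,  # so does \Z
    ($s =~ /Stimpy\z/)   ? 1 : 0,  # \z does not: the newline is in the way
    ($s =~ /Stimpy\n\z/) ? 1 : 0,  # \z after the newline is the true end
);

print "@results\n";
# prints: 1 1 0 1
```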


    See Also

    multimatch anchor and word anchors in this book


    Word Anchors

    Usage

    \b \B

    Description

    The word anchors \b and \B are zero-width assertions that deal with the boundary between nonword characters (\W) and word characters (\w). The beginning and ending of a string are considered nonword characters.

    The \b assertion matches at the boundary between \w and \W characters. So, \bFOO matches FOO but only if the character preceding FOO is not a \w. The \B assertion matches everywhere \b doesn't: between two \w characters or between two \W characters. Thus \BFOO will find FOO, but only if it's preceded by a word character.

    $t=q{There was a young lady from Hyde
    Who ate a green apple and died.
    While her lover lamented
    The apple fermented
    And made cider inside her inside.};
    
    $t=~m/\bher\b/;   # Matches "her" but not "There"
    $t=~m/\Bher\B/;   # Matches the "her" in "There"
    $t=~m/\bide\b/;   # Matches nothing!  Not cider nor inside
    $t=~m/\bThere/;   # Matches There; the start of the string is a word boundary

    Within a character class, \b stands for backspace and not a word boundary.

    A common mistake is to assume that \b matches what people consider to be word boundaries. It only knows about \w and \W characters (and _ is a word character). So, clintp@geeksalad.org is three words, U.S.A is also three, but War_And_Peace is only one word.
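
    Those word counts can be confirmed with a short sketch that collects every run of word characters between boundaries (the helper variable names are made up):

```perl
use strict;
use warnings;

# Collect each run of \w characters delimited by word boundaries.
my @email = 'clintp@geeksalad.org' =~ /\b(\w+)\b/g;
my @abbr  = 'U.S.A'                =~ /\b(\w+)\b/g;
my @title = 'War_And_Peace'        =~ /\b(\w+)\b/g;

printf "%d %d %d\n", scalar @email, scalar @abbr, scalar @title;
# prints: 3 3 1
```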


    See Also

    line anchors in this book


    Multimatch Anchor

    Usage

    \G

    Description

    Similar to the line anchors, the multimatch anchor is used to match positions within a string as opposed to actually matching characters. It is in that class of metacharacters called zero-width assertions.

    The \G metacharacter matches the position right after the previous regular expression match. For example, given the following code:

    $_="One fish, two fish, red fish, blue fish";
    m/\b\w{3}\b/g;  # Matches "One"
    m/\G\W+(\w+)/;  # $1 is fish
    m/\b\w{3}\b/g;  # Picks up "two"
    m/\G\W+(\w+)/;  # $1 is fish (number two)

    \G is useful for incrementally bumping along within a string with regular expressions. The location marked by \G can be reset by calling the pos function with an argument:

    pos($_)=0;      # Reset \G to the beginning

    The advantage of \G over look-ahead or look-behind assertions is that you get to write smaller (and simpler!) regular expressions. The /g modifier will cause the match to resume at the position where the last /g left off. The \G assertion allows you to look ahead from that position without destroying it.
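
    To make the "without destroying your last position" point concrete, here is a small sketch that records pos before and after a \G peek. The variable names are invented; the string is the one from the example above.

```perl
use strict;
use warnings;

my $s = 'One fish, two fish, red fish, blue fish';

$s =~ /\b\w{3}\b/g;                  # matches "One"
my $pos_before = pos($s);            # just past "One"

# Peek at the following word.  There is no /g here, so the position
# remembered for the next /g match is left alone.
my ($peek) = $s =~ /\G\W+(\w+)/;
my $pos_after = pos($s);

$s =~ /\b(\w{3})\b/g;                # resumes where the first /g stopped
my $next = $1;

print "$pos_before $peek $pos_after $next\n";
# prints: 3 fish 3 two
```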

    Example Listing 3.6

    # Take apart the given paragraph looking for
    #   phrases joined with the conjunctions "nor" and "or".
    # Note that "now or later" and "later Or no" are both
    #   picked up.  With a single regular expression and no \G
    #   this would be much more complicated.
    
    # C.J. lyrics and music by Bob Dorough (c)1973
    $t=q{Conjunction Junction, what's your function?
    Hookin' up two cars to one when you say
    Something like this choice: Either now or later,
    Or no choice.  Neither now nor ever.  (Hey that's clever)
    Eat this or that, grow thin or fat.};
    
    # The expression here picks up a word at a time, remembering
    #   where we left off with /g
    while( $t=~m/(\w+)/g ) {
       $left=$1;
    
       # Matching with \G here doesn't ruin our position in
       #   the match above...because we didn't use /g.
       if ($t=~/\G\W+(n?or)\W+(\w+)/i) {
           print "$left $1 $2\n";
       }
    }

    See Also

    line anchors in this book


    Match Modifiers

    Usage

    m//cgimosx
    qr//imosx
    s///egimosx

    Description

    This section describes the modifiers used with regular expression matches, substitutions, and compilations. Some modifiers are particular to an operator:

    Modifier    Particular To
    /g          Match and Substitution Operators
    /gc         Match Operators
    /e          Substitution Operators

    These modifiers are discussed along with the particular operators to which they apply elsewhere in this book.

    The /i modifier causes the regular expression to match case insensitively. During the match, no distinction is made between upper and lowercase letters, including those within character classes:

    m/Scrabble/i;    # Matches Scrabble or scrabble or sCrAbBlE or...

    The locale pragma causes a wider range of alphabetic characters to be recognized, and sensitivity of upper- and lowercase characters will expand appropriately.

    The /m modifier causes the meaning of the ^ and $ anchors to change. With the /m modifier, ^ and $ will match at the beginning and end of logical lines (possibly multiple logical lines) within a target string. Some examples of this are in the "Line Anchors" section.

    The /s modifier causes the nature of the . (dot) metacharacter to change. Normally, dot matches any single character except a newline character (\n). With /s in place, the newline is a potential match for .:

    $text=q{You are my sunshine, my only sunshine.
       You make me happy, when skies are grey.};
    m/You.*/;  # Matches from "You are" to "sunshine."
    m/You.*/s; # Matches from "You are" to "grey."
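
    The two modifiers are easy to confuse, so here is a minimal side-by-side sketch: /s changes what dot matches, while /m changes where ^ matches. The variable names are made up and the text is the sample above without its indentation.

```perl
use strict;
use warnings;

my $text = "You are my sunshine, my only sunshine.\n"
         . "You make me happy, when skies are grey.";

# /s: dot may cross the newline.
my ($one_line) = $text =~ /(You.*)/;    # stops at the newline
my ($all)      = $text =~ /(You.*)/s;   # runs to the end of the string

# /m: ^ may match at the start of every logical line.
my $starts_plain = () = $text =~ /^You/g;    # only the start of the string
my $starts_multi = () = $text =~ /^You/mg;   # the start of each line

print "$starts_plain $starts_multi\n";
# prints: 1 2
```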

    The /o modifier causes perl to only compile a regular expression once. Normally, a regular expression containing variables is recompiled each time perl encounters the expression.

    $pat='\w+\W\w+';
    while(<>) {
       if (/$pat/o) {
           $a++;
       }
    }

    In this example, the pattern in $pat is only changed outside of the loop. Perl doesn't realize this, so each pass through the loop, the pattern /$pat/ has to be recompiled by the regex engine. Giving perl the hint with /o that the pattern won't change allows the regex engine to skip the recompilation.

    This optimization only makes sense when the pattern contains a value that could potentially change ($pat shown previously). Also, if the /o optimization is used and you do change the variables that make up the pattern, subsequent pattern matches won't reflect those changes.
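
    A related approach, swapped in here for comparison rather than taken from the /o discussion, is to compile the pattern once with the qr operator listed in the Usage section and match against the compiled object. A minimal sketch with made-up input lines:

```perl
use strict;
use warnings;

my $pat = '\w+\W\w+';
my $re  = qr/$pat/;      # compiled here, once, outside the loop

my $count = 0;
for my $line ('hello world', 'single', 'two words here') {
    $count++ if $line =~ $re;   # reuses the compiled pattern
}

print "$count\n";
# prints: 2
```

    Unlike /o, changing $pat later and building a new qr object gives a new pattern, so the "stale pattern" surprise described above doesn't arise.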

    The /x modifier allows you to lay out a regular expression with whitespace and comments. Specifically, under /x:

    • All whitespace in a regular expression becomes insignificant, except within a character class.

    • Comments extend from the # character to the end of the line, or the end of the expression.

    • Literal #s in the expression must be escaped with a \ or represented as a hex or octal constant.

      # The FAQ answer to "how to print a number with commas"
      $_="1234567890";
      1 while          # Repeat ad nauseam...
         s/^           #    start at the beginning, and
           (-?\d+)     #    absorb all of the digits (maybe a -)
           (\d{3})     #    except for the last three.
         /$1,$2/x;     #    Put a comma before those three

    See Also

    match operator and substitution operator in this book


    Miscellaneous Regular Expression Operators

    Binding Operators

    Usage

    expression =~ op
    expression !~ op

    Description

    The binding operators bind an expression to a pattern match or translation operator. Normally the m//, s///, and tr/// operators work on the variable $_. If you need to work on a variable other than $_, use the binding operator as follows:

    $line=~s/^\s*//;

    This causes the substitution operator to work on $line instead of $_. The return value for the operator on the right is returned by the bind operator.

    The !~ operator works exactly the same as the =~ operator except that the return value is logically inverted. So, $f !~ /pat/ is the same as saying not $f =~ /pat/.

    Because =~ has a higher precedence than assignment, this allows you to do curious (and useful) things with the return value from =~. To return a list from a pattern match on $_, you would normally capture that as follows:

    ($first, $second)=m/(\w+)\W+(\w+)/;

    With the bind operator, it's no different except that you can name your variable:

    ($first, $second)=$sentence=~m/(\w+)\W+(\w+)/;

    Coupling this with the fact that the assignment operator yields an assignable value, you can assign, bind, and alter a variable at the same time:

    # Okay, here's an assignment, bind and change.
    $orig="Won't see this trick in Teach Yourself Perl!";
    ($lower=$orig)=~s/!$/ in 24 hours!/;
    # $lower is now "Won't see this [...] Yourself Perl in 24 hours!"
    
    # Watch this:
    $changes=($upper=$lower)=~s/(\w\w+)/ucfirst $1/ge;

    That last statement is kind of difficult and bears some explanation. The highest precedence operator in this expression is =~, but in order for the bind to happen, the ($upper=$lower) must be taken care of. So, $lower's value is assigned to $upper. The bind then takes $upper and performs the substitution. The substitution operator returns the number of substitutions made. This value passes back through the bind and is assigned to $changes. So $changes is 11 and $upper is "Won't See This Trick...".

    A special note: if the thing to the right of the bind operator is an expression instead of a pattern match, substitution, or translation operator, a pattern match is performed using the expression.

    $pattern="Buick";
    if ($shorts =~ $pattern) {
       print "There's a Buick in your shorts\n";
    }

    Using the bind operator as an implicit pattern match is slower than explicitly calling m// because perl must re-compile the pattern for each pass through the expression.


    See Also

    substitution operator, pattern match operator, and translation operator in this book


    ??

    Usage

    ?pattern?modifiers

    Description

    The ?? operator works the same as the m// operator, with one small difference: it only attempts to match the pattern until it succeeds once, and thereafter the operator no longer tries to match the pattern. (The bare ?pattern? spelling was removed in perl 5.22; on newer perls, write m?pattern? instead.)

    Each instance of the ?? operator maintains its own state. Once latched, the ?? can be reset by using the reset function. This resets all the ?? operators in the current package.

    Example Listing 3.7

    # Prints a summary of a given mailbox file.
    # Unix mailbox format is extremely common and uses a paragraph
    #   beginning with "From " to describe the start of a message header.
    #   The body of the message follows in subsequent paragraphs.
    
    use strict;
    use warnings;
    my($from, $subject, $to)=("","","");
    open(MBOX, "mbox") || die;
    $/="";            # Paragraph mode.
    while(<MBOX>) {
       # m?...? is the spelling that also works on perl 5.22 and later
       $from=$1     if (m?^From: (.*)?m);
       $to=$1       if (m?^To: (.*)?m);
       $subject=$1  if (m?^Subject: (.*)?m);
    } continue {
       if (/^From/ or eof MBOX) {
           print "From: $from\nTo: $to\nSubject: $subject\n\n"
               if $from;
           # The 0-argument reset function resets all of the ??
           #   latches above for use in the next message.
           reset;
           $from=$subject=$to="";
       }
    }

    See Also

    reset, match operator, and match modifiers in this book


    pos

    Usage

    pos
    pos target string

    Description

    The pos function returns the position in the target string where the last m//g left off. If no target string is specified, the target string $_ is used. The position returned is the one after the last match, so

    $t="I am the very model of a modern major general with mojo";
    $t=~m/mo\w+/g;
    print pos($t);

    prints 19, which is the offset of the substring " of a modern...".

    The pos function also can be assigned; doing so causes the position of the next match to begin at that point:

    $t="I am the very model of a modern major general with mojo";
    $t=~m/mo\w+/g;    # Now we're at 19, just as before.
    pos($t)=38;     # Skip forward to the word "general"
    $t=~m/(mo\w+)/g;# Grab the next "mo" word...
    print $1;    # It's "mojo"!

    Example Listing 3.8

    # Sample from a text-processing system, where tags of the form
    #   <#command> are substituted for variables, and other files can
    #   be included, and so on.
    # pos() is used to return to the original matchpoint to re-insert
    #   the new and improved text. 
    
    use strict;
    
    # Just some sample data to play with.
    our $r="Hello, world";
    my $data='bar<#var r/>Foo<#include "/etc/passwd"/>';
    
    while($data=~/(<#(.*?)\/?>)/sg) {
       my($whole, $inside)=($1,$2);
    
       if ($inside=~/var\s+(\w+)/) {    # Grab a variable from main::
           no strict 'refs';
           substr($data, pos($data)-length($whole),
                length($whole))=${'main::' . $1};
         }
       if ($inside=~/include\s+"(.*)"\s*/) { # Include another file..
           open(NEWFH, $1) ||
               die "Cannot open included file: $1";
           {
               local $/;
               my $t=<NEWFH>;
               $t=eval "qq\\$t\\";
               die "Included file $1 had eval error: $@"
                   if $@;
               substr($data, pos($data)-length($whole),
                   length($whole))=$t; 
           }
       }
       # ...and many more
    }
    print $data;  # Gives "barHello, worldFoo[contents of /etc/passwd]"

    See Also

    match operator in this book


    Translation Operator

    Usage

    tr/searchlist/replacement/modifiers
    y/searchlist/replacement/modifiers

    Description

    The tr/// operator is the translation (or transliteration) operator. Each character in searchlist is examined and replaced with the corresponding character from replacement. The tr/// operator returns the number of characters replaced or deleted. Similar to the match and substitution operators, the translation operator will use the $_ variable unless another variable is bound to it with =~:

    tr/aeiou/AEIOU/;     # Change $_ vowels to uppercase
    $t=~tr/AEIOU/aeiou/; # Change $t vowels to lowercase

    The y/// operator is simply a synonym for the tr/// operator, and they are alike in every other respect.

    The tr/// operator doesn't use regular expressions. The searchlist can be expressed as the following:

    • A sequence of characters, as in tr/aeiou/AEIOU/

    • A range of characters, similar to those used in character classes:

        tr/a-zA-Z/n-za-mN-ZA-M/;  # ROT-13 encoding

    Special characters are allowed, such as backslash escape sequences (covered in the "Character Shorthand" section). Special characters that represent classes (\w\d\s) aren't allowed. (tr/// doesn't use regular expressions!)

    No variable interpolation occurs within the tr/// operator. If a character is repeated more than once in the searchlist, only the first instance counts.
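
    Because the lists are not interpolated, a translation whose characters are known only at run time has to be built as a string and passed to eval. The following is a minimal sketch, assuming the hypothetical variables $from and $to hold trusted character lists; untrusted input should never be passed to eval.

```perl
use strict;
use warnings;

my $from = 'abc';          # Hypothetical run-time search list
my $to   = 'xyz';          # Hypothetical run-time replacement list
my $text = 'aabbcc cab';

# Writing tr/$from/$to/ directly would translate the literal characters
# $, f, r, o, and m. Building the operator as a string fills in the
# lists before perl compiles the tr///.
my $changed = eval "\$text =~ tr/$from/$to/";
die "Could not build tr///: $@" if $@;

print "$text ($changed characters changed)\n";
```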

    The replacement list specifies the characters into which the searchlist characters will be translated. If the replacement list is shorter than the searchlist, the last character in the replacement list is repeated. If the replacement list is empty, the searchlist is used as the replacement list (that is, the characters aren't changed, merely counted). If the replacement list is too long, the extra characters are ignored.
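
    The three replacement list rules can be seen side by side in a short sketch. The sample strings and variable names here are made up for illustration.

```perl
use strict;
use warnings;

my $text = "Hello, World";

# Empty replacement list: $text is unchanged and the matches are counted.
my $vowels = ($text =~ tr/aeiouAEIOU//);

# Short replacement list: the last character (y) is repeated, so both
# b and c become y.
(my $short = "aabbcc") =~ tr/abc/xy/;

# Long replacement list: the extra z is ignored and c is left alone.
(my $long = "aabbcc") =~ tr/ab/xyz/;

print "$vowels vowels in '$text'\n";   # 3 vowels in 'Hello, World'
print "$short\n";                      # xxyyyy
print "$long\n";                       # xxyycc
```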

    The modifiers are as follows:

    Modifier

    Meaning

    /c

    Complements the searchlist. This is similar to using a ^ in a character class: all the characters not in the searchlist are used.

    $consonants=$word=~tr/aeiouAEIOU//c; # Count non-vowels (consonants, if $word is all letters)

    /d

    Deletes characters that are found in the searchlist but don't appear in the replacement list. This bends the aforementioned rules about empty or too-short replacement lists.

    $text=~tr/.!?;://d; # Remove punctuation

    /s

    Squashes each run of the same translated character into a single instance of that character. For example,

    $a="Pardon me, boy. Is that the Chattanooga Choo-Choo?";
    $a=~tr/a-z A-Z//s; # Pardon me, boy. Is that the Chatanoga Cho-Cho?
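
    The modifiers can be combined. The following is a minimal sketch of two common combinations, using made-up sample strings: /cd deletes everything outside the searchlist, and /cs turns each run of characters outside the searchlist into a single replacement character.

```perl
use strict;
use warnings;

my $phone = "(555) 867-5309";
(my $digits = $phone) =~ tr/0-9//cd;     # Delete everything that is not a digit

my $line = "one,  two;;three   four";
(my $words = $line) =~ tr/a-zA-Z/ /cs;   # Each run of non-letters becomes one space

print "$digits\n";    # 5558675309
print "$words\n";     # one two three four
```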


    See Also

    character shorthand and character classes in this book


    study

    Usage

    study
    study expression

    Description

    The study function is a potential optimization for perl's regular expression engine. It prepares an expression (or $_ if none is specified) for pattern matching with m// or s///. It does this by prescanning the expression and building a list of uncommon characters seen in the expression, so that the match operators jump right to them as anchors.

    Calling the study function for a second expression undoes any optimizations made for the previously studied expression.

    Whether study will save any time on your regular expression matches depends on several factors:

    • The study process itself takes time.

    • The kind of data that makes up the expression being studied.

    • Whether your search expression uses many constant strings (study might help) or few constant strings (study might not help).

    As always, with any optimization, use the Benchmark module and determine whether there really is a cost savings to using study. Constructing a case in which study is actually useful is difficult. Do not use it indiscriminately.
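
    One way to make that measurement is sketched below. It is only a sketch: the sample data, the patterns, and the subroutine names count_plain and count_studied are invented for illustration, and the timings will vary by machine and perl version. Note also that on perl 5.16 and later the documentation describes study as doing nothing, so no difference should be expected there.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical sample data: 2,000 lines of filler, a few with a rare word.
my @lines = map {
    "line $_: the quick brown fox" . ($_ % 500 ? "" : " xyzzy")
} 1 .. 2000;

# Count matches of two patterns without calling study.
sub count_plain {
    my $hits = 0;
    for my $line (@lines) {
        $hits++ if $line =~ /xyzzy/;
        $hits++ if $line =~ /quick\s+brown/;
    }
    return $hits;
}

# The same work, but each line is studied before it is searched.
sub count_studied {
    my $hits = 0;
    for my $line (@lines) {
        study $line;
        $hits++ if $line =~ /xyzzy/;
        $hits++ if $line =~ /quick\s+brown/;
    }
    return $hits;
}

timethese(200, {
    plain   => \&count_plain,
    studied => \&count_studied,
});
```

    Both subroutines should return the same count; only the reported times are of interest.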


    See Also

    qr in this book


    Quote Regular Expression Operator

    Usage

    qr/pattern/

    Description

    The qr operator takes a regular expression and precompiles it for later matching. The compiled expression then can be used as a part of other regular expressions. For example,

    $r=qr/\d{3}-\d{2}-\d{4} $name/i;
    if (/$r/) {
       # Matched digits-digits-digits and whatever was in $name...
    }

    Similar to the match operator, the delimiters can be changed to any character other than whitespace. Also, using single quotes as delimiters prevents interpolation.
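
    A minimal sketch of both points follows: a compiled pattern reused inside a larger one, and single-quote delimiters used to keep an @ sign from being interpolated. The value of $name and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

my $name = 'Smith';                        # Hypothetical value
my $ssn  = qr/\d{3}-\d{2}-\d{4}/;          # Compiled once
my $rec  = qr/^$ssn\s+\Q$name\E\z/i;       # Larger pattern built from $ssn

# With / delimiters, @example would be interpolated as an array.
# Single-quote delimiters leave the pattern text alone.
my $email = qr'\w+@example\.com';

for my $input ("123-45-6789 SMITH", "123-45-6789 Jones", "bob@example.com") {
    print "$input: record\n" if $input =~ $rec;
    print "$input: email\n"  if $input =~ /^$email\z/;
}
```

    Run as written, this reports the first input as a record and the third as an email, and prints nothing for the second.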

    Example Listing 3.9

    # A short demo of the qr// operator.  The fast subroutine
    #   runs nearly 4 times faster than the slow subroutine
    #   because the qr// operator pre-compiles all of the regular
    #   expressions for &fast.
    # Remember, if you're not sure something is faster: Benchmark it.
    
    use Benchmark;
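    sub slow {
       seek(BIG, 0, 0);
       @pats=qw(the a an);
       while(my $line=<BIG>) {
           # Named variables are needed here: a bare for(@pats) would
           #   alias $_ to the pattern and hide the line being searched.
           for my $pat (@pats) {
               if ($line=~/\b$pat\b/i) {
                   $count{$pat}++;
               }
           }
       }
    }
    sub fast {
       seek(BIG, 0, 0);
       # Pre-compile all of the patterns with
       #   qr//
       @pats=map { qr/\b$_\b/i } qw(the a an);
       while(my $line=<BIG>) {
           for my $pat (@pats) {
               if ($line=~/$pat/) {
                   $count{$pat}++;
               }
           }
       }
    }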
    
    open(BIG, "bigfile.txt") || die "Cannot open bigfile.txt: $!";
    timethese(10, {
       slow => \&slow,
       fast => \&fast, });

    See Also

    match modifiers in this book


    See Part 1

     
     

