The Curious Case of .*

About a month ago, I tweeted what I thought was some pithy advice on the use of .* in regular expressions:

#regex tip of the day: never, _ever_ use .* unless you are forced to do so in a #24-style 
kidnap situation. And even then, think twice.

Funny and topical! But as it turns out, I spoke too soon. Early in Chapter 5 of the superlative Mastering Regular Expressions, the author (Jeffrey Friedl) demonstrates how the judicious use of .* can actually help you to write some very efficient regular expressions.

Disadvantages of .*

.* is kind of like the SELECT * of SQL, but about a million times worse. Here’s why: .* is greedy, so whenever it shows up in a regular expression, a traditional backtracking engine first lets it swallow everything up to the end of the line (by default, the dot doesn’t match a newline). The engine then continues evaluating the rest of your regex from the end of the line, backing up one character at a time if it needs to in order to make a match.

For example, suppose you have the string ‘Hello’, and your regex is ‘.*!’. The match will obviously fail here, because the string doesn’t contain a ‘!’. But here’s how most regular expression engines will go about testing this expression:

  1. The regex engine sees the .* and immediately skips to the end of the line, which you can think about as the position after the ‘o’.
  2. The engine doesn’t find a ‘!’ there, so it backs up one character and evaluates the ‘o’.
  3. Still no ‘!’, so it backs up again and evaluates the second ‘l’.
  4. This process continues until it reaches the ‘H’.

When it reaches the ‘H’ and can’t find a match, you would probably think that the regex engine is done testing and declares failure. Right?

Afraid not. What happens next is that the regex engine re-evaluates the string, but this time, it starts its evaluation from the position right after the ‘H’. It again skips evaluation to the end of the line; it again fails to find a ‘!’ and so continues backing up one character at a time until it reaches the ‘e’. Still no match, so it kicks forward to the position after the ‘e’ and tries again.
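To make that walk-through concrete, here is a minimal sketch in Python that imitates the behavior described above. It is a hand-rolled simulation, not how any particular engine is implemented: the function name trace_dot_star_literal is made up for illustration, it only handles patterns of the form ‘.*’ followed by one literal character, and it assumes the text is a single line. It also counts the attempt where .* has given back everything, so each pass records one more test than the rough tally in the prose.

```python
def trace_dot_star_literal(text, literal):
    """Imitate a naive backtracking engine trying `.*<literal>` on one line.

    Returns (matched, attempts), where each attempt is a (start, end) pair:
    the engine began at `start`, let .* cover text[start:end], and then
    tested whether `literal` comes next.
    """
    attempts = []
    # The engine tries every starting position in turn.
    for start in range(len(text) + 1):
        # Greedy .* first grabs everything to the end of the line,
        # then gives characters back one at a time.
        for end in range(len(text), start - 1, -1):
            attempts.append((start, end))
            if text[end:end + 1] == literal:
                return True, attempts
    return False, attempts


matched, attempts = trace_dot_star_literal("Hello", "!")
print(matched)        # False
print(attempts[:6])   # first pass: [(0, 5), (0, 4), (0, 3), (0, 2), (0, 1), (0, 0)]
print(attempts[6])    # second pass starts after the 'H': (1, 5)
print(len(attempts))  # 21
```

Every pass starts by jumping to the end of the line again, which is exactly the wasted work the next section tries to quantify.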

Efficiency of .*

Just how inefficient is this? The example string ‘Hello’ is five characters long. On the first pass, the regex engine evaluates the pattern at 5 (or n) different positions. On the next pass, it evaluates 4 (or n-1) positions, then 3 (n-2) positions, and so on. The total number of evaluations is therefore the sum 1 + 2 + ... + n = n(n+1)/2, which grows as O(n^2). Eep. Clearly then, a poorly placed .* can really screw you. *
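If you want to see the growth for yourself, here is a small sketch that just does the arithmetic. The function count_attempts is a hypothetical helper, not part of any regex library, and it models a naive engine with none of the optimizations mentioned in the footnotes. It counts one extra test per pass (the one where .* matches nothing at all), so the totals come out a little higher than the rough tally above, but the quadratic shape is the same.

```python
def count_attempts(n):
    """Tests a naive engine makes for `.*!` on an n-character line with no '!'."""
    total = 0
    # One pass per starting position, including the position at the very end.
    for start in range(n + 1):
        # .* can cover anywhere from (n - start) characters down to zero.
        total += (n - start) + 1
    return total


for n in (5, 10, 100, 1000):
    print(n, count_attempts(n))
# 5 21
# 10 66
# 100 5151
# 1000 501501
```

Multiplying the line length by 10 multiplies the work by roughly 100, which is the O(n^2) behavior in action.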

Advantage of .*

Recall the general behavior of .*: skip to the end of the line and try to make a match. If the match fails, back up one character and try again. How can we use this to our advantage?

Once again, I’ll adapt an example from Mastering Regular Expressions. Suppose you have a URL and you want to find what I’ll call the resource (the last part of the URL). If your URL is http://drupal.org/node/12345, you want to find ‘12345’. Assuming that the URL is valid, we know that the resource will always start with a slash – specifically, with the last slash in the URL. Wouldn’t it be handy to skip to the end of the line and look for the slash in reverse?

Aha! This is exactly what .* allows us to do! A pattern of ‘.*/’ will find everything up to the last slash, and it will find it quickly, because it starts at the end of the string, backing up one character at a time until it finds that last slash. If we then obliterate the matched text, the only thing left in our string will be the resource. Brilliant!
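Here is roughly what that looks like in Python, using the standard re module. This is just a sketch: the function name resource_of is my own invention, and it assumes the URL is valid, sits on a single line, and has no query string or fragment containing a slash.

```python
import re


def resource_of(url):
    """Strip everything up to and including the last slash."""
    # Greedy .* runs to the end of the line, then backs up until it
    # can match the '/' that follows it: the last slash in the URL.
    return re.sub(r".*/", "", url, count=1)


print(resource_of("http://drupal.org/node/12345"))  # 12345
```

Note that a URL ending in a slash leaves you with an empty string, since the last slash is then the last character.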

Final Thoughts on .*

To sum up: .* is usually your enemy, but sometimes it can be your best friend. And just like with your real-life best friend, you just need to know how to manipulate him in order to get what you want.

Ha. Kidding.

Happy patterning.

* I’ve glossed over some details here. The (more complete) truth is that some engines are smart enough to figure out that if you evaluate an entire line of text from the start of the line and the pattern fails to match, then it ain’t gonna match if you start from the second position in the line either, so they stop right there. This is known as ‘implicit-anchor optimization’, and is very cool. But if the .* isn’t the very first thing in your regex, then you don’t get that optimization. **

** Actually, some engines are smart enough to figure out that you can sometimes add an implicit anchor even when .* isn’t the very first thing in your regex. But it isn’t something you should rely on if portability of your regex is an issue. ***

*** Bottom line here, folks: know your tool. There are so many different versions, flavors, and implementations of regular expressions that in order to use them successfully, you have to learn the ins and outs of the implementation you’re working with.
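One way to avoid depending on the implicit anchor is to write the anchor yourself. The sketch below uses Python’s re module and assumes single-line input, where ‘^.*!’ and ‘.*!’ accept exactly the same strings. Whether your engine would have added the anchor on its own is something I’m not assuming either way; the point is only that with an explicit ^, attempts at later starting positions fail immediately instead of rescanning the line.

```python
import re

unanchored = re.compile(r".*!")
anchored = re.compile(r"^.*!")  # explicit anchor: only position 0 can succeed

for line in ("Hello", "Hello!", "Hi! Bye!"):
    a = unanchored.search(line)
    b = anchored.search(line)
    # On single-line input both patterns agree on the result.
    print(repr(line), a.group() if a else None, b.group() if b else None)
# 'Hello' None None
# 'Hello!' Hello! Hello!
# 'Hi! Bye!' Hi! Bye! Hi! Bye!
```

On multi-line text the two patterns are no longer interchangeable unless you also think about flags like re.MULTILINE, so check that against your own data.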
