
So You Want to Learn Regular Expressions? Part 6: Errr... Or...

The full list of Regular Expression Articles I've done:

In the previous article with our box of chocolates, we used a method for choosing between one or more of several, more or less random, characters. [abc]+ for example.
But a common task in any web analytics is to be able to choose between several different items and treat them identically.
e.g. Images: gif, jpg, png
or, Pages: htm, html, cfm, php, asp and so on.

Or to put the first case pretty bluntly in English, we want "gif" or "jpg" or "png" at the end of a file name request.

Or?

Robbin Steif has done a really great starter on this one.
Regular Expressions Part VI: OR

Do check Robbin's words out first and we'll see if we can expand on them.

A Quick Recap

The vertical bar, or pipe character '|' means "or" in a Regular Expression. It's another of those jailed characters.

So let's rephrase our first example from above.
We want Images. We want: gif or jpg or png.

Phrased like that, we can write this as a Regular Expression pretty easily:


gif|jpg|png

And really that's about it!
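If you'd like to try this outside of your analytics tool, here is a rough sketch using Python's re module. The request paths are made up for illustration, and Python's regex engine is not necessarily the same as the one in your analytics package, so treat it as a sandbox rather than a guarantee.

```python
import re

# The pattern from the article: "gif" or "jpg" or "png".
image_pattern = re.compile(r"gif|jpg|png")

# Hypothetical requests, invented for this example.
requests = ["/images/logo.gif", "/photos/cat.jpg", "/index.html", "/icons/ok.png"]

# re.search looks for the pattern anywhere in the string.
image_requests = [r for r in requests if image_pattern.search(r)]
print(image_requests)
```

Only the three image requests survive the filter; "/index.html" contains none of the three alternatives.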

Pages, much the same:


htm or html or cfm or php or asp

becomes:


htm|html|cfm|php|asp

Which is also identical to:


asp|htm|cfm|php|html

The order doesn't matter for a simple match-or-no-match test like this one. It does matter from an efficiency perspective, but not for correctness. Don't worry too much about speedy Regular Expressions yet. It will come.
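A quick way to convince yourself is to run both orderings over the same strings and compare the yes/no answers. This is only a sketch in Python's re module, with sample paths invented for the purpose.

```python
import re

# The two orderings from the article.
first_order = re.compile(r"htm|html|cfm|php|asp")
second_order = re.compile(r"asp|htm|cfm|php|html")

# Hypothetical request paths, invented for this example.
samples = ["/index.html", "/about.htm", "/cart.php", "/logo.gif", "/login.asp"]

# Record whether each ordering finds a match in each sample.
results_first = {s: bool(first_order.search(s)) for s in samples}
results_second = {s: bool(second_order.search(s)) for s in samples}

# Both orderings give the same yes/no answer for every sample.
print(results_first == results_second)
print(results_first)
```

Every page matches under both orderings and "/logo.gif" matches under neither.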

Note that we also removed the spaces. In English we need spaces to separate words from each other, but not always: the full stop at the end of a sentence has no space before it and loses none of its meaning, and the same goes for commas and such. In a Regular Expression, the space is treated no differently from an 'a' or a 'Z'. Spaces do matter, and you should only put one in if you want to match against a space.

All Alone

When used on their own, like we did above, using the | is fine. It doesn't need anything else to make itself explicit.

The trick comes when you want to combine an OR against something else.

Looking at our page example, suppose you wanted all index.WHATEVER pages.
index.php, index.html, index.cfm and so on.

You could write:


index.asp|index.htm|index.cfm|index.php|index.html

And that will work, and work fine, but it's pretty obviously not that efficient or elegant. The negation of those words forms two of the most powerfully damning phrases in the entire pantheon of curses available to IT folk. "We hates them my precious"!

What we really want is something like:
index. followed by asp|htm|cfm|php|html

Sub-Expressions

Sorry about that. Actually went and used the right name in the title.

Sadly though, this is one time where the correct name should help. Recall from your days in school and early math, when you learnt about parentheses?
They were probably used initially to show the difference between:
(1 + 2) x 3 = 9 and 1 + 2 x 3 = 7

And to explain why those two sums are different. The parentheses allow the 1 + 2 sub-expression to take precedence over the x 3 multiplication.

You have, fortunately!, the same concept with a Regular Expression.


index\.(asp|htm|cfm|php|html)

Note that I've escaped the dot to force it to be itself, and not a special jailed character.
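To see what the parentheses and the escaped dot each buy you, here is a small sketch in Python's re module. The "ungrouped" and "unescaped" patterns are deliberately wrong variants written for comparison, and the sample strings are invented.

```python
import re

# Without parentheses, "index\." only belongs to the first alternative,
# so the pattern means: "index.asp" OR "htm" OR "cfm" OR "php" OR "html".
ungrouped = re.compile(r"index\.asp|htm|cfm|php|html")

# The article's pattern: "index." followed by one of the extensions.
grouped = re.compile(r"index\.(asp|htm|cfm|php|html)")

# Same as grouped, but with the dot left unescaped (matches any character).
unescaped = re.compile(r"index.(asp|htm|cfm|php|html)")

ungrouped_hit = bool(ungrouped.search("/contact.htm"))      # "htm" alone is enough
grouped_hit = bool(grouped.search("/contact.htm"))          # no "index." here
unescaped_hit = bool(unescaped.search("/index-php-guide"))  # "-" stands in for the dot
escaped_hit = bool(grouped.search("/index-php-guide"))      # needs a real "."

print(ungrouped_hit, grouped_hit, unescaped_hit, escaped_hit)
```

The ungrouped and unescaped variants both match strings we never meant to include, while the grouped, escaped pattern rejects them.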

You may find it useful when you're deciphering such a Regular Expression to break it down into its component parts. I do it myself, so there's no shame in it!


index\.
(
   asp|
   htm|
   cfm|
   php|
   html
)

Laid out like this, you may be able to see the breakdown of discrete parts that make up this Regular Expression more clearly.
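Some regex engines will even accept a pattern written in this broken-down layout. Python's re module does, via its re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The sketch below is Python-specific; I'm not assuming your analytics tool accepts the same layout.

```python
import re

# The same pattern as index\.(asp|htm|cfm|php|html), laid out one part per line.
# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
index_pages = re.compile(
    r"""
    index\.      # the literal text index followed by a real dot
    (
        asp |
        htm |
        cfm |
        php |
        html
    )
    """,
    re.VERBOSE,
)

# The compact, single-line form for comparison.
compact = re.compile(r"index\.(asp|htm|cfm|php|html)")

# Hypothetical paths, invented for this example.
samples = ["/index.php", "/index.html", "/contact.htm", "/index.gif"]
verbose_results = [bool(index_pages.search(s)) for s in samples]
compact_results = [bool(compact.search(s)) for s in samples]
print(verbose_results == compact_results)
```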

The Trap

This isn't quite the end of this particular regular expression, 'cause it's not actually correct.

What!!! Why?

Because it will also match "old_index.htm" which is not something we wanted it to do.

Knowing a little bit about web servers, you'll no doubt be aware that the actual URL would be "/old_index.htm" for the page we don't want, and "/index.htm" or "/index.cfm" for two examples of the desired results.
It may even be "/cart/index.php" which is also acceptable.

The fix is simple, and recall, this is mainly for use with Google Analytics:


/index\.(asp|htm|cfm|php|html)

The trick here is that there was a little bit of extra information that we did know about, but that wasn't explicitly spelt out for us. Regular Expressions are like any computer software. They don't know what you meant, only what you told them.
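The difference the leading slash makes can be checked with a small Python sketch, using the three URLs mentioned above. As before, this uses Python's re module as a stand-in and is not a statement about how Google Analytics evaluates the pattern internally.

```python
import re

without_slash = re.compile(r"index\.(asp|htm|cfm|php|html)")
with_slash = re.compile(r"/index\.(asp|htm|cfm|php|html)")

# The URLs discussed in the article.
urls = ["/index.htm", "/cart/index.php", "/old_index.htm"]

# Which URLs does each pattern match?
matched_without = [u for u in urls if without_slash.search(u)]
matched_with = [u for u in urls if with_slash.search(u)]

print(matched_without)  # includes the unwanted /old_index.htm
print(matched_with)     # only the pages actually named index
```

With the slash in place, "index" has to sit directly after a "/", so "/old_index.htm" drops out while "/cart/index.php" still matches.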

Steak Knives?

One last little bonus extra for this article.

Notice in our Page examples that we had "htm" and "html"? Well we can make this Regular Expression just a tad more efficient and elegant. Yeah, there's those words again...


/index\.(asp|html?|cfm|php)

This is where going beyond the basics begins. We are now starting to combine multiple simple concepts to form quite complex Regular Expressions.

If you don't see how this one works, let's simplify a little:


/index\.html?

The question mark is a counter against the 'l'. Essentially you can have an 'l' or not have an 'l'.
/index.htm or /index.html are your only two choices.
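Here is a short Python sketch of the optional 'l' in action. The third sample, "/index.ht", is an invented near miss to show that the 'm' is still required.

```python
import re

# "/index.htm" followed by an optional "l".
page = re.compile(r"/index\.html?")

def matched_text(url):
    """Return the part of url the pattern matched, or None if it did not match."""
    m = page.search(url)
    return m.group() if m else None

htm_match = matched_text("/index.htm")
html_match = matched_text("/index.html")
near_miss = matched_text("/index.ht")

print(htm_match, html_match, near_miss)
```

The pattern picks up the 'l' when it is there and carries on without it when it isn't.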


And that brings us to the end of this article. Believe it or not, but that's about all you really need to know about Regular Expressions. With understanding of the concepts in these 6 articles, you now have the basic building blocks to handle 99% or more of all Regular Expressions you are likely to ever write.

Next up, we'll look at more complex, real life examples. So if you have any particular questions send 'em in.


Very awesome

This has been so helpful. Today I wanted to look at the search terms on my website where people weren't looking for me but had just found the site by searching the web. So I wanted to get all the company terms and personal terms out. Now that you have taught me so much, I did a filter-out like this:

luna|robb?in

It was perfect.

My Pleasure!

Only too glad to have helped.

Would you need to filter on your last name as well?
I don't get many doing that here, but do get some coming in via a last name only search.

The awffull demo pages give a (heavily filtered!) quick grab-bag of the various phrases.

I suspect the person(s) searching for "v the visitors" got the wrong site. :-)
