
So You Want to Learn Regular Expressions? Part 3: Positioning

The full list of Regular Expression Articles I've done:

If you've been following the public discussion on Robbin Steif's blog about this series, you'll no doubt be aware that she has been prompting me (in a really unsubtle fashion ;-) ) to explain the use of the "beginning" and "end" characters: ^ and $ respectively.

So that's what this episode in the series will be focusing on.

A Quick Recap

We've explored the use of wildcards to match generic chunks of text. And discovered that Regular Expressions begin life in jail and need to be escaped to become normal.

Well it's even sadder for those poor Regular Expressions. Not only in jail, but completely lost. They rarely have any idea of where they are.

Does that matter?

Good question. Sometimes it does. Sometimes it doesn't. It Depends. :-)

Continuing the "lost" analogy, a regular expression may find its way without any map, but with a map the path may be far simpler. Which can mean faster and more efficient - always good goals.

The Map

So two new jailed characters to familiarise ourselves with:
caret: ^
which means the start of a line; and
dollar: $
which means the end of a line

Which is all very well and good, but the trick with any Regular Expression character is how and why do you use them!

As I mentioned earlier, efficiency is a huge win with these two. More so caret than dollar, for reasons that are not worth explaining here. Trust Me. If you do something once or twice, efficiency usually doesn't matter. If you're doing the same thing tens of thousands of times a second, efficiency matters hugely.

I've written programs that actually reverse data coming in so the Regular Expression can use a start of line caret instead of a dollar and hence run a LOT faster. This is usually a Bad Idea(tm). :-)
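For the curious, here is a rough sketch of what that reversing trick could look like. This is Python's re module purely for illustration; the function name and the sample file names are made up for this example, and it makes no claim about how my original programs were written or how much faster (if at all) it runs on your regex engine.

```python
import re

# Hypothetical helper: reverse the incoming line, then anchor the
# (also reversed) suffix at the START with caret instead of using dollar.
REVERSED_JPG = re.compile(r"^gpj\.")  # ".jpg" spelled backwards

def ends_with_jpg_reversed(line: str) -> bool:
    return REVERSED_JPG.search(line[::-1]) is not None

# The plain dollar version, for comparison.
def ends_with_jpg(line: str) -> bool:
    return re.search(r"\.jpg$", line) is not None
```

Both functions should agree on any single line of data. The reversed one is simply harder to read, which is a good part of why it is usually a Bad Idea.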

So: Speed and Efficiency. First two reasons. But there is another. And this can be even more important. It can make the Regular Expression easier to understand and read.

How?

By giving a positional character, you are helping to supply some context. This is at the start of a line. This is at the end. The Regular Expression starts to become self documenting. As a person reading a Regular Expression, you can start to understand what the Regular Expression is trying to match immediately. Start. End. Easy!

Usage

As for using them? Generally I find that caret is used when the start of a line is relatively fixed in format. It could be a complete URL, which will always start with http:// (1). We know that this is always at the start of our line of data, and hence we can explicitly make it so:

^http://
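To see the caret doing its job, here is a minimal sketch using Python's re module. The choice of Python and the sample lines are assumptions for illustration only; the pattern itself is the one above.

```python
import re

# Hypothetical sample data, not taken from any real log file.
lines = [
    "http://www.stedee.id.au/images/alpha.jpg",
    "see http://www.stedee.id.au/ for details",
    "ftp://example.org/file.txt",
]

anchored = re.compile(r"^http://")    # must sit at the start of the line
unanchored = re.compile(r"http://")   # may appear anywhere in the line

starts_with_http = [s for s in lines if anchored.search(s)]
contains_http = [s for s in lines if unanchored.search(s)]
```

Only the first line ends up in starts_with_http, while contains_http also picks up the second line, where http:// sits in the middle.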

Using dollar is much the same, just in reverse. You know the end of the line of data, but are unsure about the earlier stuff, e.g. searching for all JPEG images:

\.jpg$

Note that I've escaped the dot to force a match against an actual dot character.
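A small sketch of why that escape matters, again assuming Python's re module and some made-up file names:

```python
import re

escaped = re.compile(r"\.jpg$")    # a literal dot, then "jpg", at the end
unescaped = re.compile(r".jpg$")   # ANY single character, then "jpg", at the end

# Hypothetical file names for illustration.
names = ["photo.jpg", "photojpg", "photo.jpg.bak"]

escaped_hits = [n for n in names if escaped.search(n)]
unescaped_hits = [n for n in names if unescaped.search(n)]
```

The escaped pattern accepts only photo.jpg. The unescaped one also accepts photojpg, because the bare dot happily matches the "o". Neither accepts photo.jpg.bak, since the dollar insists that jpg is the very end.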

In both these examples we have not applied any wildcards. Recall that we have the two implied wildcards at the start and end of any Regular Expression? Using these positional characters removes one of those implied wildcards.

Obviously, caret removes the left implied wildcard. Dollar removes the right implied wildcard.

So ^http:// could also be written as ^http://.* and have the exact same meaning, just as .*\.jpg$ means the same as \.jpg$.
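If you'd like to convince yourself, here is a quick check sketched in Python (the re module, the helper name and the sample URLs are all assumptions for this example). It compares the yes/no answer each pair of patterns gives for the same data.

```python
import re

# Hypothetical sample data for illustration.
samples = [
    "http://www.stedee.id.au/images/alpha.jpg",
    "https://example.org/beta.jpg",
    "http://www.stedee.id.au/shop.jpgs/buyme",
]

# Each pair: the short form, and the form with the wildcard spelled out.
pairs = [
    (r"^http://", r"^http://.*"),
    (r"\.jpg$", r".*\.jpg$"),
]

def answers(pattern, items):
    """Return True/False per item: did the pattern match anywhere?"""
    return [re.search(pattern, s) is not None for s in items]

same = all(answers(a, samples) == answers(b, samples) for a, b in pairs)
```

One caveat: "same meaning" here is about whether a line matches or not. The text actually grabbed by the match is different, since the spelled-out wildcard swallows the rest of the line. If you only care about yes or no, the shorter form is the one to use.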

Cruising!

A Negative Example

Suppose that instead of JPEG images we were looking for GIFs. For the sake of simplicity we'll assume that none of these are tagged in any way: no image.gif?id=blah style, or at least that we don't want to see that style.

e.g.:
http://www.stedee.id.au/images/alpha.gif
http://www.stedee.id.au/stuff/beta.gif
and so on.

But let's make it a bit harder. That www may or may not be there. Also, I have several other domains that all point to the one web site. So we really can't rely on the domain portion at all, or on the directory as given. Which means it needs to be wildcarded in some way, shape or form.

The immediately obvious solution would be to do a simple Regular Expression like so:

\.gif

The trap with this, and why position becomes important, is that it can also match against a URL like this:
http://www.stedee.id.au/shop.gifts/buyme

See the .gif in shop.gifts? I admit this is a bit contrived, but hopefully you do see the danger. Unless you explicitly tell it otherwise, a Regular Expression will try to match wherever it can. This is also where indiscriminate use of wildcards can be so dangerous - you'll match more than you intended to.

So let's back up a bit and see what we do know:

  1. it's literally: .gif
  2. at the end of the line.
  3. could be anything before the .gif

Which obviously also gives us the answer:

\.gif$

This cannot match against shop.gifts, as the use of dollar forces the 'f' character in gif to be the last character at the end of the line. And in the case of shop.gifts, the 'f' is not at the end of the line. So no match.
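Here is the same comparison as a short Python sketch. The URLs are the ones from this article; the use of Python's re module is my assumption for the example.

```python
import re

urls = [
    "http://www.stedee.id.au/images/alpha.gif",
    "http://www.stedee.id.au/stuff/beta.gif",
    "http://www.stedee.id.au/shop.gifts/buyme",
]

# Without the dollar: matches ".gif" wherever it turns up.
loose = [u for u in urls if re.search(r"\.gif", u)]

# With the dollar: ".gif" has to be the end of the line.
strict = [u for u in urls if re.search(r"\.gif$", u)]
```

The loose list contains all three URLs, shop.gifts included. The strict list contains only the two real images. A small wrinkle if you do use Python: its dollar will also match just before a single trailing newline, so a line read from a file with its newline still attached continues to match.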

Summation

When we as people look for a pattern we automatically apply the rules - this is at the start, this is at the end - without even realising that we're doing it. Regular Expressions aren't so clever. They need to be explicitly told that this is at the start and that is at the end. Using caret (^) and dollar ($) is how we inform the Regular Expression processor of these rules.

Till next time, when we'll focus on advanced wildcard use.

