So You Want to Learn Regular Expressions? Part 3: Positioning
The full list of Regular Expression Articles I've done:
- What and How? - Wherein we look at What a Regular Expression actually is, and How they basically work.
- Wild Cards - A starting look at the use of wildcards. Ways to find lots of stuff.
- Positioning - Finding stuff is good. Being able to specify where stuff may be found is better.
- More Wild Cards - More ways to find lots of stuff.
- Just Like a Box of Chocolates - The Ultimate Analogy. Single choices from between Square Brackets.
- Choices, Choices, Choices.... - Exposing bigger choices: Err... Or.
- Filtering IP Addresses - Wherein I bore the reader to tears explaining how to RegEx for IP Addresses.
If you've been tracking the public discussion on Robbin Steif's blog regarding this series, you'll no doubt be aware that she was prompting me (in a really unsubtle fashion ;-) ) to explain the use of the "beginning" and "end" characters: ^ and $ respectively.
So that's what this episode in the series will be focusing on.
A Quick Recap
We've explored the use of wildcards to match generic chunks of text. And discovered that Regular Expressions begin life in jail and need to be escaped to become normal.
Well it's even sadder for those poor Regular Expressions. Not only in jail, but completely lost. They rarely have any idea of where they are.
Does that matter? Good question. Sometimes it does. Sometimes it doesn't. It Depends. :-)
Continuing the "lost" analogy, a regular expression may find its way without any map, but with a map it may be a far simpler path. Which can mean faster and more efficient - always good goals.
The Map
So two new jailed characters to familiarise ourselves with:
caret: ^
which means the start of a line; and
dollar: $
which means the end of a line
Which is all very well and good, but the trick with any Regular Expression character is how and why do you use them!
As I mentioned earlier, efficiency is a huge win with these two. More so caret than dollar, for reasons that are not worth explaining here. Trust Me. If you do something once or twice, efficiency usually doesn't matter. If you're doing the same thing tens of thousands of times a second, efficiency matters hugely.
I've written programs that actually reverse data coming in so the Regular Expression can use a start of line caret instead of a dollar and hence run a LOT faster. This is usually a Bad Idea(tm). :-)
So: Speed and Efficiency. First two reasons. But there is another. And this can be even more important. It can make the Regular Expression easier to understand and read.
By giving a positional character, you are helping to supply some context. This is at the start of a line. This is at the end. The Regular Expression starts to become self documenting. As a person reading a Regular Expression, you can start to understand what the Regular Expression is trying to match immediately. Start. End. Easy!
Usage
As for using them? Generally I find that caret is used when the start of a line is relatively fixed in format. It could be a complete URL which will always start with a http:// (1), so we know that this is always at the start of our line of data, and hence we can explicitly make it so:
^http://
Using dollar is much the same, just in reverse. You know the end of the line of data, but are unsure about the earlier stuff. eg. Searching for all Jpeg images:
\.jpg$
Note that I've escaped the dot to force a match against an actual dot character.
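For the programmers following along, here is a minimal sketch of those two patterns in Python. The sample URLs are made up for illustration (the example.org one in particular is hypothetical), and Python's re module is just one regex flavour; Google Analytics may differ in the details.

```python
import re

# Illustrative sample data only; none of these need to be real pages.
urls = [
    "http://www.stedee.id.au/images/alpha.jpg",
    "https://www.stedee.id.au/images/beta.jpg",
    "ftp://example.org/mirror/http://notes.txt",
]

# Caret pins the pattern to the start of the text.
starts_with_http = re.compile(r"^http://")
# Dollar pins the pattern to the end; the dot is escaped so it is a literal dot.
ends_with_jpg = re.compile(r"\.jpg$")

# re.search looks anywhere in the text, so the anchors do the positioning.
http_urls = [u for u in urls if starts_with_http.search(u)]
jpg_urls = [u for u in urls if ends_with_jpg.search(u)]

print(http_urls)
print(jpg_urls)
```

Only the first URL starts with http:// exactly. The third one contains http:// further along, but the caret rules it out. The first two both end in .jpg.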
In both these examples we have not applied any wildcards. Recall that we have the two implied wildcards at the start and end of any Regular Expression? Using these positional characters removes one of those implied wildcards.
Obviously, caret removes the left implied wildcard. Dollar removes the right implied wildcard.
So ^http:// could also be written as ^http://.* and have the exact same meaning. Likewise, .*\.jpg$ is the same as \.jpg$.
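If you'd rather check that than take my word for it, here is a rough Python sketch. It assumes we only care about the yes/no answer (does this line match or not), which is how a filter uses a pattern. The helper name same_verdict and the sample lines are my own inventions for the example.

```python
import re

# Made-up sample lines: some should match, some should not.
samples = [
    "http://www.stedee.id.au/images/alpha.jpg",
    "https://www.stedee.id.au/images/beta.jpg",
    "www.stedee.id.au/photo.jpg?size=large",
    "just some text",
]

# Each pair is (short form, form with the implied wildcard spelled out).
pairs = [
    (r"^http://", r"^http://.*"),
    (r"\.jpg$", r".*\.jpg$"),
]

def same_verdict(short, spelled_out, text):
    """True when both patterns agree on whether the text matches."""
    return bool(re.search(short, text)) == bool(re.search(spelled_out, text))

all_same = all(
    same_verdict(short, spelled_out, text)
    for short, spelled_out in pairs
    for text in samples
)
print(all_same)
```

The text that actually gets matched is longer with the wildcard spelled out, but the match or no-match verdict comes out the same for every sample here.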
A Negative Example
Suppose that instead of jpeg images, we were looking for gifs. For the sake of simplicity we'll assume that none of these are tagged in any way. No image.gif?id=blah style, or that we don't want to see that style.
eg:
http://www.stedee.id.au/images/alpha.gif
http://www.stedee.id.au/stuff/beta.gif
and so on.
But let's make it a bit harder. That www may or may not be there. Also, I have several other domains that all point to the one web site. So we really can't rely on the domain portion at all, or directory as given. Which means - it needs to be wildcarded in some way shape or form.
The immediately obvious solution would be to do a simple Regular Expression like so:
\.gif
The trap with this, and why position becomes important, is that it can also match against a URL like this:
http://www.stedee.id.au/shop.gifts/buyme
See the .gif in shop.gifts? I admit this is a bit contrived, but hopefully you do see the danger. Unless you explicitly tell them, a Regular Expression will try and match wherever it can. This is also where indiscriminate use of wild cards can be so dangerous - you'll match more than you intended to.
So let's back up a bit and see what we do know:
- it's literally: .gif
- at the end of the line.
- could be anything before the .gif
Which obviously also gives us the answer:
\.gif$
This cannot match against shop.gifts as the use of dollar forces the 'f' character in gif to be the last character at the end of the line. And in the case of shop.gifts, the 'f' is not at the end of the line. So no match.
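Here is the same trap as a small Python sketch, using the three URLs from this section. It assumes each URL is tested on its own as a single line of data; the variable names loose and strict are just labels I picked.

```python
import re

urls = [
    "http://www.stedee.id.au/images/alpha.gif",
    "http://www.stedee.id.au/stuff/beta.gif",
    "http://www.stedee.id.au/shop.gifts/buyme",
]

# Unanchored: matches .gif wherever it turns up, including inside shop.gifts.
loose = [u for u in urls if re.search(r"\.gif", u)]

# Anchored with dollar: .gif must be the very last thing on the line.
strict = [u for u in urls if re.search(r"\.gif$", u)]

print(len(loose))
print(len(strict))
```

The unanchored pattern catches all three URLs. The anchored one keeps only the two real gif images and drops the shop.gifts URL.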
Summation
When we as people look for a pattern we automatically apply the rules - this is at the start, this is at the end, without even realising that we're doing it. Regular Expressions aren't so clever. They need to be explicitly told that this is at the start and that is at the end. Using caret (^) and dollar ($) is how we inform the Regular Expression processor of these rules.
Till next time, when we'll focus on advanced wildcard use.
(1) Keeping in mind that I'm targeting this series at non IT folk who are principally exposed to Regular Expressions via Google Analytics. The forward slash character '/' is usually a special/jailed character in a regular Regular Expression. In GA it's just like an 'a' or a 'b'. IMHO this was a good call on Google's part. Even if it could cause severe confusion down the track.
For my usual readers: the forward slash character is akin to the double quotes you would put around the words spoken by someone. Jane said "Hello Bill". Bill said "Hello Jane". The forward slash has the exact same meaning to a Regular Expression. It shows what is part of the Regular Expression as distinct from everything else around it.
Now when it comes to URLs this can be nightmarish, as to be correct, a full URL would need each forward slash escaped.
Instead of a nice simple: /shop.gifts/expensive/
You would need: /\/shop.gifts\/expensive\//
Yuk. Sadly you do get used to it. Or so I keep telling myself...


