So You Want to Learn Regular Expressions?
The full list of Regular Expression Articles I've done:
- What and How? - Wherein we look at What a Regular Expression actually is, and How they basically work.
- Wild Cards - A starting look at the use of wildcards. Ways to find lots of stuff.
- Positioning - Finding stuff is good. Being able to specify where stuff may be found is better.
- More Wild Cards - More ways to find lots of stuff.
- Just Like a Box of Chocolates - The Ultimate Analogy. Single choices from between Square Brackets.
- Choices, Choices, Choices.... - Exposing bigger choices: Err... Or.
- Filtering IP Addresses - Wherein I bore the reader to tears explaining how to RegEx for IP Addresses.
Perhaps you've been forcibly inducted into the Joy of Regular Expressions through the use of tools like Google Analytics. Unfortunately, while perfectly correct, the Google Analytics help for Regular Expressions is brief and does not explain the why of when to use X versus Y.
Hopefully the following article will get you through the why. I’m going to assume you’ve had at least some exposure to using Regular Expressions already.
What is a Regular Expression?
Wikipedia has a very good explanation. Where good equals "For Computer Scientists". So here are my attempts:
- Regular Expressions are a means by which IT people confuse non-IT people
- Regular Expressions are horribly cryptic
- Regular Expressions give the wrong answer
Hmmmm. That’s not really working is it? Accurate though….
Regular Expressions are a tool for finding "stuff".
"Stuff" can be pretty much anything you like. Spelling mistakes, numerical sequences, entries in web server log files, filters to use in Google Analytics. The list goes on and on.
You can wrap all sorts of funky and confusing terminology around what a Regular Expression is to achieve "precision", but that’s the gist of it.
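To make that concrete, here's a minimal sketch using Python's re module. The text and the pattern are invented for illustration; the point is simply "a tool for finding stuff":

```python
import re

# Some made-up text to search through.
text = "Error on line 12, warning on line 40, error on line 57"

# Find the "stuff": every number that follows the word "line".
print(re.findall(r"line (\d+)", text))  # ['12', '40', '57']
```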
How Do Regular Expressions work?
If you read the first paragraph of the Wikipedia entry you will have seen words like “string” and “syntax” and “grep”, and possibly even come across “metacharacters”. Ignore all that. It will not improve understanding, and is better used as an aid to sleep.
- Think of a jail.
- There are people in the jail.
- Outside the jail people live normal lives and do normal things.
- Inside the jail otherwise normal tasks mean something quite different.
- Exercising outside the jail means you can go for a walk around a lake, ride a bike in a large park, or run over hill and dale.
- Exercising in a jail means you are stuck in a small yard (Apologies for stereotyping).
So how can someone in jail become normal again and do normal things?
Regular Expressions work the same way. Instead of people, you have the various characters on your keyboard. ‘A’ or ‘B’ or ‘c’ or ‘z’ or ‘?’ or ‘*’. And yes, even the single and double quote characters.
But apparently life is not good in Regular Expression Land.
Which means that otherwise normal, harmless characters no longer mean simple things. A full stop, or dot character: '.' has a very different meaning when jailed. It means: match any single character. Sort of like a Joker in some card game variants. A wildcard.
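For example, a jailed dot in a three-character pattern (a Python sketch; the word list is made up):

```python
import re

# The jailed dot matches any single character, so "b.t" is a
# three-character wildcard pattern, not a literal "b.t".
words = ["bat", "bit", "but", "boat", "b.t"]
hits = [w for w in words if re.fullmatch(r"b.t", w)]
print(hits)  # ['bat', 'bit', 'but', 'b.t']
```

Note "boat" misses: the dot stands in for exactly one character, not several.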
I’d almost feel sorry for them, but they have an ace up both sleeves. Escaping from jail for a Regular Expression is easy. No dynamite or helicopters needed. Or ladders. All they need is to have a single backslash ‘\’ put in front of them and they’re FREE! Normal again.
Which means that a Regular Expression character of an asterisk: *
Becomes normal simply by the application of a backslash: \*
Now the analogy choice of a jail was deliberate, as “escaping” is the term used to describe this special-to-normal conversion. Remember it well; it will be used repeatedly.
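Here's a small Python sketch of jailed versus escaped characters (the strings are invented; raw strings keep the backslash intact on its way to the regex engine):

```python
import re

# An escaped dot only matches an actual dot...
print(bool(re.search(r"3\.14", "3.14")))  # True
print(bool(re.search(r"3\.14", "3x14")))  # False

# ...while a jailed (unescaped) dot matches any character at all.
print(bool(re.search(r"3.14", "3x14")))   # True

# Likewise, an escaped asterisk matches a literal '*'.
print(bool(re.search(r"2\*3", "2*3")))    # True
```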
Can you give me an example of how someone would do regular expressions and not use a backslash with a period?
Typically I would use a dot or period when I really don't care about the character. The data I’m matching against is pretty clean or even fixed. Personally, I mainly use the dot with filenames.
eg. The names of my log files follow this pattern: accessYYYY-WW-D.server.name.log.gz
- YYYY == Year
- WW == Week number
- D == Day of week
So to look at all logs in a single week (all seven of them), you would use a dot to wildcard the 'D'. And hence get: accessYYYY-WW-.\.server\.name\.log\.gz, to be perfectly precise.
But as the log names are very clean, with no variations, I can usually drop all the escaped dots, as they will only ever match an actual dot. Which also makes the Regular Expression easier to read: accessYYYY-WW-..server.name.log.gz
While not strictly necessary, I would advise leaving the first escaped dot in, as a means of highlighting the difference between those two adjacent dots. Two dots in a row do look like a mistake, and reducing confusion is a Good Thing(tm).
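Putting the week filter to work in a Python sketch, with invented file names that follow the accessYYYY-WW-D pattern (the server name and dates here are made up):

```python
import re

# Hypothetical log file names on the accessYYYY-WW-D.server.name.log.gz pattern.
logs = [
    "access2006-23-1.server.name.log.gz",
    "access2006-23-5.server.name.log.gz",
    "access2006-24-1.server.name.log.gz",
]

# Week 23 only: the day position is a wildcard dot, the first literal
# dot stays escaped, and the rest are left plain for readability.
week_23 = re.compile(r"access2006-23-.\.server.name.log.gz")
matched = [name for name in logs if week_23.match(name)]
print(matched)  # the two week-23 files
```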
A more complex example is one I use in awffull for matching FTP server log file lines:
- ^(... ... .. ..:..:.. ....) ([[:digit:]]+) ([[:digit:].]+) ([[:digit:]]+) ([^ ]+) ([ab]) ([CUT_]) ([oid]) ([ar]) ([^ ]+) ([^ ]+) () ([^ ]+) ([ci])
Where the start of a line is a date time stamp. An actual log entry looks like this:
- Fri Jun 2 02:22:13 2006 10 10.11.12.15 606221 /home/user/file.tgz b _ o r user ftp 0 * i
So the bit: (... ... .. ..:..:.. ....)
Matches against: Fri Jun 2 02:22:13 2006
Watch for the extra space between "Jun" and "2".
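A Python sketch of just that timestamp chunk (note the two spaces before the single-digit day; the second '..' swallows one of them):

```python
import re

line = "Fri Jun  2 02:22:13 2006 10 10.11.12.15 606221 /home/user/file.tgz b _ o r user ftp 0 * i"

# Three chars, three chars, two chars, HH:MM:SS, four-char year.
stamp = re.match(r"(... ... .. ..:..:.. ....)", line)
print(stamp.group(1))  # Fri Jun  2 02:22:13 2006
```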
The reason for not caring about the date time stamp? Well, date/time calculations are expensive when you do lots of them. In this case it makes more sense to keep the Regular Expression simple, and let a specialised chunk of code deal with the date time stamp.
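Handing the captured stamp to specialised code might look like this sketch, using Python's standard time module (the stamp string is the sample from the log line above):

```python
import time

# The timestamp the Regular Expression captured as one blob...
stamp = "Fri Jun  2 02:22:13 2006"

# ...parsed by purpose-built date code instead of more regex work.
parsed = time.strptime(stamp, "%a %b %d %H:%M:%S %Y")
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)  # 2006 6 2
```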
Credits and Misc
If you're really keen to learn the nitty gritty of Regular Expressions and amaze friends and family with your l33t skills, get yourself a copy of "Mastering Regular Expressions" by Jeff Friedl.
You will not regret it. The first 4 chapters are about all you need to read. The first 6 help that much more.
This article, and hopefully others to come, is principally based on an email conversation I had with Robbin Steif from Lunametrics. Robbin's blog is a great read about the business side of Web Analytics. Lots of little tips and tricks that are the sort of common sense that isn't all that common.