So You Want to Learn Regular Expressions? Part 5: Just Like a Box of Chocolates

The full list of Regular Expression Articles I've done:

"Just Like a Box of Chocolates"?? Yeah. Pretty cool analogy, isn't it! Just wait! :)

I hope we're all familiar with being offered a choccy from a box of chocolates. Pick one, and one only. But it can be any one of the myriad choices arrayed before you.

Well those clever Regular Expressions supply a tasty Box of Chocolates as well.

The Box of Chocolates

Just as with our box, you can pick only one of the various choices in this particular expression. Doesn't sound too useful, does it? I mean, I tend to eat an entire tray of chocolates on my own. :)

Psst: Don't forget about wildcards, they make an appearance later on.

So how's it done?
The magical square brackets are the plastic and cardboard that form our box. The characters between the '[' and the ']' are the chocolates. Some of those chocolates can be a bit special. You could even think of the box as being another inner jail - a high security jail inside the jail that makes up our Regular Expression.

An Example?

Why not. Suppose you have a series of pages named somewhat unimaginatively: page1.html, page2.html, page3.html and so on up to page9.html.

Suppose you want to only see these particular pages in a report?

Easy! We'd just use:

page.\.html

BUT! You also have pages named: pageA.html, pageB.html and so on. And they're sufficiently different from the numbered pages that you don't want to include them.
Hmmm. Not so easy any more. We need a way of specifying the numbers from 1 to 9 only. Enter the Box of Chocolates!

page[123456789]\.html

Pick one, and only one, of the characters from 1 to 9. Pick one, and only one, of the chocolates from our Box of Chocolates.
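If you'd like to try this at home, here is a minimal sketch using Python's re module as a stand-in for whatever tool you feed your expressions to (that choice, and the list of page names, are assumptions for illustration only). It compares the dot version with the Box of Chocolates version.

```python
import re

# Hypothetical page names, based on the example in the text.
pages = ["page1.html", "page5.html", "page9.html", "pageA.html", "pageB.html"]

dot_pattern = re.compile(r"page.\.html")            # '.' accepts any single character
box_pattern = re.compile(r"page[123456789]\.html")  # exactly one of 1 to 9

# fullmatch() insists that the whole name fits the expression.
dot_hits = [p for p in pages if dot_pattern.fullmatch(p)]
box_hits = [p for p in pages if box_pattern.fullmatch(p)]

print(dot_hits)  # every page, lettered ones included
print(box_hits)  # only the numbered pages
```

The dot lets pageA.html and pageB.html through, while the box keeps them out.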

Over the Range we go

Typing 1 thru 9 isn't too bad. You can generally work that along a keyboard fairly quickly. But what about A to Z? Painful on a QWERTY keyboard. And you might miss one or two, which could be a disaster.

Fortunately our Box of Chocolates has a trick. A nice easy way to specify a range. Just like I typed A to Z and you all understood, we can use A-Z within the square brackets to get the same meaning. This also works on numbers and the lower case A to Z. Or rather a to z.

So our page example becomes:

page[1-9]\.html

Now suppose we needed a new page, numbered zero: page0.html. How does that fit in?

Same as 1-9. We just start counting at zero: 0-9

page[0-9]\.html
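A quick sanity check of the two ranges, again sketched with Python's re module as an assumed stand-in for your own tool. The variable names are made up for the example.

```python
import re

one_to_nine = re.compile(r"page[1-9]\.html")
zero_to_nine = re.compile(r"page[0-9]\.html")

# page0.html falls outside 1-9 but inside 0-9.
zero_in_narrow = one_to_nine.fullmatch("page0.html") is not None
zero_in_wide = zero_to_nine.fullmatch("page0.html") is not None

# Lettered pages stay out of both ranges.
letter_in_wide = zero_to_nine.fullmatch("pageA.html") is not None

print(zero_in_narrow, zero_in_wide, letter_in_wide)  # False True False
```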

Now about those wildcards?

Ah yes. Like anything else in a Regular Expression, the Box of Chocolates can have a wildcard applied to it. So you can have an optional character, zero or more characters, one or more characters, and so on.

Suppose we suddenly grew our website. And instead of pages 0 to 9, we now go from 0 to 563,138.

The simplest way of capturing this new set of pages is keeping in mind that:
a. We only have numbers
b. We must have ".html" to finish with
c. We must have "page" to start with
d. We must have at least one digit and, currently, at most 6 digits.

Which means we can have a Regular Expression like so:
page[0-9]+\.html

1 or more (the plus) of Zero Thru Nine ([0-9]).

A slightly more complex answer to capture the requirement that we have at most 6 digits would be:
page[0-9]{1,6}\.html

Here we make use of the Counting wildcard I mentioned in the previous article on wildcards.

One trick to be aware of with wildcards and the square brackets: the wildcard applies to the entire box of chocolates, as it were, not to the last chocolate in it. So [abc]+ will match all of aaa, all of abc, the lone c in the middle of defcghi, and so on. This particular one is more like me with a box of chocolates. Pick 1 or more chocolates until you've had enough. Hmmmm. Yummy! :)
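The wildcard examples above can be sketched like this, assuming Python's re module rather than any particular reporting tool. The page names are invented to sit on either side of the 6 digit limit.

```python
import re

any_digits = re.compile(r"page[0-9]+\.html")     # one or more digits
up_to_six = re.compile(r"page[0-9]{1,6}\.html")  # between one and six digits

# Hypothetical names: 1 digit, 6 digits, 7 digits, and no digits at all.
names = ["page7.html", "page563138.html", "page1234567.html", "page.html"]

any_hits = [n for n in names if any_digits.fullmatch(n)]
six_hits = [n for n in names if up_to_six.fullmatch(n)]

# The plus applies to the whole box: search() returns the first run of
# characters drawn from a, b and c, wherever it sits in the text.
runs = [re.search(r"[abc]+", s).group() for s in ["aaa", "abc", "defcghi"]]

print(any_hits)
print(six_hits)
print(runs)  # ['aaa', 'abc', 'c']
```

The seven digit page slips past the plus version but is stopped by {1,6}, and page.html fails both because at least one digit is required.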

More Mountains.. Err... Ranges

So we now know about: 0-9, A-Z, a-z for doing ranges. And by extrapolation we can have: [3-7] or [f-n] or join several together: [a-zA-Z0-9]. They're all pretty obvious. There are others. Quite a few others.

If you're really keen, you can grab a copy of the actual POSIX standard from The Open Group: The Single UNIX Specification, Version 3 (SUSv3).

The actual section within SUSv3 that you want is the Base Definitions: Chapter 9, Section 9.3.5.

Yes, I do have a copy of the full SUSv3 standard, and yes, I do refer to it on a regular basis. No, I don't believe you would need a copy of the full thing yourself. :)

The more common ones you're likely to see and use are shorthand for common groupings.
Where you would use [0-9A-Za-z], you could instead use [[:alnum:]]. alnum? Alphanumeric!
Instead of [0-9]? Use [[:digit:]].
Instead of [A-Za-z]? Use [[:alpha:]]. alpha? Alphabetic!

The one I use a lot, [[:space:]], is very useful. It matches spaces and tabs, along with line breaks and the other less common spacing characters. Generically called "white space".

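Note 1: All of these are double wrapped inside the Box of Chocolates. The actual chocolate is [:alnum:] or [:digit:]. You can't use page[:digit:]\.html, as you haven't put the chocolate in its box. Depending on the tool, that will either be rejected or be read as an ordinary box holding the characters ':', 'd', 'i', 'g' and 't'. And who knows where it's been. Yukky.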

But you can use:
page[[:digit:]]{1,6}\.html
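One caveat if you experiment in Python: its re module does not understand the [:name:] groupings, so the sketch below rewrites them into the plain ASCII ranges listed above before compiling. The expand_posix helper and its table are hypothetical, invented for this example, and the naive text replacement assumes the class names only ever appear inside a box.

```python
import re

# ASCII spellings of the groupings named in the text (an assumption:
# other locales may add accented letters and the like).
POSIX_CLASSES = {
    "[:alnum:]": "0-9A-Za-z",
    "[:digit:]": "0-9",
    "[:alpha:]": "A-Za-z",
    "[:space:]": r" \t\n\r\f\v",
}

def expand_posix(pattern):
    """Hypothetical helper: swap each POSIX class name for its ASCII range."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern

expanded = expand_posix(r"page[[:digit:]]{1,6}\.html")
print(expanded)  # page[0-9]{1,6}\.html

print(bool(re.fullmatch(expanded, "page563138.html")))   # True
print(bool(re.fullmatch(expanded, "page1234567.html")))  # False
```

Tools that follow the POSIX rules accept the double wrapped form directly, so a helper like this is only needed where the engine lacks it.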

The big advantage of using these special groupings is that, depending on your locale settings, they will also include special variants of letters where appropriate: umlauts, accents and so on. That isn't so useful to us here and now, but do be aware of it.

Note 2: When using normal ranges or indeed a series of characters, the order within the Box of Chocolates does not matter. [abc] is the same as [cba]. [A-Za-z0-9] is the same as [0-9A-Za-z]. I would advise that you try and be consistent in how you order things though. It'll make your life a lot easier in the long run.
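Note 2 is easy to check for yourself. A minimal sketch, assuming Python's re module and a made-up sample string: both orderings pick out exactly the same characters.

```python
import re
import string

# Sample text: every ASCII letter and digit plus a few other characters.
sample = string.ascii_letters + string.digits + " -_."

same_ranges = re.findall(r"[A-Za-z0-9]", sample) == re.findall(r"[0-9A-Za-z]", sample)
same_list = re.findall(r"[abc]", sample) == re.findall(r"[cba]", sample)

print(same_ranges, same_list)  # True True
```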

The Secure Jail?

I mentioned that our Box of Chocolates is a more Secure Jail. You can't actually escape from it, but you can be special. There are two special prisoners you really should know about.

One you may have already noticed. The hyphen: -
It signifies a range: A-Z. But if you want to match the hyphen itself, you need to put it where it can't be read as a range. First (or last) in the box does the job, like so: [-A-Z]
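Here is the hyphen at work, sketched with Python's re module as an assumed stand-in. The sample text is invented for the example.

```python
import re

text = "ABC-DEF"

# With the hyphen first in the box, it is just another chocolate.
with_hyphen = re.fullmatch(r"[-A-Z]+", text)

# Without it, the hyphen in the text breaks the run of capitals.
without_hyphen = re.fullmatch(r"[A-Z]+", text)
pieces = re.findall(r"[A-Z]+", text)

print(with_hyphen is not None)     # True
print(without_hyphen is not None)  # False
print(pieces)                      # ['ABC', 'DEF']
```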

The other special character you should know about is the caret. ^
Now previously we've seen that caret means "start of the line". Within the Box of Chocolates it has a very different meaning. Just to confuse you all.

It means "not" or "negation" or "opposite".

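i.e. [abc] would match against any line with an 'a' or a 'b' or a 'c' in it.
[^abc] would match any line containing at least one character that isn't an 'a', a 'b' or a 'c'. Note that this is not the same as a line with no 'a', 'b' or 'c' in it: "abcd" still matches, thanks to the 'd', while "abcabc" does not.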

One use for ^ within the Box of Chocolates that I reach for frequently is matching a run of everything that isn't a space character: [^ ]+
This is by no means the only way of doing this, but in many circumstances it can be the best way. YMMV!
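A short sketch of negation in practice, assuming Python's re module. The request line is a made-up example of the sort of space separated text you might want to pull apart.

```python
import re

# [^abc] needs one character that is NOT a, b or c.
found_in_abcd = re.search(r"[^abc]", "abcd")
found_in_abcabc = re.search(r"[^abc]", "abcabc")

print(found_in_abcd.group())  # d
print(found_in_abcabc)        # None

# [^ ]+ grabs each run of non-space characters.
line = "GET /page1.html HTTP/1.1"
fields = re.findall(r"[^ ]+", line)

print(fields)  # ['GET', '/page1.html', 'HTTP/1.1']
```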


In the next article we will focus on a way of picking between options that are larger than a single character.
