So You Want to Learn Regular Expressions? Part 2: Wild Cards
The full list of Regular Expression Articles I've done:
- What and How? - Wherein we look at What a Regular Expression actually is, and How they basically work.
- Wild Cards - A starting look at the use of wildcards. Ways to find lots of stuff.
- Positioning - Finding stuff is good. Being able to specify where stuff may be found is better.
- More Wild Cards - More ways to find lots of stuff.
- Just Like a Box of Chocolates - The Ultimate Analogy. Single choices from between Square Brackets.
- Choices, Choices, Choices.... - Exposing bigger choices: Err... Or.
- Filtering IP Addresses - Wherein I bore the reader to tears explaining how to RegEx for IP Addresses.
In the previous article (So You Want to Learn Regular Expressions?) I hopefully managed to explain the underlying concept of using regular expressions via the analogy of a jail.
In this article we'll start to explore the use of wild cards: what they are, when to use them, and more importantly, when NOT to.
When NOT to use Wild Cards
Yes. You see, a regular expression will usually have two implied wild cards.
Suppose you're looking for the word jail. And suppose that you're looking for all instances in my previous article.
So what would you search for?
Why not just plain old: jail ???
Well as it turns out, you can! What's more, based on my previous article, you'd even get the right answer. A Regular Expression does not require any funky character combinations to be a Regular Expression.
It works by virtue of the two implied wild cards.
A wild card will match "stuff". Lots of stuff usually, but it doesn't have to. So obviously to match jail we need to ignore anything before or after the 4 characters that make up jail. Our implied wild cards: One before and One after.
So we have: (wild card)jail(wild card)
You can see the obvious problem, can't you? We wanted the WORD jail. A regular expression of just "jail" will also match jailbird or jailbreak and so on. Now as it turns out, Regular Expressions do provide a facility to specify a WORD, but let's ignore it for now. That it exists is all we need to know at this point.
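Here's a minimal sketch of those implied wild cards in action. It assumes Python's `re` module, which is my choice for illustration and not something this article otherwise uses, and the word list is made up. `re.search` looks for the pattern anywhere in the text, so whatever sits before or after "jail" is ignored:

```python
import re

# Hypothetical sample words, chosen only for illustration.
words = ["jail", "jailbird", "jailbreak", "unjailed", "gaol"]

# re.search scans the whole string, so "stuff" before or after "jail" is ignored.
matches = [w for w in words if re.search("jail", w)]
print(matches)  # ['jail', 'jailbird', 'jailbreak', 'unjailed']
```

Note that jailbird, jailbreak and unjailed all get through, which is exactly the "we wanted the WORD" problem.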
How To Use Wild Cards
Now as to wild cards themselves? Let's look at the two most common wild cards: ".*" (dot-star) and ".+" (dot-plus).
The dot we've seen before. It means, keeping in mind that it hasn't escaped the jail, Any Single Character. The use of a star or a plus that hasn't escaped jail is a shorthand way of applying a multiplier to the dot. As in, a method of saying that there can be several "any characters", we just don't know how many.
And this is where the principal difference between the star and the plus arises. The plus is more obvious: it means One or More.
So "jail.+" could be "jail." or "jail.." or "jail................". But it can't be "jail". The plus is a multiplier, but it can't multiply by zero, only by one or more. Being able to multiply by zero? Well that's what the star is for.
A star means Zero or More.
So use of ".*" will match more than ".+" which is also why you'll usually see the use of ".*" and not ".+".
Oh yes. More. When you put a wild card of your own where an implied one would be, yours takes over: the implied wild card no longer applies at that spot.
Suppose you have the un-punctuated text:
|
If you have a regular expression of "jail" you'll get a match. If you have "jail.+" you won't. Why? Because you have STATED that there is at LEAST one character after "jail". But we've reached the end of the text and there is no more. Hence the use of ".+" is unable to successfully match.
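To make the star versus plus distinction concrete, here's a small sketch. Python's `re` module, the helper name `matched_text` and the sample sentences are all my own assumptions for illustration:

```python
import re

def matched_text(pattern, text):
    """Return the part of text that the pattern matched, or None if nothing matched."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Hypothetical un-punctuated text that ends in "jail".
text = "he was sent to jail"

print(matched_text("jail", text))    # jail
print(matched_text("jail.+", text))  # None: the plus demands at least one more character
print(matched_text("jail.*", text))  # jail: the star is happy with zero characters

# Give the plus something to match and it succeeds.
print(matched_text("jail.+", "he was sent to jail today"))  # jail today
```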
Hopefully the distinction is now clear. If not, email me, send feedback or comment at the end of the article and I'll do my best to better explain.
When To Use Wild Cards
When? When you want to ignore stuff.
You know the start of something, you know the end, but you want to ignore a portion from the middle.
OK, silly example time. Suppose we wanted to find all words containing "jail" in my local dictionary. Using a Unix command for doing Regular Expression matches against text, we would use a regular expression of "jail". Thus the command used is:
grep "jail" /usr/share/dict/words
grep is a Unix command for finding things, /usr/share/dict/words is the file to search through, and "jail" is our regular expression, wrapped in double quotes. A bit of history: grep actually stands for "Global Regular Expression Print". Everywhere in this data (Global), search for (Regular Expression) and display the matches (Print). Scroll down to 1968 in "The Most Important Software Innovations" for a quick overview of grep and Regular Expressions and their place in history.
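If you don't have grep handy, the same idea can be sketched in a few lines of Python. This is a bare-bones stand-in, not how grep is actually implemented. The function name `grep` and the tiny word list are my own assumptions, standing in for /usr/share/dict/words:

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, like a bare-bones grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical stand-in for the dictionary file: one word per line.
words = ["bail", "jail", "jailbird", "jailer", "jails", "rail"]

print(grep("jail", words))  # ['jail', 'jailbird', 'jailer', 'jails']
```

For a pattern as simple as this one, Python's regular expressions and grep's behave the same way. The two dialects do differ once you get into fancier syntax.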
Back to the lesson: Now, suppose we wanted all words in the dictionary that BEGIN with jail AND end in an "s".
So let's build that up:
Begin with jail: jail
And end in an "s": s
But there's "stuff" in the middle we want to ignore. Could be 1 character (jail's), could be no character (jails). So we would use? Yup. dot star: ".*"
Which gives us:
jail.*s
So let's run with that now:
grep "jail.*s" /usr/share/dict/words
As Bob the Builder says: "FanTASTIC!" Pity you can't type in an English accent....
I should perhaps explain. As I type this, our 3-year-old son is singing the "Bob The Builder" theme song at the top of his voice. In an English accent. Check this web site's country TLD. :-)
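The build-up above (start, stuff to ignore, end) can be sketched like this. Again I'm assuming Python's `re` module, and the word list is a made-up sample, not the real dictionary:

```python
import re

start = "jail"   # what we know the word begins with
middle = ".*"    # the "stuff" in the middle we want to ignore
end = "s"        # what we know the word ends in
pattern = start + middle + end  # "jail.*s"

# Hypothetical sample words, one per entry like the dictionary file.
words = ["jail", "jail's", "jailbirds", "jailer", "jailers", "jails"]

hits = [w for w in words if re.search(pattern, w)]
print(hits)  # ["jail's", 'jailbirds', 'jailers', 'jails']
```

Strictly speaking, the pattern only says that an "s" turns up somewhere after "jail". It works here because each entry is a single word.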
Gotchas!
Now two things to be wary of when using wild cards. In the above examples, searching through a dictionary for jail, we were searching through a very simple list of 212710 words, one per line. We have a great degree of predefined control over what will and won't match. In most real-life situations things are not quite so neat. This is where wild card usage can produce very incorrect results.
Expanding our earlier sentence:
|
Suppose we wanted to find all words starting with jail and ending in an s again. As to why we would do that? I have no idea. :-)
In this case we could expect that the Regular Expression of "jail.*s" should fail. jail is followed by a space. There is no 's'. Unfortunately the Regular Expression is more literal. The dot matches a space as happily as a letter, so it just keeps right on truckin' till it hits the end of the line or finds an 's'. And in this case, there's one inside Brisbane.
So it actually will match against this fragment:
|
Which is obviously not what we wanted.
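Here's a quick sketch of that gotcha. The sentence below is a hypothetical stand-in of my own (the only details taken from the example above are the space after jail and the word Brisbane), and I'm assuming Python's `re` module:

```python
import re

# Hypothetical sentence: "jail" followed by a space, with an "s" later on in "Brisbane".
text = "he was sent to jail in Brisbane"

m = re.search("jail.*s", text)
print(m.group())  # jail in Bris
```

The wild card has happily swallowed the spaces and the word "in" on its way to the "s".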
The other issue to be wary of is that the use of wild cards forces the computer doing the Regular Expression to work a bit harder and, technically, less efficiently. You may never notice on simple things, but if you want to do tens of thousands of checks a second, don't use wild cards if you can possibly avoid it.
Credits, Misc and Up Next!
Here Endeth The Second Lesson. As ever: criticism, questions, comments, feedback, thanks and/or effusive praise are all welcome.
Again, this article was heavily inspired by an ongoing email discussion with Robbin Steif from Lunametrics.
In the next instalment, we'll be looking at setting positions. Start and End of line: "^" and "$".


