So You Want to Learn Regular Expressions? Part 4: More Wildcards
The full list of Regular Expression Articles I've done:
- What and How? - Wherein we look at What a Regular Expression actually is, and How they basically work.
- Wild Cards - A starting look at the use of wildcards. Ways to find lots of stuff.
- Positioning - Finding stuff is good. Being able to specify where stuff may be found is better.
- More Wild Cards - More ways to find lots of stuff.
- Just Like a Box of Chocolates - The Ultimate Analogy. Single choices from between Square Brackets.
- Choices, Choices, Choices.... - Exposing bigger choices: Err... Or.
- Filtering IP Addresses - Wherein I bore the reader to tears explaining how to RegEx for IP Addresses.
In this instalment of this series on Regular Expressions, I'll expose a wee lie from part 2, and show how wildcards can be less wild. More controlled. And hence more useful.
A LIE????
Urm. Yes. Not to put too fine a point on it. A school/teaching progression type of lie. You see, a ".*" construct isn't actually a wildcard. The asterisk is the wildcard. All on its very own. Similarly the plus "+" in ".+".
The trick with these two, and other wildcards, is how they apply to the previous Regular Expression.
| Technically a dot character, '.', could be considered a wildcard in its own right. It will match any single character. It's useful to use a dot when you really don't care about what you're matching, beyond a generic "something". |
A wildcard, in the way we're going to use them, is a multiplier against the previous Regular Expression. A counter. How many "things" do we want.
A Regular Expression can be a single character like an "A" or "m" or something more complicated. So what do you think we would get if we had a Regular Expression of: A*
Well the "A" itself is significant. How does the asterisk alter it? Recall: Zero or More. So this Regular Expression means we have "A" or "AAAA" or "AAAAAAAAAAAAAAAAAAA" or "". Nothing. Zero x Anything == Zero.
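To see the "zero or more" behaviour for yourself, here's a minimal sketch using Python's re module. The choice of Python and the test strings are my own assumptions for illustration; any Regular Expression engine should behave much the same way.

```python
import re

# "A*" means: zero or more "A" characters.
pattern = re.compile(r"A*")

# fullmatch() only succeeds when the whole string fits the pattern.
for candidate in ["", "A", "AAAA", "AAB", "B"]:
    print(repr(candidate), bool(pattern.fullmatch(candidate)))

# search() is happy with a match of zero characters, so "A*" finds
# an empty match even in a string that contains no "A" at all.
empty_hit = pattern.search("B")
print(repr(empty_hit.group()))
```

That last empty match is the "Zero x Anything == Zero" case, and it's a big part of why "A*" on its own isn't much use.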
Pretty obviously "A*" isn't terribly useful in any pattern I can think of. It's useful when you apply these wildcards against the more complex Regular Expressions. Very useful. But I'm starting to jump ahead of myself.
1 2 3 Counting!
There may be some obscure variants I can't think of right now, but pretty much every other wildcard you will come across does one thing and one thing only: it determines how many of the preceding Regular Expression will be used to match something. Like our AAAAAAAAA example.
One wildcard you will see and use is closely related to the asterisk, and that is the question mark: "?"
Where * means 0 or more, ? means 0 or 1.
The best way to see this is via an example. Suppose you want to find all JPEG images. Most people simply use file names like image.jpg. But some also use image.jpeg.
As you can quickly see, the "e" in jpeg can be either there, or not there. 1 or 0. 0 or 1.
So a simple regular expression to match all JPEG files is: \.jpe?g
\.jp - this is fixed (the backslash makes the dot a literal dot, not "any character")
e - maybe, maybe not
g - again fixed.
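Here's a small sketch of that pattern at work, again in Python. The file names are made up for the illustration, and I've added a "$" anchor (from the Positioning article) so the match has to sit at the end of the name. That anchor is my addition, not part of the pattern above.

```python
import re

# Hypothetical file names, invented for this example.
names = ["image.jpg", "image.jpeg", "image.jpeeg", "image.png", "jpg_notes.txt"]

# \.  a literal dot
# jp  fixed
# e?  zero or one "e"
# g   fixed
# $   end of the name (an assumption added for this sketch)
jpeg = re.compile(r"\.jpe?g$")

matches = [name for name in names if jpeg.search(name)]
print(matches)
```

Note that "image.jpeeg" is rejected: the "?" allows one "e" at most, not two.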
Beyond "Many"
The last wildcard you're likely to use is the generic counting one. Currently we can choose Zero or One or Many. But what if we wanted Three or Twelve or Thirty Five "somethings"?
If it's three, you could simply repeat the same match three times. But doing it 35 times could be painful.
Now I picked 35 for a reason. It just so happens to be the number of characters that are used in the identifying part of many of the URLs on a website I look after.
Like so: 12B3C2F0-9D8E-9A99-7A1181C5F546F938
A generic match using .* or .+ wouldn't be very useful. We'd match too much and hence get an incorrect result. While not perfect, a more accurate method would be to specify: At least 35 characters and a maximum of 35 characters. Which is just how it looks:
.{35,35}
The first 35 specifies the minimum, the second the maximum. The use of the squiggly braces specifies that this is a counting wildcard.
Again, more useful when you start to combine with more complex Regular Expressions.
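A quick sketch of the counting wildcard, using the sample identifier from above. Python is my choice for the illustration. When the minimum and maximum are the same, most engines (Python included) also accept the shorter ".{35}" with the same meaning.

```python
import re

# The sample identifier shown earlier in this article.
identifier = "12B3C2F0-9D8E-9A99-7A1181C5F546F938"

exactly_35 = re.compile(r".{35,35}")  # minimum 35, maximum 35
shorthand = re.compile(r".{35}")      # same meaning, shorter to type

print(len(identifier))
print(bool(exactly_35.fullmatch(identifier)))
print(bool(shorthand.fullmatch(identifier)))

# One character too few or too many and the whole-string match fails.
too_short = identifier[:-1]
too_long = identifier + "X"
print(bool(exactly_35.fullmatch(too_short)))
print(bool(exactly_35.fullmatch(too_long)))
```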
|
As a teaser for those who want a challenge, a more accurate Regular Expression would be: [-A-F0-9]{35,35} The most accurate would be: Which will also show how those counting wildcards come in useful. The URL identifier is of the form: 8-4-4-16, so that's what we specify. (8 + 4 + 4 + 16 + 3 x 1 = 35). Note I said accurate and not best. That was deliberate. Best is highly dependent on your goals and even application. I've personally used all three of the examples above when parsing our logs and data files. |
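For anyone taking up the challenge, here's a sketch comparing the character-set pattern with one possible way of spelling out the 8-4-4-16 layout. The second pattern is my own reading of that description, an assumption for illustration rather than the exact expression used on the site.

```python
import re

# The sample identifier shown earlier in this article.
identifier = "12B3C2F0-9D8E-9A99-7A1181C5F546F938"

# From the teaser: 35 characters, each a dash, A-F, or a digit.
by_charset = re.compile(r"[-A-F0-9]{35,35}")

# Assumed spelling of the 8-4-4-16 layout (hypothetical, see lead-in).
by_layout = re.compile(r"[A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{16}")

print(bool(by_charset.fullmatch(identifier)))
print(bool(by_layout.fullmatch(identifier)))

# 35 dashes use only allowed characters, so the character-set
# pattern accepts them. The layout pattern does not.
dashes_only = "-" * 35
print(bool(by_charset.fullmatch(dashes_only)))
print(bool(by_layout.fullmatch(dashes_only)))
```

The stricter pattern rejects junk that merely has the right length and characters, at the cost of being longer to read and type. Which trade-off is "best" depends on what you're doing with the result.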
In the next article we'll start to look at the various forms of grouping and choosing between alternate ... choices. 
Till then!


