
So You Want to Learn Regular Expressions? Part 7: Examples: IP Addresses

The full list of Regular Expression Articles I've done:

In this article I'm going to take you a little through the method and madness of creating regular expressions for filtering or identifying IP Addresses and Ranges.

Why IP Addresses?

It demonstrates and combines the concepts explored in the previous articles, hopefully casts some illumination on the method of solving Regular Expression problems and, highly coincidentally, shows how to filter your corporate network from your Google Analytics stats.

An IP Address

An IP Address is, essentially, the "street address" for a computer on the Internet. If you don't have one, you may find your Internet experience... limited. Being the address, it can be an effective and simple way of filtering certain groups of computers from your analysis, e.g. removing all your corporate traffic from the stats for the corporate website.

IP Addresses have a particular form, much like a normal street address:


number.number.number.number

Where the number can be pretty much any number from 0 to 255.
So these are ok:


1.2.3.4    10.10.10.10   230.245.250.131

But these wouldn't be:


1023.10.1.2    256.1.2.3    230.245.270.131

Pre-Analysis

My choice of the OK examples was quite deliberate. They highlight the major differences between the styles of IP Addresses. So let's note the immediate observations:

  • 4 numbers, each separated by a single dot
  • 3 dots in total
  • each number can be from 1 to 3 digits in length
  • each number can be from 0 to 255

The Lead-In

So how can we build a Regular Expression to match a generic IP Address?
The process I use is: lock on what's fixed, and then start to handle the changing portions.

So we know we have 3 dots:


  \. \. \.

Keeping in mind that we have to escape the dots, as a dot is one of our special jailed characters and needs to escape!

Technically we could just about finish here and have a working and correct solution. The obvious solution is to put general ".*" wildcards between the dots. Like so:


  .*\..*\..*\..*

This will work. I would strongly urge you to keep reading though. Why? The solution as presented is very loose. It can easily match things you'd probably rather it didn't. Regular Expression writing is a trade-off between being overly picky and not picky enough. It's probable that only when you get it wrong a few times will the line between the two become clearer.

Next up? A number. Now, let's start with the 3rd observation: 1 to 3 digits long.
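To see just how loose that is, here is a minimal sketch using Python's `re` module as a test bench (an assumption on my part; the article itself targets GA filters, and the sample strings are made up for illustration). Anything with three dots in it gets through:

```python
import re

# The loose pattern from the text: anything, dot, anything, dot, anything, dot, anything.
LOOSE = re.compile(r".*\..*\..*\..*")


def loose_match(text: str) -> bool:
    """True when the whole string fits the loose pattern."""
    return LOOSE.fullmatch(text) is not None


# Hypothetical sample strings: one real address and several things that are not.
samples = ["10.10.10.10", "...", "a.b.c.d", "see www.example.co.uk now", "10.10.10"]
for sample in samples:
    print(repr(sample), loose_match(sample))
```

Only the last sample, which has two dots, is rejected.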

  • A single digit: [0-9]
  • 1 or 2 or 3 digits?

Can be done several ways:


[0-9]|[0-9][0-9]|[0-9][0-9][0-9]

Which, while correct, is obviously going to be a pain to type out in the complete Regular Expression, and hence error prone. Not a Good Idea(tm).

Recall that we can specify a minimum and maximum number of items with curly brackets: {min,max}

So that would give us:


[0-9]{1,3}

So if we join this fragment with our dots we get:


[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

Or to make it a little easier to read, do what I do and break it down over several lines to significantly improve clarity:


[0-9]{1,3}
\.
[0-9]{1,3}
\.
[0-9]{1,3}
\.
[0-9]{1,3}

Just remember to glue it all back together when entering into GA or your favourite tool.
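If you happen to be testing in Python rather than GA, the gluing can be done for you. The sketch below is only an illustration under that assumption: `PIECES`, `GLUED` and `VERBOSE` are names I made up, and `re.VERBOSE` is a Python feature that GA's filter box does not share.

```python
import re

# The line-by-line layout from the text, one fragment per line.
PIECES = [
    r"[0-9]{1,3}",
    r"\.",
    r"[0-9]{1,3}",
    r"\.",
    r"[0-9]{1,3}",
    r"\.",
    r"[0-9]{1,3}",
]

# Glue the fragments back into the single-line form a tool like GA expects.
GLUED = "".join(PIECES)

# In Python, re.VERBOSE ignores whitespace in the pattern and allows comments,
# so the broken-down layout can be compiled as it stands.
VERBOSE = re.compile(
    r"""
    [0-9]{1,3}   # first number
    \.           # literal dot
    [0-9]{1,3}   # second number
    \.
    [0-9]{1,3}   # third number
    \.
    [0-9]{1,3}   # fourth number
    """,
    re.VERBOSE,
)

print(GLUED)
print(bool(re.search(GLUED, "230.245.250.131")))
print(bool(VERBOSE.search("230.245.250.131")))
```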

So what's the obvious problem here?

It will match IP Addresses like:


999.999.999.999
002.003.345.123

Which are invalid.
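A quick way to convince yourself, sketched in Python (my choice of test bench, not the article's): compare what the pattern accepts with a plain numeric check. The helper names `pattern_accepts` and `in_range` are hypothetical.

```python
import re

PATTERN = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")


def pattern_accepts(address: str) -> bool:
    """True when the whole string fits the 1-to-3-digits pattern."""
    return PATTERN.fullmatch(address) is not None


def in_range(address: str) -> bool:
    """True when every number is between 0 and 255 (assumes digits and dots only)."""
    return all(0 <= int(part) <= 255 for part in address.split("."))


for address in ["1.2.3.4", "999.999.999.999", "002.003.345.123"]:
    print(address, pattern_accepts(address), in_range(address))
```

The pattern only counts digits; it knows nothing about the value of each number.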

Next question: Does it matter?

This is where you need to make that judgement call. If you know a little bit about web servers and TCP/IP and Networking, you'll rapidly realise that: No. It doesn't matter. Why?

Two Reasons:

  • Because for all intents and purposes it's impossible to actually have such a dud IP address recorded in a log. And...
  • Because it's impossible that you would be trying to filter out a dud IP address that matches your corporate network.

But Wait! There's More!

Those who've taken the lessons to heart will realise that you can simplify the Regular Expression even further:


([0-9]{1,3}\.){3,3}[0-9]{1,3}

I wouldn't.

Why?

You are starting to lose clarity. It's not immediately obvious what this RegEx does and is for. And that can be just as big a danger to Getting It Right.
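For what it's worth, the short and long forms do behave the same. A small sketch, assuming Python's `re` module and a handful of sample strings I picked for illustration:

```python
import re

LONG = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
SHORT = re.compile(r"([0-9]{1,3}\.){3,3}[0-9]{1,3}")

# Hypothetical samples: some valid-looking, some not.
SAMPLES = [
    "1.2.3.4",
    "10.10.10.10",
    "230.245.250.131",
    "1023.10.1.2",
    "1.2.3",
    "1.2.3.4.5",
    "a.b.c.d",
]


def agree(text: str) -> bool:
    """True when both forms give the same accept/reject answer for the whole string."""
    return (LONG.fullmatch(text) is None) == (SHORT.fullmatch(text) is None)


for sample in SAMPLES:
    print(sample, agree(sample))
```

So the choice between them is about readability, not behaviour.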

Anything else we can do?

Sure! We know two more pieces of information that can be used.

  • The left hand side is at the start
  • and the right hand side is at the end

of the IP Address field.

Recall the caret and dollar?
So this gives us:


^
[0-9]{1,3}
\.
[0-9]{1,3}
\.
[0-9]{1,3}
\.
[0-9]{1,3}
$

Will that help? The use of the caret to define the start sure will, as we'll soon see. Not so much the dollar.
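Here is a rough illustration of what the anchors buy you, again assuming Python's `re` module as the test bench. Without them, the pattern is free to match part of a longer string, including one of the "bad" examples from earlier:

```python
import re

NUMBER = r"[0-9]{1,3}"
BODY = r"\.".join([NUMBER] * 4)  # number\.number\.number\.number

UNANCHORED = re.compile(BODY)
ANCHORED = re.compile("^" + BODY + "$")

address = "1023.10.1.2"  # not a valid IP Address: the first number has 4 digits

# Unanchored, the engine skips the leading "1" and matches the rest.
found = UNANCHORED.search(address)
print(found.group(0) if found else None)

# Anchored, the whole field has to fit, so the same string is rejected.
print(ANCHORED.search(address))
print(bool(ANCHORED.search("10.10.10.10")))
```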

One last observation.

You should never see an IP Address specified with leading zeros, e.g. 1.002.003.004
While "mathematically" correct. Sorta. It's just not how Things Are Done. There may be a formal reason why somewhere.
The IP Address should always be of the form: 1.2.3.4

A Real Example

This is a rewrite of a snippet I supplied to a question posed on the GA Groups.
The problem was to filter a moderately complex range: 192.18.0.0 - 192.19.31.255

Which is not always that unusual.

As ever, break down the problem into bite sized chunks.


192.18.
192.19.0 to 31.

Ok, that 2nd one needs a bit more work to reduce and simplify further.
How about: (For clarity, I'm not escaping the dots, but treat them as though they are.)


192.19.[0-9].
192.19.1[0-9].
192.19.2[0-9].
192.19.30.
192.19.31.

Note that I didn't worry about that last 0-255. Why? No need: 0-255 covers all possible values, so we can safely ignore it.

So let's see if we can glue it together on a line by line breakdown.
We will need the caret here, as we're ignoring the last field. We therefore need to anchor the Regular Expression to the start of the address field, as otherwise it could match 1.192.18.21
Which would be undesirable.

Let's start with just the 192.18's:


^
192\.
18\.

Easy!
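The 1.192.18.21 problem mentioned above is easy to reproduce. A minimal sketch, assuming Python's `re` module:

```python
import re

UNANCHORED = re.compile(r"192\.18\.")
ANCHORED = re.compile(r"^192\.18\.")

for address in ["192.18.4.7", "1.192.18.21"]:
    print(
        address,
        bool(UNANCHORED.search(address)),  # finds "192.18." anywhere in the string
        bool(ANCHORED.search(address)),    # only at the very start
    )
```

Without the caret, the second address is matched even though it is nowhere near the range we want.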
What about the 19's?


^
192\.
19\.
(
[0-9]\.
|
1[0-9]\.
|
2[0-9]\.
|
30\.
|
31\.
)

And combine?


^
192\.
 (
   18\.
    |
   19\.
      (
         [0-9]\.
         |
         1[0-9]\.
         |
         2[0-9]\.
         |
         30\.
         |
         31\.
      )
  )

Hopefully the indenting makes the aligning a little easier. Use every trick you can with formatting and colouring to make it easier to read the Regular Expression. Clarity is the goal! Yes I DO do this sort of break down when I'm writing more complex Regular Expressions. You should too, if it helps!

Hopefully you can see that we have a few repeating elements that could be simplified. Specifically, all those escaped dots. Is it worth doing? In this case, yes. Why? It eliminates a commonly repeating character, and by doing so we reduce the odds of including a subtle error. You really want big errors. They're generally pretty obvious to spot. Subtle ones are hard.

So how about this one:


^192\.
  (
     18\.|
     19\.
        (
          [0-9]|1[0-9]|2[0-9]|30|31
         )
         \.
   )

We can't do the \. removal with the 18 and 19, as we need the dot to help define the 19 in greater detail within the internal brackets (the 5th through 8th lines, inclusive).
This isn't the exact solution I gave, but is pretty close, and I'd be perfectly happy to stop at this point.

But let's go a little further and see what other useful tricks we can do. It can be a bit Mt. Everest'ish at times. Because It's There and I Can.
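If you'd rather not take it on trust, a brute-force check is cheap. The sketch below assumes Python's `re` module, clean data (digits and dots only) and an arbitrary last number of 77; `in_target_range` and `mismatches` are names I made up for the check.

```python
import re

# The indented expression from the text, compiled as-is thanks to re.VERBOSE.
FACTORED = re.compile(
    r"""
    ^192\.
      (
         18\.|
         19\.
            (
              [0-9]|1[0-9]|2[0-9]|30|31
            )
            \.
      )
    """,
    re.VERBOSE,
)


def in_target_range(second: int, third: int) -> bool:
    """True for 192.18.0.0 through 192.19.31.255 (the first number is fixed at 192)."""
    return second == 18 or (second == 19 and third <= 31)


def mismatches() -> list:
    """Addresses where the expression and the numeric check disagree."""
    bad = []
    for second in range(256):
        for third in range(256):
            address = f"192.{second}.{third}.77"  # last number is arbitrary
            matched = FACTORED.search(address) is not None
            if matched != in_target_range(second, third):
                bad.append(address)
    return bad


print(mismatches())
```

An empty list means the expression and the range agree for every second and third number.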

Keeping in mind that we don't want to lose too much clarity in our efforts to be clever.

Tackling Mt. Everest

Note we have the repeating [0-9]'s?
If you group the first three of these you have:


(nothing)[0-9]|1[0-9]|2[0-9]

or


[12]?
[0-9]

Where the leading 1 or 2 is optional, via the question mark.
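A quick check that the optional leading digit covers what we want and nothing more, sketched with Python's `re` module (an assumption; any engine with the same syntax should behave alike):

```python
import re

TENS_OPTIONAL = re.compile(r"[12]?[0-9]")

# Which of the possible values 0..255 fit the fragment exactly?
matched = [n for n in range(256) if TENS_OPTIONAL.fullmatch(str(n))]

print(matched[0], matched[-1], len(matched))
```

The fragment covers 0 through 29, which is why 30 and 31 still need their own alternatives.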

Which now gives us:


19\.(
[12]?[0-9]
|
30|31
)\.

But ask yourself another question. Do we really need to explicitly specify the [0-9] at all? What else could it be? So why not replace it with a dot. Any generic character. It's still correct - But only because we know that the data will be clean. It won't be anything else like an A or Z.

So this then gives us:


19\.(
[12]?.
|
30|31
)\.
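The "only because the data is clean" caveat is worth seeing in action. In this sketch (Python's `re` module assumed; I've put the `^192\.` prefix from earlier back on so the fragment can be tested on whole addresses), the two versions agree on every clean address but part ways on junk:

```python
import re

WITH_DIGITS = re.compile(r"^192\.19\.([12]?[0-9]|30|31)\.")
WITH_DOT = re.compile(r"^192\.19\.([12]?.|30|31)\.")


def same_answer(address: str) -> bool:
    """True when both versions agree on whether the address matches."""
    return (WITH_DIGITS.search(address) is None) == (WITH_DOT.search(address) is None)


# Clean data: every possible third number, with an arbitrary last number.
clean_ok = all(same_answer(f"192.19.{third}.77") for third in range(256))
print(clean_ok)

# Dirty data (made up for illustration): only the dot version lets it through.
print(bool(WITH_DIGITS.search("192.19.A.1")), bool(WITH_DOT.search("192.19.A.1")))
```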

Obviously, we can do the same treatment to 30 and 31:


19\.(
[12]?.
|
3[01]
)\.

And thus the entire Reg Ex becomes:


^192\.
  (
     18\.|
     19\.
       (
         [12]?.
         |
         3[01]
       )\.
  )

or more correctly written as:


^192\.(18\.|19\.([12]?.|3[01])\.)

Earlier I mentioned we couldn't pull all the \.'s out? That was a fib. You can pull two more out:


^192\.(18|19\.([12]?.|3[01]))\.

So this is about as reduced a Regular Expression as my skill set will allow. Would I suggest you actually use it in anger? No. It's not all that clear, and just a little too clever by half. Any possible efficiencies it would give to the Reg Ex engine are probably not worth the loss in clarity.

So how would I write it?


^192\.(18\.|19\.(.|1.|2.|30|31)\.)

Why? It's pretty clear what it's trying to do, even down to using the dots, as they help break up the visual look of the Regular Expression without interfering with the understanding of what is being achieved.
Using [0-9] clutters the visual look without adding any value, given that we know we will have clean data.

The ordering of the expressions for the 0 to 31 is also critical for understanding. We count from 0 up, not 31 down.
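As a sanity check that the clever versions and the readable one really do the same job, here is a brute-force comparison. It assumes Python's `re` module and clean data, with an arbitrary last number of 77; the labels in `FORMS` and the helper names are mine.

```python
import re

# The three single-line forms from the text.
FORMS = {
    "one line": r"^192\.(18\.|19\.([12]?.|3[01])\.)",
    "dots pulled out": r"^192\.(18|19\.([12]?.|3[01]))\.",
    "readable": r"^192\.(18\.|19\.(.|1.|2.|30|31)\.)",
}


def expected(second: int, third: int) -> bool:
    """True for 192.18.0.0 through 192.19.31.255 (the first number is fixed at 192)."""
    return second == 18 or (second == 19 and third <= 31)


def disagreements(pattern: str) -> list:
    """Clean addresses where the pattern and the numeric check disagree."""
    compiled = re.compile(pattern)
    bad = []
    for second in range(256):
        for third in range(256):
            address = f"192.{second}.{third}.77"  # last number is arbitrary
            if (compiled.search(address) is not None) != expected(second, third):
                bad.append(address)
    return bad


for label, pattern in FORMS.items():
    print(label, disagreements(pattern))
```

All three give the same answers on clean data, so picking the readable one costs nothing in correctness.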
So while:


^192\.(18\.|19\.(31|2.|.|30|1.)\.)

is effectively identical, it's not helpful written that way.

It may seem like harping on a trivial point, but clarity of these things is important. You will discover this the hard way. Trust me.
