
Data Mining Apache Logs - Just How Fast is the End User Experience?

One of the nice things about Apache is its amazing ability to be configured. The logging sub-system is a lot more configurable than most people realise or make use of.
This article gives a brief overview of extending the log format to record the time taken to serve each request, and then extracting some insights into the overall speed of the website for most users.

The key difference is that this is not using external measuring tools and utilities: this is measuring what real users really get!

The information in this article predominantly applies to Apache 2.0.x and higher.

The Log Format

The LogFormat Apache configuration directive allows the systems administrator to easily customise what data from within Apache is logged: things like cookies, keep-alive status, compression levels (mod_deflate) and, the one we're interested in here, the time taken to serve a request.

The format string to get the request time is %D. Big Note!!! This is in microseconds, not seconds or milliseconds: 10⁻⁶ seconds! To account for that, I will typically also include %T, which is in seconds, as a quick eyeball-friendly number. The two are thus used together as %T/%D.

The LogFormat directive I use is:


LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{cookie}n\" %X %s/%>s %I/%O/%B %{pid}P/%{tid}P %T/%D %{outstream}n/%{instream}n/%{ratio}n" combined

Up until you get to the \"%{cookie}n\", this is a normal combined-format log.

Lies and Statistics

While I'm very open to better ways, I've found that expressing the percentages of users who achieve certain response times is easily grasped both by me and by those I report to. Averages and medians can be highly misleading and don't really help understanding. Percentiles are the way to go.

"95% of all Page Requests were completed in under X seconds" and so on.

It's also a nice way of comparing major system changes and their impact on end users: what was it like before the change, and what is it like after?
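
To see why a plain average can be so misleading, here is a tiny made-up illustration (using R, which is introduced properly below; the numbers are invented, not taken from real logs):


> times <- c(rep(0.1, 99), 60)        # 99 quick requests and one 60-second horror
> mean(times)                         # the "average" claims 0.7 seconds...
[1] 0.699
> quantile(times, c(0.5, 0.95, 1))    # ...the percentiles tell the real story
 50%  95% 100% 
 0.1  0.1 60.0 

Ninety-nine requests out of a hundred were served in a tenth of a second, yet the mean is seven times that.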

Originally I tried using various spreadsheet packages, but it's a bit hard to analyse several million rows in a spreadsheet. I rapidly graduated to a mix of Perl and spreadsheets, keeping only every Xth sorted value to fit within the row limitations. It worked, but it was painful.

And then I re-discovered R.

Arrrrrrrrrrrrrrh me Hearties

No not Arrrrrrrrrrh. Speak Like A Pirate Day is so over. "R". As in: "R is a language and environment for statistical computing and graphics."

R is exactly the tool for doing this sort of analysis.

R is also GPL'd and provides binaries for most common platforms.

Pre-Analysis

The pre-analysis consists of a single pipeline: grab a chunk of logs, process/filter appropriately, strip down to just the time in microseconds field, and finally check for basic errors.

Something like this:


zcat access2006-27-?.*.log.gz | \
  gawk -f image_filter.awk - | \
  cut -f 9- -d \" | cut -f 6 -d ' ' | \
  cut -f 2 -d '/'  | egrep "^[0-9]+$" \
  > 2006-27.data

This gives us a raw dump of all the non-image requests from the 3rd to the 9th of July 2006: a full week's worth to get started. The awk script referenced is pretty simple and can be downloaded here: image_filter.awk

Why no images? Well, on my data set they tend to be very fast (sub-millisecond) and they outnumber page requests by roughly 10:1, so they would swamp the more important page data.
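
If you would rather stay inside R for the whole job, roughly the same pre-processing can be done there instead of in the shell. This is only a sketch: it assumes a single uncompressed log file (the hypothetical access.log), uses a crude regular expression in place of image_filter.awk, and relies on the field layout of the LogFormat above:


> raw   <- readLines("access.log")                           # hypothetical uncompressed log
> raw   <- raw[!grepl("\\.(gif|jpe?g|png|ico|css|js)", raw)] # crude stand-in for image_filter.awk
> parts <- strsplit(raw, " ", fixed = TRUE)
> td    <- sapply(parts, function(f) f[length(f) - 1])       # the %T/%D field, second from the end
> micro <- as.numeric(sub(".*/", "", td))                    # keep just %D, the microseconds
> micro <- micro[!is.na(micro)]                              # same sanity check as the egrep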

Using R

Now for the fun part: producing useful information.

So let's start R:


$ R

R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

Note the last line! How to quit. I would strongly advise going through the R documentation if you wish to really explore its power. It's quite impressive what R can do. And then some. Great documentation too.

So, now to read in our data:


> y1 <- read.table("2006-27.data", header=FALSE)

This reads the entire dataset we dropped into the file 2006-27.data into the variable "y1". Think of it as being like an array in this instance.
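
It's worth a quick look at what actually got loaded before going further; something along these lines:


> str(y1)         # should show a data frame with one integer/numeric column, V1
> nrow(y1)        # the number of requests that survived the filtering
> summary(y1$V1)  # min/max give an early warning of any silly outliers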

It may be necessary to clean the data; it usually is for me (I'm not sure whether that's a quirk of R or of my data). If you get funky error messages, you may find this step helps:


> y2 <- as.matrix(y1)

I've had to do this most times, so I will assume you have too.
The error messages generated look like:
Error in sort(y) : 'x' must be atomic
or: Error in sort(x, partial = unique(c(lo, hi))) : 'x' must be atomic
and so on.
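
For what it's worth, the likely cause of those errors is that read.table() returns a data frame (a kind of list), while sort() and friends want a plain atomic vector. as.matrix() gets you one, and so does simply pulling the single column out:


> y2 <- y1$V1     # V1 is the default column name from read.table; y1[[1]] works too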

A picture tells a thousand words, so let's draw a basic graph:


> qqplot(c(seq(0,1, by=0.001)), y2,
       main="Page Responsiveness",
       xlab="Percentiles", ylab="Time in Microseconds",
       type="p",tck=-0.01,col="blue",
       lab=c(20,20,20), sub="2006 Week 27")

Most of these should be fairly self-explanatory. The main one to be aware of is the function call:


c(seq(0,1, by=0.001))

This generates our X values to use: from 0 to 1 in steps of 0.001. Change the 0.001 to 0.1 and you'll see a pretty obvious change in the output. Changing the "type" from "p" to "l" will give a line graph, which may be preferred.
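
As an aside, feeding qqplot() a uniform 0-to-1 sequence like this is really just a compact way of plotting the sorted times against their percentile rank. Essentially the same picture can be drawn more explicitly, which makes the trick easier to see:


> p <- seq(0, 1, by=0.001)
> plot(p, quantile(y2, p),
       main="Page Responsiveness",
       xlab="Percentiles", ylab="Time in Microseconds",
       type="p", col="blue")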

Anyway, the result should look a bit like this:

Which isn't terribly useful, for me at least; your data may be more consistent. This is also why averages get thrown so far out: a few very, very large numbers can make life thoroughly miserable.

Hmmm. So what's the current breakdown numerically then?


> quantile(y2, seq(0, 1, by=0.05))
          0%           5%          10%          15%          20%          25%
9.900000e+01 1.892500e+04 3.729900e+04 4.163700e+04 6.507020e+04 9.271700e+04
         30%          35%          40%          45%          50%          55%
9.786600e+04 1.011730e+05 1.036100e+05 1.058850e+05 1.084090e+05 1.115770e+05
         60%          65%          70%          75%          80%          85%
1.161480e+05 1.299959e+05 1.608651e+05 1.929688e+05 3.345434e+05 4.287816e+05
         90%          95%         100%
6.966534e+05 1.009357e+06 1.004556e+10

OK, a little better. This is telling us that 95% of non-image requests were completed in about 1 second or quicker. Not too bad. So let's try again, but limit it to a maximum of 10 seconds, and to make it easier to read, let's convert to actual seconds.

So seconds conversion first:


> y3 <- y2 / 1000000

And now only look at those pages completing in under 10 seconds.


> y4 <- y3[y3 < 10]
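
It's worth knowing how much data that cap discards; a quick check:


> length(y3) - length(y4)      # how many requests took 10 seconds or longer
> 1 - length(y4)/length(y3)    # the same, as a fraction of the whole data set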

Now graph it again, and display a grid to make it easier to read:


> qqplot(c(seq(0,1, by=0.001)), y4,
      main="Page Responsiveness",
      xlab="Percentiles", ylab="Time in Seconds",
      type="p",tck=-0.01,col="blue",
      lab=c(20,20,20), sub="2006 Week 27")
> grid()

Which is far more useful. We can now tell management all sorts of useful things (each of which can be checked directly against the data, as shown just after this list):

  • ~ 95% of all page requests are 1 second or faster
  • ~ 85% of all page requests are half a second or faster
  • The flat spot from ~22% to ~55% suggests that about a third of all requests take around 0.1 seconds, and that about half of all requests are 0.1 seconds or faster. Pity. Can't justify getting new servers yet.
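
Checking those figures rather than eyeballing them off the graph (y3 is the full data set, in seconds):


> mean(y3 <= 1)       # fraction of page requests completed within 1 second
> mean(y3 <= 0.5)     # ...within half a second
> mean(y3 <= 0.1)     # ...within a tenth of a second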

If you want to get clever, you can approximate the above into visitor/people terms.

From here?

Oh, the sky is the limit. Some ways to change your initial filters and extend the analysis:

  • Strip robots out
  • Instead of non-images (i.e. pages), look at images alone.
  • Examine just your peak period, stripping out times outside it.
  • Evening vs the middle of the day.
  • Group different pages; they may be handled by different chunks of code.
  • Home Page vs Internal vs News pages
  • Strip out administration/monitoring traffic as well as robots to get a better idea of normal users
  • Try to find the slow pages: why are they slow?
  • Can you identify broadband vs dialup users? They'll have an impact on your speed here.
  • Do a few trial accesses on a dialup connection, noting your IP address, and examine the logs just for that address. Compare with a known broadband user, a LAN user and so on.
  • Spend some time with the R manuals so you can draw prettier graphs than I've done here. Don't forget to tell me how!
  • Strip out 404's, 5xx's and all 3xx's. Look at pure 200 OK's only. How fast a 404 page was returned isn't generally all that useful.
  • Answer the $60,000 question: which is faster, Firefox or MSIE?
  • If you really want to discover the fine detail of the changeover point from horizontal to vertical, playing around with quantile and the sequence numbers used is your best bet.
  • This worked nicely for me:

    
    > quantile(y3, seq(0.95, 1, by=0.001))
             95%        95.1%        95.2%        95.3%        95.4%        95.5%
        1.009357     1.018532     1.027757     1.038272     1.049728     1.062575
    ...
           98.6%        98.7%        98.8%        98.9%          99%        99.1%
        2.116697     2.207339     2.307198     2.447547     2.605449     2.788837
           99.2%        99.3%        99.4%        99.5%        99.6%        99.7%
        3.005831     3.229211     3.665722     4.298540     5.154520     6.231762
           99.8%        99.9%         100%
        9.057905    23.704040 10045.556736
    
    

    This suggests that for the complete data set, only 0.8% of all page requests take 3 seconds or longer, which suggests to me, wearing my sysadmin hat, that things are pretty good for this system. Your systems may have very different needs. Over 10,000 users, this may imply that only 80 people were impacted. That may or may not be acceptable.

Additional

Generating these graphs as PNG files, or in many other output formats, is also pretty easy.
The basic steps are to change the "output device", run the various graphing commands (qqplot, grid etc.) and then close that device. Like so:


> png(filename="second_page.png")
> qqplot(....
> grid()
> dev.off()
X11
  2
>

(The "X11 2" printed here is simply dev.off() reporting which graphics device is now current: in this case the original on-screen X11 window, device 2.)
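
If the default image size is too small, png() also takes width and height arguments (in pixels); for example:


> png(filename="second_page.png", width=1024, height=768)

followed by the same qqplot/grid/dev.off() sequence as above.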

But most of all: have fun pulling all sorts of interesting insights out of the data.

My total time playing around... errr, working with R amounts to about 4 hours in total. It really is that easy to get going and generate useful insights.