Web Proxy-Log Usage Forensics and Analysis

This article is based on my direct experience of having to provide reports to the Powers That Be on Inappropriate Usage of the WWW. It is heavily based on a howto/summary I wrote for the SAGE-AU tech mailing list at the beginning of 2005.
Assumptions? That you are using squid proxy server(s) and that users cannot bypass them to get to the Net directly.
The information provided may be general enough for users of other proxy servers to use the concepts and methods, if not the exact same tools.

Lastly, this is not a pleasant task. It is also a very time-consuming one, and a fair amount of that time goes into writing the final reports. I'd go so far as to say that documentation, and lots of it, is far and away the most critical part of the task.

Inappropriate Usage typically falls into several categories:

  1. Images
  2. Movies
  3. Music
  4. Warez

There was an additional category, text stories, that I discovered thru the course of various investigations. Further analysis found that this tended to accompany categories 1 and 2 above (images and movies).
Thus we need to filter down to these cases and look accordingly.

Clues?

  • Large volume users. (GB vs MB vs KB)
  • Large activity users. Quantity.
  • Time of day or week. Activity after hours or on weekends.
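
As a rough sketch of how these three clues can be pulled straight from the raw logs, the function below tallies requests, bytes and "odd hours" requests per client. It assumes squid's native access.log format (field 1 is the epoch timestamp, field 3 the client address, field 5 the byte count). The function name clue_summary, the tz_offset argument and the 07:00 to 19:00 working day are my own illustrative choices, not anything squid or calamaris provides.

```shell
# Summarise requests, bytes and after-hours/weekend requests per client.
# Reads squid native-format log lines on stdin.
# Assumed fields: $1 = epoch time, $3 = client address, $5 = bytes.
# $1 of the function (tz_offset) is the local offset from UTC in hours.
clue_summary() {
    tz_offset="${1:-0}"
    awk -v tz="$tz_offset" '
    {
        t = int($1) + tz * 3600
        hour = int((t % 86400) / 3600)
        dow = (int(t / 86400) + 4) % 7   # 0 = Sunday; the epoch began on a Thursday
        reqs[$3]++
        bytes[$3] += $5
        # Assumed working day: Monday to Friday, 07:00 to 19:00
        if (dow == 0 || dow == 6 || hour < 7 || hour >= 19) odd[$3]++
    }
    END {
        # Output columns: client, requests, bytes, odd-hours requests
        for (c in reqs)
            printf "%s %d %.0f %d\n", c, reqs[c], bytes[c], odd[c]
    }' | sort -k3,3nr
}
```

Something like "zcat log.gz | clue_summary 10" lists the heaviest clients by volume first. The fixed offset ignores daylight saving, so treat times near the edges of the working day with suspicion.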

Gotchas?

  • Multiple users using a single machine
  • Single user using multiple machines
  • Users bypassing the proxies - via SSL/SSH tunnels and similar
  • Users who have a legitimate need to access this material. It does happen...

Some of these are solvable and clearly seen thru manual reading of the generated reports. Some are not. There are no perfect solutions.

Automated Initial Analysis
First off: squid logs. So we can use standard squid analysis tools. Yay!

I found calamaris to be the most useful analysis tool for generating a result set that was easy to parse and present. A simple, typical run looks like this:


zcat log.gz | egrep -v "filter" \
 | calamaris -F 'html' -D 2 -d -1 -R -1 -s -t -1 \
 > results.html
        

I ended up having to do multiple runs with calamaris to produce final reports:

  • Initial everything+sink runs to get a feel of what was normal. And a base reference for double checking later assumptions.
  • To find sites that generate a lot of images that are in all probability ok and remove them. eg .gov.au; ebaystatic.com and many others. This helped reduce the noise and bring the questionable activity to the surface.
    eg: "egrep -v -f exceptions.list" as a pre filter to calamaris.
    It required multiple runs to extract a decent set of exceptions to pre-filter.
  • Set a cut-off point. eg below 200 images we assume is ok. This will probably need management buy-in.
    This is to help bring the result set down to a manageable problem set. Otherwise, I suspect, I would have had to double/triple the time taken.
  • lynx is very useful for examining web sites whose content is ... not obvious, or that you are unsure about.
    You can, ~ 98% of the time, categorise a site without seeing any images. Which is far preferable to the alternatives. Especially in an open-plan office environment...
    Alternately disable images in a normal browser. Lynx worked nicely for me. YMMV.
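
The exceptions pre-filter and the cut-off can also be combined in a quick check outside calamaris. The following is a minimal sketch, assuming squid's native log format (field 3 is the client, field 7 the URL). The function name jpeg_counts is hypothetical, and the default of 200 simply mirrors the example cut-off above.

```shell
# Count JPEG requests per client, skipping any log line that matches a
# pattern in the exceptions file, and print only clients at or above a
# cut-off. Reads squid native-format log lines on stdin.
# Assumed fields: $3 = client address, $7 = URL.
jpeg_counts() {
    exceptions="$1"      # file of extended regexps, one per line, no blank lines
    cutoff="${2:-200}"   # whatever cut-off has been agreed with management
    grep -E -v -f "$exceptions" \
      | awk -v min="$cutoff" '
        tolower($7) ~ /\.jpe?g([?#]|$)/ { n[$3]++ }
        END { for (c in n) if (n[c] >= min) print n[c], c }' \
      | sort -k1,1nr
}
```

For example: "zcat log.gz | jpeg_counts exceptions.list 200". Beware of a blank line in the exceptions file: an empty pattern matches every line, so the -v would then discard the entire log.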

Images

One thing I rapidly noticed was that questionable images are almost always JPEGs. By ignoring GIFs, PNGs etc we also reduce the noise level hugely. There is a danger here as well: someone who uses a different file ending can avoid detection. This is a risk/judgement call. If I was investigating a single person I would not do or recommend this pre-optimisation.
Having said this, the initial reports can be used to show all extensions accessed. Which does help to identify "odd" extensions being heavily accessed. So I do feel pretty comfortable with using this pre-optimisation for generic scans. YMMV.

Reports were run to identify large requestors of images.
From here we identified user IDs and IP addresses of "significance". We then ran non-filtered reports against each single address/user.
eg:


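# Build one unfiltered calamaris report per address of interest.
for i in $(cut -f 1 IPaddresses.COMBINED.txt | sort -u)
do
   j=${i//\./\\\.}   # escape the dots so egrep treats them literally (bash syntax)
   echo "$i  $j"
   zcat sorted.log.gz \
    | egrep -w "$j" \
    | calamaris -n -F 'html' \
      -D 2 -d -1 -R -1 -s -t -1 -P 60 -N 3 -u \
    > "USER.$i.html"
done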
        

Where IPaddresses.COMBINED.txt is a file holding manually identified IP addresses of interest, one per line in the first tab-separated column.

Hopefully the "-w" in the egrep will only catch complete IP Address matches for you: eg 192.168.1.1 vs 192.168.1.124. Trying to get a more accurate regexp may not be worth your time.
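
One caveat with the egrep approach is that it matches the address anywhere on the line, including inside a requested URL. If that matters, an exact comparison on the client field sidesteps the regexp question altogether. This is a sketch that assumes squid's native log format, where the client address is field 3. The name client_lines is hypothetical.

```shell
# Print only the log lines whose client field is exactly the given address.
# Reads squid native-format log lines on stdin; assumes $3 = client address.
client_lines() {
    awk -v ip="$1" '$3 == ip'
}
```

In the loop above it would stand in for the egrep stage, eg "zcat sorted.log.gz | client_lines "$i" | calamaris ...", and the escaped copy in $j is then no longer needed.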

Seeing all of a given address's traffic, and when it occurred, gave a more accurate insight into a given user's activity. Or multiple users', as the case may be.

Movies & Music

Similar to the jpeg pre-filter; based on early scans showing file types.
The extensions most "abused" were identified as being in this list:


egrep "\.(wmv|mov|avi|ram|dvd|mp3|mp4|mpg|m3u)"
        

This may not be accurate for others, but worked for us based on earlier reports. This then identified more addresses of "significance" to be added to the address list.
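
The egrep pattern above matches one of these strings anywhere on a log line, so a URL such as page.mp3.html also gets caught. For building the address list, a slightly stricter variant can test only the end of the URL path. The sketch below assumes squid's native log format (client in field 3, URL in field 7). The name media_clients is made up for illustration.

```shell
# List the unique client addresses that requested a URL whose path ends in
# one of the media extensions.
# Reads squid native-format log lines on stdin.
# Assumed fields: $3 = client address, $7 = URL.
media_clients() {
    awk '
    {
        url = tolower($7)
        sub(/[?#].*$/, "", url)    # ignore any query string or fragment
        if (url ~ /\.(wmv|mov|avi|ram|dvd|mp3|mp4|mpg|m3u)$/) print $3
    }' | sort -u
}
```

The output can be reviewed by hand and appended to IPaddresses.COMBINED.txt. The stricter match will miss media served via a download script with the file name in the query string, so the looser egrep remains useful as a cross-check.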

Warez

Same same. Volume/Size was the key giveaway here.

Summary:

  • Keep good notes of what you did and why. It is possible that this can lead to internal disciplinary action, dismissal, formal prosecution and/or AIRC court cases.
  • Keep your mouth shut. :-) This hasn't happened to me, but I do strongly recommend it as generic advice: Person(s) "of interest" may be on the other side of the partition from you.
  • Keep management informed of progress. This is a long, slow, tedious and very unpleasant task.
  • Keep tabs on assumptions you make. You may need to revisit them based on new information. eg text stories as mentioned above.
  • Document, Document, Document. Did I mention documenting near everything??
  • Coffee is very useful. Good coffee is better.
  • Reduce and simplify as much as possible before you start to analyse reports in detail. Initial non-filtered reports in this case were in excess of 12 MB long. Be careful of simplifying too much!
  • Keep a non filtered, no assumptions report around for a baseline double check.
  • There are multiple ways of helping identify "odd" activity. Use them. Time vs Volume vs Quantity.
  • Don't worry too much about writing super efficient shell scripts. The time a PC takes doing a scan is trivial compared to the time you will expend in brain power doing the manual analysis. And some of these runs I've done would take > 30 minutes on a 3 GHz HT P4.
  • Having said the previous: try and parallelise. You can be reading result set A; while result set B is being generated. This frequently meant that B would be stopped part way thru and A & B redone with new information.

I hope you, the reader, find this useful. Comments and criticisms are always welcome.

spm@stedee.id.au
