Process a web log file for visitor statistics
Latest is version 1.1: visitors-1.1.tar.gz
visitors processes a web log file, trying very hard to identify each individual "person" as reliably as possible. This is typically achieved either through an identifying cookie recorded in the log file, or through the combination of IP address/name and browser ID.
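The identification rule can be pictured with a minimal Python sketch. This is not the code visitors itself uses; the function name `visitor_key` and the shape of the returned key are assumptions made purely for illustration.

```python
from typing import Optional, Tuple


def visitor_key(cookie: Optional[str], host: str, user_agent: str) -> Tuple[str, ...]:
    """Return a hashable identity for one log line (illustrative only).

    Prefer the tracking cookie when the log line carries one; otherwise
    fall back to the host (IP address or name) plus browser ID pair.
    """
    if cookie:
        return ("cookie", cookie)
    return ("host+agent", host, user_agent)
```

Two requests carrying the same cookie map to the same key even if the IP address changes, which is why the cookie is the stronger of the two identifiers.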
visitors assumes that the logs it is given are already sorted in date/time order, oldest to most recent.
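If in doubt about the ordering, it can be checked before a run. The sketch below is a hypothetical helper, not part of visitors; it assumes each line carries an Apache-style timestamp in square brackets, such as `[11/Aug/2005:10:15:00 +0000]`, and skips any line it cannot parse.

```python
import re
from datetime import datetime

# Assumption: the timestamp is the first [...] group on the line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def first_out_of_order(lines):
    """Return the 1-based number of the first line older than its
    predecessor, or None if the timestamps never go backwards."""
    previous = None
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP.search(line)
        if not match:
            continue  # no timestamp found; ignore the line
        try:
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # bracketed text was not a timestamp
        if previous is not None and stamp < previous:
            return number
        previous = stamp
    return None
```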
A Berkeley database is used to maintain visitor details between runs, and to reduce the memory footprint within a run. All files are processed in the order specified.
visitors works best (currently only!) when the Apache module "mod_usertrack" is enabled and a cookie entry is stored in the resulting log file.
The numbers produced are not perfect. There is an element of error, but hopefully these numbers are more accurate than those from alternative methods.
There is one major trick to be aware of: the separation of old versus new visitors. A new visitor is only converted to an old visitor in a new run of the program. This means that if you wish to compute new versus old over a month, you must send an entire month's worth of logs to visitors in a single run. Doing multiple runs will not achieve the desired result in this instance.
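A toy model makes the effect visible. The sketch below is only an illustration of the rule described above, not the real implementation: `run` and the visitor ids are hypothetical, and a plain Python set stands in for the database.

```python
def run(known, visitor_ids):
    """Simulate one run over a sequence of visits.

    A visit counts as "old" only if the visitor was already in the
    database when the run started. Visitors first seen during this run
    stay "new" until the next run. Returns (new_visits, old_visits).
    """
    seen_at_start = set(known)  # snapshot of state from earlier runs
    new_visits = old_visits = 0
    for vid in visitor_ids:
        if vid in seen_at_start:
            old_visits += 1
        else:
            new_visits += 1
        known.add(vid)  # persisted for later runs
    return new_visits, old_visits


month = ["a", "b", "a", "c", "a"]

db = set()
single = run(db, month)        # whole month in one run

db = set()
first = run(db, month[:2])     # same month split over two runs
second = run(db, month[2:])
```

Here `single` is `(5, 0)`, while the split runs give `(2, 0)` and `(1, 2)`: in the second run visitor "a" has already been converted to old, so the month's new-versus-old split no longer matches the single-run figures.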
Performance is typically enhanced by increasing available memory. A faster processor will help, but not as much as more memory. With over two years' worth of data in a given DB, a run against ~3.5 million lines of logs takes around 70 seconds on a 3 GHz Xeon with 4 GB of RAM. A similar run was taking ~11 minutes on a 2.4 GHz P4 that was quite memory challenged (mainly due to other processes).
visitors is released under the General Public License.
Changes from v1.0 ==> v1.1
This release focuses on providing additional data analysis, and the cleanup of old and dead data. See the man page and examples below for details.
- add a cleanup/remove old data option
- add various data queries
The latest version is 1.1, released 11th August 2005.
The source code for visitors can be found here: visitors-1.1.tar.gz.
Older versions can be found here.
Feedback on the use of visitors, and requests for new features, are most welcome. A project page has been created at SourceForge to assist. Feel free to leave bug reports, requests, etc.!
This package requires two additional libraries:
- Berkeley DB. Built with DB4; it may work with earlier versions.
- Perl-compatible regular expression library. Built against version 4.5.
It should be noted that it is generally preferable to feed visitors with data via a pipeline, and hence to do any data manipulation or filtering before the data reaches visitors itself.
The last of these is a near copy of the commands used in the original, and current, production use of visitors. In this case logs are rotated daily (via cronolog) and named for the week and day of the week.
The examples given will use the following test log file
This simplistic log file shows two separate accesses to the same page by two different visitors. The two visitors have different IP addresses and browser identifiers.
A near default run against the above logfile. We assume that "test.db" does not exist before invocation.
And the end results? 2 Visitors, 2 Visits and 4 page Hits. This basic summary allows some simple additional analysis, e.g. how many pages did each visitor average? How many visits were by prior (Old) visitors versus New? And so on.
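The derived figures are simple ratios of the three totals. As a minimal sketch (the function and key names are hypothetical; only the totals 2, 2 and 4 come from the run above):

```python
def summarise(visitors, visits, hits):
    """Derive per-visitor and per-visit averages from the raw totals."""
    return {
        "visits_per_visitor": visits / visitors,
        "hits_per_visitor": hits / visitors,
        "hits_per_visit": hits / visits,
    }


stats = summarise(visitors=2, visits=2, hits=4)
```

For the test log this gives one visit per visitor and two page hits per visitor (and per visit).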
Using the "simple" option (-s), the output is reduced to little more than the non-derivable numbers. Running at a single verbosity level will display the column headers.
With the default level of verbosity, we don't get any column headers with "simple" output. This is ideal for long runs over many months of logs.
This example shows the use of the unseen-since query option. For each day more than 366 days in the past, it shows how many visitors were last seen on that day and never again. Finally, a grand total of these defunct visitors is shown.
The second example filters the output via gawk to provide monthly summaries. Some may find the code example useful. Or not.
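For those who prefer Python to gawk, the same kind of roll-up might look like the sketch below. It is not a translation of the gawk filter, and the input layout is an assumption: each line is taken to start with a `YYYY-MM-DD` date followed by a count, which may not match the real query output. Lines that do not fit (headers, totals) are skipped.

```python
from collections import defaultdict


def monthly_totals(lines):
    """Sum per-day counts into per-month totals, keyed by 'YYYY-MM'.

    Assumed (hypothetical) input layout: '<YYYY-MM-DD> <count>' per line.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            count = int(fields[1])
        except ValueError:
            continue  # header or other non-numeric line
        totals[fields[0][:7]] += count
    return dict(sorted(totals.items()))


# Made-up sample data, for illustration only.
sample = ["date count", "2004-07-30 3", "2004-07-31 2", "2004-08-01 7"]
result = monthly_totals(sample)
```

With the made-up sample above, `result` holds 5 for 2004-07 and 7 for 2004-08.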
This example demonstrates the use of the access-delta query option. The idea is to display the number of visitors against the time difference, in days, between their initial and most recent visits.
Here we can see that there were 3456 visitors who have been visiting for 4 days. Keep in mind that those 4 days may have been years ago or last week, depending on the individual visitor. This is obviously very dependent on how far back your data/logs go.
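The access-delta idea can be sketched in a few lines of Python. This only illustrates the concept and does not show how visitors computes it; the function name, the input mapping and the sample dates are all hypothetical.

```python
from collections import Counter
from datetime import date


def access_delta(first_last):
    """Count visitors by the number of days between first and last visit.

    first_last maps a visitor id to a (first_visit, last_visit) date pair.
    """
    return Counter((last - first).days for first, last in first_last.values())


# Made-up visitors, for illustration only.
sample = {
    "v1": (date(2003, 1, 1), date(2003, 1, 5)),   # 4 days, years ago
    "v2": (date(2005, 8, 1), date(2005, 8, 5)),   # 4 days, recently
    "v3": (date(2005, 8, 1), date(2005, 8, 1)),   # seen on one day only
}
deltas = access_delta(sample)
```

Both "v1" and "v2" land in the 4-day bucket even though their visits are years apart, which is the caveat noted above.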