Visitors

Process a web log file for visitor statistics

The latest version is 1.1: visitors-1.1.tar.gz

Description

visitors processes a web log file, trying very hard to identify each single "person" as reliably as possible. This is typically achieved either through an identifying cookie in the log file, or through the combination of IP address/name and browser ID.
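The identification rule described here can be sketched in a few lines of shell and awk. This is an illustration only, not the code visitors itself uses. The function name visitor_keys is hypothetical, and the sketch assumes the log layout of the test log shown under Examples: host first, then the quoted request, referer, browser ID and cookie fields, in that order.

```shell
# Hypothetical helper: print one identifying key per log line read from stdin.
# Assumes the quoted fields are request, referer, browser ID, cookie, in that order.
visitor_keys() {
    awk -F'"' '{
        split($1, head, " ")      # head[1] is the host name or IP address
        browser = $6
        cookie  = $8
        if (cookie != "" && cookie != "-")
            print "cookie:" cookie                  # preferred: the tracking cookie
        else
            print "hostua:" head[1] "|" browser     # fallback: host plus browser ID
    }'
}
```

Piping a log through visitor_keys and then through sort -u gives a rough list of distinct keys, which is the general idea behind counting one "person" once.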

visitors assumes that the logs it is sent are already sorted in date/time order, oldest to most recent.
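Since the ordering is assumed rather than checked, a quick sanity test before a long run may be worthwhile. The following is a minimal sketch, not part of visitors. The function name check_log_order is hypothetical, and it assumes timestamps of the form [dd/Mon/yyyy:hh:mm:ss zone], as in the test log under Examples, with the same time zone offset on every line.

```shell
# Hypothetical helper: report the first line whose timestamp is older than
# the one before it. Exit status is 0 when the log is in order, 1 otherwise.
# Assumes every line carries the same time zone offset.
check_log_order() {
    awk -F'[][]' '{
        split($2, t, "[/: ]")     # t: day, month name, year, hour, minute, second, zone
        month = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[2]) + 2) / 3
        key = sprintf("%04d%02d%02d%02d%02d%02d", t[3], month, t[1], t[4], t[5], t[6])
        if (prev != "" && key < prev) {
            print "out of order at line " NR
            bad = 1
            exit
        }
        prev = key
    } END { exit bad + 0 }'
}
```

For example, zcat /weblogs/logfile.20050314.gz | check_log_order prints nothing and returns success when the file is already in order.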

A Berkeley DB database is used to maintain visitor details between runs, and to reduce the memory footprint within a run. All files are processed in the order specified.

visitors works best (currently only!) when the Apache module "mod_usertrack" is enabled and a cookie entry is stored in the resulting log file.

The numbers produced are not perfect. There is an element of error, but hopefully these numbers are more accurate than those from alternative methods.

There is one major trick to be aware of: the separation of old vs new visitors. A new visitor is only converted to an old visitor in a new run of the program. This means that if you wish to compute new vs old over a month, you must send an entire month's worth of logs to visitors in a single run. Doing multiple runs will not achieve the desired results in this instance.
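One way to gather a whole month for a single run is sketched below. It assumes daily logs named logfile.YYYYMMDD.gz, as in the first example under Examples; the helper name month_logs and the directory layout are hypothetical. Because the shell expands the glob in sorted order and the dates are zero-padded, the files come out oldest first.

```shell
# Hypothetical helper: list one month of daily logs, oldest first.
# Usage: month_logs DIRECTORY YYYYMM
month_logs() {
    dir=$1
    month=$2
    for f in "$dir"/logfile."$month"??.gz; do
        # Skip the unexpanded pattern when no file matches.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# A single run over March 2005 might then look like this
# (assumes file names without spaces):
#   month_logs /weblogs 200503 | xargs zcat | grep -Ev "$FILTER" | visitors
```

Keeping everything in one pipeline means visitors sees the month as a single run, so new visitors stay "new" for the whole period.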

Performance is typically enhanced by increasing available memory. A faster processor will help, but not as much as more memory. With over two years' worth of data in a given DB, a run against ~3.5 million lines of logs takes around 70 seconds on a 3 GHz Xeon with 4 GB of RAM. A similar run was taking ~11 minutes on a 2.4 GHz P4 that was quite short of memory (mainly due to other processes).

visitors is released under the General Public License

Changes from v1.0 ==> v1.1

This release focuses on providing additional data analysis, and the cleanup of old and dead data. See the man page and examples below for details.

  • add a cleanup/remove old data option
  • add various data queries

Source

The latest version is 1.1, released 11th August 2005.

The source code for visitors can be found here: visitors-1.1.tar.gz.

Older versions can be found here.

Feedback

Feedback on the use of visitors, and requests for new features, are most welcome. A project page has been created at SourceForge to assist. Feel free to leave bug reports, requests and so on!

Requirements

This package requires two additional libraries:

Examples

It is generally preferable to feed visitors with data via a pipeline, and hence to do any data manipulation or filtering before the data reaches visitors itself.


zcat /weblogs/logfile.20050314.gz | egrep -v "$FILTER" | visitors

or

for i in $(seq -w 1 52) ; do
    echo "$i"
    for j in 1 2 3 4 5 6 0 ; do
        zcat "$LOGDIR"/access2004-"$i"-"$j".log.gz | egrep -v "$FILTER"
    done | visitors
done

This last example is a near copy of the commands used in the original and current production use of visitors. In this case logs are rotated daily (via cronolog) and named for the week and the day of the week.

The examples given will use the following test log file


$ cat test.08.2Vor.2Vt.2pgx2.log
H8a - - [12/Apr/2004:02:00:01 +1000] "GET / HTTP/1.0" 200 30614 "-" "B8a" "C8a"
H8b - - [12/Apr/2004:02:00:02 +1000] "GET / HTTP/1.0" 200 30614 "-" "B8b" "C8b"
H8a - - [12/Apr/2004:02:00:03 +1000] "GET /image.gif HTTP/1.0" 200 3614 "-" "B8a" "C8a"
H8b - - [12/Apr/2004:02:00:04 +1000] "GET /image.gif HTTP/1.0" 200 3614 "-" "B8b" "C8b"
H8a - - [12/Apr/2004:02:00:05 +1000] "GET / HTTP/1.0" 200 30614 "-" "B8a" "C8a"
H8b - - [12/Apr/2004:02:00:06 +1000] "GET / HTTP/1.0" 200 30614 "-" "B8b" "C8b"
        

This simplistic log file shows two separate accesses to the same page by each of two different visitors. The two visitors have different IP addresses and browser identifiers.


$ visitors -d test.db -f test.08.2Vor.2Vt.2pgx2.log
Visits/ors:    Visitors:        2  New Visitors:        2  Old Visitors:        0
Per Day (  1): Visitors:        2  New Visitors:        2  Old Visitors:        0

Visits:    Total Visits:        2  New Visitors:        2  Old Visitors:        0
 Per Day:  Total Visits:        2  New Visitors:        2  Old Visitors:        0
             Visits Per:           New Visitor:         1  Old Visitor:         0

Pages:      Total Pages:        4  New Visitors:        4  Old Visitors:        0
 Per Day:   Total Pages:        4  New Visitors:        4  Old Visitors:        0
              Pages Per:           New Visitor:         2  Old Visitor:         0

Visitors Identified by Cookies:       2   -> 100.00%
Removed Cookies:                      0
        

This is a near-default run against the above log file. We assume that "test.db" does not exist before invocation.

And the end results are? 2 visitors, 2 visits and 4 page hits. This basic summary also provides some simple additional analysis, for example: how many pages did each visitor average? How many visits were by prior (old) visitors vs new? And so on.


$ visitors -v -s -d test.db -f test.08.2Vor.2Vt.2pgx2.log
Days    Visitors        NewVisitors     OldVisitors        Visits       NewVisits       OldVisits          Pages        NewPages        OldPages
   1           2               2               0               2               2               0               4               4               0
        

The "simple" option (-s) reduces the output to little more than the numbers that cannot be derived from the others. Adding a single level of verbosity (-v) displays the column headers.


$ visitors -s -d test.db -f test.08.2Vor.2Vt.2pgx2.log
1           2               2               0               2               2               0               4               4               0
        

With the default level of verbosity, we don't get any column headers with "simple" output. This is ideal for long runs over many months of logs.
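The derived figures that the "simple" output leaves out can be recomputed afterwards. The sketch below is one way to do so, assuming the ten-column order shown by the -v -s header above; the function name simple_averages and its output format are hypothetical.

```shell
# Hypothetical helper: derive per-visitor averages from lines of -s output.
# Column order assumed from the -v -s header: Days, Visitors, NewVisitors,
# OldVisitors, Visits, NewVisits, OldVisits, Pages, NewPages, OldPages.
simple_averages() {
    awk 'NF == 10 && $1 ~ /^[0-9]+$/ {
        days = $1; new_visitors = $3; new_visits = $6; new_pages = $9
        if (new_visitors > 0)
            printf("days=%d visits_per_new_visitor=%.2f pages_per_new_visitor=%.2f\n",
                   days, new_visits / new_visitors, new_pages / new_visitors)
    }'
}
```

Header lines are skipped because their first field is not numeric, so the same filter works with or without -v. For the test log, the result agrees with the "Visits Per" and "Pages Per" figures of the default report: 1 visit and 2 pages per new visitor.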


$ visitors -d /dev/shm/2004.db --query --unseen-since=366
Date         Nbr Unseen Visitors
22-May-2004                 1234
21-May-2004                 5678
20-May-2004                 9012
19-May-2004                 3456
...
================================
Total:                    987654


$ visitors -d /dev/shm/2004.db --query --unseen-since=366 | \
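      gawk -F '[- ]+' '/^[0-9]+-/ {
          # Fields: $1 = day, $2 = month, $3 = year, $4 = unseen visitors
          if (mnth == "") {
              yr = $3;
              mnth = $2
          }
          # Flush the finished month before counting the first day of the next
          if ($2 != mnth) {
              printf("%s-%s\t%d\n", mnth, yr, st);
              st = 0;
              mnth = $2;
              yr = $3
          }
          st += $4
      } END { if (mnth != "") printf("%s-%s\t%d\n", mnth, yr, st) }'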
May-2004        123456
Apr-2004        789012
Mar-2004        345678
Feb-2004        901234
Jan-2004        567890
...

        

This example shows the use of the unseen-since query option. For each day more than 366 days in the past, it shows how many visitors were last seen on that day and have never been seen again. A grand total of these defunct visitors is shown at the end.

The second example filters the output via gawk to provide monthly summaries. Some may find the code example useful. Or not.


$ visitors -d /dev/shm/2004.db --query --access-delta
 Days         Nbr      Avg     Avg
Delta    Visitors   Visits   Pages
    0      543210      1.0     5.0
    1        1234      2.1    15.1
    2        5678      2.2    16.2
    3        9012      2.3    17.3
    4        3456      2.4    18.4
    ...
        

This example demonstrates the use of the access-delta query option. The idea is to display the number of visitors grouped by the time difference, in days, between their initial and most recent visits.

Here we can see that there were 3456 visitors who have been visiting for 4 days. Keep in mind that those 4 days may have been years ago or last week, depending on the individual visitor. The results obviously depend heavily on how far back your data/logs go.
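The access-delta table can be reduced further, for instance to the share of visitors whose first and most recent visits fall on different days. The sketch below assumes the four-column layout shown above and treats a delta of 0 as "not seen again on a later day"; the function name returning_visitors and its output format are hypothetical.

```shell
# Hypothetical helper: summarise --access-delta output read from stdin.
# Assumes four columns: days delta, visitors, average visits, average pages.
returning_visitors() {
    awk 'NF == 4 && $1 ~ /^[0-9]+$/ {
        total += $2
        if ($1 > 0)
            returning += $2      # seen again at least one day after the first visit
    } END {
        if (total > 0)
            printf("returning=%d total=%d share=%.1f%%\n",
                   returning, total, 100 * returning / total)
    }'
}
```

It would be used as visitors -d /dev/shm/2004.db --query --access-delta | returning_visitors. The header lines and any non-numeric rows are ignored.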
