[AWFFULL] mangle urls?

Steve McInerney steve at stedee.id.au
Sun Jul 13 18:27:24 EST 2008


Hi Tony,

Sorry for the delay in responding. I suspect, from the time sent, that you sent
this shortly after I left head office in London to fly back to Australia. Nice
timing! :-)
34 hours of continuous travelling, ~20 of which were stuck in a 747. The room is
drifting around me as I type, like a plane going through minor turbulence. Blurg.
So please read "Severely Jet Lagged!" into the responses below too! ;-)


on 12/07/08 02:47 Anthony J. Biacco said the following:
> Is there a way to get awffull to trim back urls for counting?

Within awffull? No. External pre-filtering via sed or awk, with the output
piped into awffull, is likely your best bet.


> i.e. I have some urls in my logs of the format /pt/t/1xxxxxxxxx
> I want to count these, but I'd actually like to just count them
> as 1 url, namely /pt/t. Kind of like a MangleAgent, but for urls?
> Reason being, all the /pt/t/1xxxxxxxxx urls fill up my incremental
> webalizer.current file, so that it's like 500+ megs 10 days through the
> month.

:-(


> Is this possible?

You'd need to rewrite both the request URL and the referrer field.
A prefilter *should* be more efficient: it runs as a separate process in the
pipe, so on a multi core/cpu machine (which, hopefully, is where you'd be doing
the analysis) it can run alongside awffull and make better use of the CPU
resources of said machine.
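
A rough Python sketch of such a prefilter follows. It assumes the trailing path
component always starts with "1", as in the example above, and that the pattern
does not turn up in any other log field. The names MANGLE, mangle_line and
mangle_lines are made up for illustration. Because the substitution runs over
the whole line, it catches the pattern in both the request and the referrer.

```python
import re
from typing import Iterable, Iterator

# Assumption: collapse anything of the form /pt/t/1xxxxxxxxx down to /pt/t.
# The match stops at whitespace, a double quote, '?' or '#', so the rest of
# the log line (protocol, status, query string, closing quote) is untouched.
MANGLE = re.compile(r'/pt/t/1[^\s"?#]*')


def mangle_line(line: str) -> str:
    """Return one log line with every /pt/t/1... path collapsed to /pt/t."""
    return MANGLE.sub("/pt/t", line)


def mangle_lines(lines: Iterable[str]) -> Iterator[str]:
    """Lazily rewrite a stream of log lines, one line at a time."""
    for line in lines:
        yield mangle_line(line)
```

Wired up to read stdin and write stdout, for instance with
sys.stdout.writelines(mangle_lines(sys.stdin)), this can sit in front of
awffull in a pipe.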


> On a side note, anybody know of any other good ways of paring down the
> incremental file, or alternate methods of keeping the tracked data, as
> when awffull runs, I have it eating up to 500+ megs in memory while it
> processes, which can take a bit. I'm processing about 2G of logs a day.

Two major issues to worry about:
1. the hash table is probably too small for your data set size. MAXHASH in
awffull.c should probably be increased.
2. any sorting will *hurt*. Basically, to do a sort, the entire... "table" for
that data is copied and then sorted. I've toyed around with more memory
efficient sorting, but haven't coded anything firm from that.

To help minimise the pain, a simple Perl (Python, awk, etc.) script could
ditch any results in the incremental file that are not in the top 2000 for
that "data table", for example. Not perfect, but a starting point.
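
As a minimal sketch of just the pruning step in Python: the layout of the
incremental file is not covered here, so reading and writing it is left out
entirely. The Record type, the keep_top name and the (key, count) shape are
assumptions for illustration only.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record shape: (key, hit count), e.g. a URL and its hits.
Record = Tuple[str, int]


def keep_top(records: Iterable[Record], limit: int = 2000) -> List[Record]:
    """Return the `limit` records with the highest counts, largest first.

    heapq.nlargest keeps at most `limit` items in memory at a time, so the
    whole table never has to be copied and sorted.
    """
    return heapq.nlargest(limit, records, key=lambda rec: rec[1])
```

For example, keep_top([("/a", 5), ("/b", 9), ("/c", 1)], limit=2) keeps only
the "/b" and "/a" entries.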


I need to become far more proficient with Python for my new job, so I'll treat
this as a "sooner" rather than a "later", just to learn Python. :-)


Cheers!
- Steve

