[AWFFULL] mangle urls?

Steve McInerney steve at stedee.id.au
Wed Jul 23 19:35:18 EST 2008


on 22/07/08 09:02 Anthony J. Biacco said the following:
> On Sun, 13 July 2008, 3:50 pm, Anthony J. Biacco wrote:
>> Will I have to do an ignoreurl then on the same expression?
> 
>> I've never tried it but I suppose GroupAndHideUrl could work. Is this 
>> correct, Steve?
> 
> That won't work as it'll still keep the data in the Inc file, just not show
> it on the report.

Yup. Perversely, in your case, a performance (speed) improvement was to leave
all Group'ing till report production time, at the cost of gobbling memory.


> I also just noticed I have a bigger problem. I got about 10M ip
> addresses/sites in my inc file for the month so far. I cant "ignoresite *"

*ouch*


> because that'll ignore every line in my log file. How would I go about not
> adding any sites to my Inc file, i.e. not doing any processing on site
> data? I suppose I could use sed/awk to change every ip address in the file
> to a single ip address, therefore giving me just 1 ip in the Inc file (as

No, that'll screw up all sorts of other things too. Visits and obviously Sites
principally - and any of the metrics that in turn use those. The IP or Host
address is a fairly common key in all sorts of places.

What could be do-able is to reduce the 4 part IP address to just 3, so that
10.20.30.1 through 10.20.30.255 all become 10.20.30.1.
Not ideal (what is?), but it may be a better "survival" tactic.

e.g.
sed -r 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+ /\1.1 /' access.log
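
To make the effect concrete, here's a rough sketch of that sed wrapped in a
shell function and run over a few made-up log lines. The function name
collapse_last_octet and the sample lines are invented for illustration, and it
uses -E rather than -r on the assumption that your sed (GNU or BSD) accepts it.

```shell
# Collapse the last octet of the client address (first field) to ".1".
# Lines that do not start with a dotted-quad IP pass through unchanged.
collapse_last_octet() {
    sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+ /\1.1 /'
}

# Made-up sample lines: two hosts in the same /24 and one resolved hostname.
printf '%s\n' \
    '10.20.30.7 - - [21/Jul/2008:16:49:41 -0600] "GET /a HTTP/1.1" 200 49' \
    '10.20.30.200 - - [21/Jul/2008:16:49:42 -0600] "GET /b HTTP/1.1" 200 49' \
    'host.example.com - - [21/Jul/2008:16:49:43 -0600] "GET /c HTTP/1.1" 200 49' |
    collapse_last_octet
```

The first two lines should both come out starting with 10.20.30.1, which is
the whole point: far fewer distinct sites, at the cost of lumping neighbours
together. Hostnames are left alone.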


> steve recommended I do for my /pt/t/1* problem (I don't care a whole lot
> about site data)). Any sed/awk experts out there that can help a brother
> out on the replacement command I would do for these 2 fields? Given a log
> line to the effect of:
>
> 199.231.48.128 - - [21/Jul/2008:16:49:41 -0600] "GET /pt/t/1216680581236?&d=2008&a=Microsoft%20Internet%20Explorer%20Mozilla/4.0%20%28compatible%3B%20MSIE%206.0%3B%20Windows%20NT%205.1%3B%20SV1%29&s=7%2C4852%2COther%2C3430&u=http%3A//reviews.cnet.com/car-gps-navigation/garmin-nuvi-660/4852-3430_7-32078943.html%3Ford%3DcreationDate+desc&p=0&q=1.1 HTTP/1.1" 200 49 "http://reviews.cnet.com/car-gps-navigation/garmin-nuvi-660/4852-3430_7-32078943.html?ord=creationDate+desc" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" -
> 
> To change it to (and given that I'd have to pipe the log file on stdin to
> awffull instead of using the LogFile config directive):
>
> 1.1.1.1 - - [21/Jul/2008:16:49:41 -0600] "GET /pt/t/1 HTTP/1.1" 200 49 "http://reviews.cnet.com/car-gps-navigation/garmin-nuvi-660/4852-3430_7-32078943.html?ord=creationDate+desc" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" -


Something like this?
gawk '{ if ($7 ~ /^\/pt\/t\/1/) { $7="/pt/t/1"; } print $0 }' logfile

should work? (hopefully... worked for me! ;-) )
May want to check for POST vs GET and other funkies, and act differently, but
the above is a starter.
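
If you do want to treat GET and POST differently, one possible variation is
sketched below. It only rewrites the URL when the method is GET and passes
everything else through. The function name trim_pt_urls and the sample lines
are made up, and it assumes the usual combined log layout where whitespace
splitting puts the quoted method in $6 and the URL in $7.

```shell
# Trim /pt/t/1... URLs to /pt/t/1, but only for GET requests.
# Note: assigning to $7 makes awk rebuild the line with single spaces
# between fields, which matches what a normal access log already has.
trim_pt_urls() {
    awk '$6 == "\"GET" && $7 ~ /^\/pt\/t\/1/ { $7 = "/pt/t/1" } { print }'
}

# Made-up sample lines: one GET that should be trimmed, one POST left as-is.
printf '%s\n' \
    '1.2.3.4 - - [21/Jul/2008:16:49:41 -0600] "GET /pt/t/1216680581236?&d=2008 HTTP/1.1" 200 49 "-" "-"' \
    '1.2.3.4 - - [21/Jul/2008:16:49:42 -0600] "POST /pt/t/1216680581237 HTTP/1.1" 200 49 "-" "-"' |
    trim_pt_urls
```

Dropping the $6 test gets you back to the behaviour of the one-liner above.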


You could possibly combine the two into a single sed or g?awk, but you'd still
end up having two edits, so the savings might be... small.
For the first edit, sed fits my tired brain more easily than gawk would.
Conversely, gawk for the 2nd.
Just pipe them together and away you go!
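
Roughly, the piped version might look like the sketch below. The function name
mangle_log is invented here, -E is assumed to be accepted by your sed, and the
awffull invocation in the comment is only a guess at how you'd run it; use
whatever options and config you normally do.

```shell
# Apply both edits in one pass over stdin:
#   1. sed collapses the last octet of the client IP to ".1"
#   2. awk trims /pt/t/1... URLs down to /pt/t/1
mangle_log() {
    sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+ /\1.1 /' |
        awk '$7 ~ /^\/pt\/t\/1/ { $7 = "/pt/t/1" } { print }'
}

# Shortened version of the sample line from this thread.
printf '%s\n' \
    '199.231.48.128 - - [21/Jul/2008:16:49:41 -0600] "GET /pt/t/1216680581236?&d=2008 HTTP/1.1" 200 49' |
    mangle_log

# Hypothetical real-world use, feeding awffull on stdin:
#   mangle_log < access.log | awffull
```

The sample line should come out with 199.231.48.1 as the address and /pt/t/1
as the URL, everything else untouched.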


> -----Original Message-----
> From: javier wilson [mailto:javier at guegue.net]
> Sent: Monday, July 14, 2008 10:25 AM
> To: Anthony J. Biacco
> Subject: RE: [AWFFULL] mangle urls?
> 
> On Sun, 13 July 2008, 3:50 pm, Anthony J. Biacco wrote:
>> Will I have to do an ignoreurl then on the same expression?
> 
> I've never tried it but I suppose GroupAndHideUrl could work. Is this 
> correct, Steve?


Ah I was wondering why I never saw this - apparently wasn't sent to me :-)



Cheers!
- Steve



> 
> javier
> 
>> -----Original Message-----
>> From: javier wilson <javier at guegue.net>
>> Sent: Sunday, July 13, 2008 12:50 PM
>> To: Steve McInerney <steve at stedee.id.au>
>> Cc: Anthony J. Biacco <abiacco at formatdynamics.com>; awffull at stedee.id.au <awffull at stedee.id.au>
>> Subject: Re: [AWFFULL] mangle urls?
>> 
>> On Sun, 13 July 2008, 2:27 am, Steve McInerney wrote:
>>> on 12/07/08 02:47 Anthony J. Biacco said the following:
>>>> Is there a way to get awffull to trim back urls for counting?
>>> Within awffull? No. External pre-filtering via SED or AWK and pipe into
>>>  awffull is likely your best bet.
>>> 
>>> 
>>>> i.e. I have some urls in my logs of the format /pt/t/1xxxxxxxxx I
>>>> want to count these, but I'd actually like to just count them
>>>>  as 1 url, namely /pt/t. Kind of like a MangleAgent, but for urls? 
>>>> Reason being, all the /pt/t/1xxxxxxxxx urls fill up my incremental 
>>>> webalizer.current file, so that it's like 500+ megs 10 days through
>>>> the month.
>>> :-(
>>> 
>>> 
>>>> Is this possible?
>> I think you can use GroupURL, like this /pt/t/1*  MyGroup
>> 
>> You'll get hits and volume on 1* as a total, but not visits :(
>> 
>> javier
>> 
>> 
> 
> 

