16 Oct 2005

a jazzy web stats

over the past couple of months, i've been experimenting with analysing web stats. as much as i like modlogan, i can't bring myself to write text parsing things in C. c'mon, that is probably the worst language to do pattern matching in. drupal's stats are alright, except that it gets overwhelmed by spammers very easily.

i went through a couple of iterations: first a postgresql database, then a mod_python frontend with a text file backend, and finally a bunch of python/shell scripts that produce summaries which are then read via javascript and displayed in a nice table. the earlier two methods were just way too slow once you get to around half a million lines of logs.

so now, after some painful javascript hacking, i have stats dot liquidx dot net. it's a bit rough around the edges right now, but it's the closest thing i've got to a usable web stats tool. the backend data is all generated by a combination of python scripts that act as filters (breaking up the apache log lines and filtering out things like search engine referrers or spammers) and categorisers (keeping an index of search engines and user agents), plus an evil combination of sed, awk and grep chained together by bits of bash.
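for the curious, a filter script along those lines might look something like this. this is a hedged sketch, not the actual scripts behind stats dot liquidx dot net: the regex assumes apache's "combined" log format, and the search engine index here is just a couple of illustrative entries.

```python
import re

# apache "combined" log format, split into named fields.
# field names are my own; the real scripts may slice things differently.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# hypothetical search engine index for the categoriser step;
# a real index would be much longer.
SEARCH_ENGINES = ("google.", "yahoo.", "search.msn.")

def parse_line(line):
    """break an apache log line into a dict of fields, or None on junk."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def is_search_referrer(entry):
    """filter predicate: True when the referrer looks like a search engine."""
    return any(se in entry["referrer"] for se in SEARCH_ENGINES)

if __name__ == "__main__":
    line = ('1.2.3.4 - - [16/Oct/2005:12:00:00 +0000] "GET / HTTP/1.1" '
            '200 1234 "http://www.google.com/search?q=stats" "Mozilla/5.0"')
    entry = parse_line(line)
    print(entry["status"], is_search_referrer(entry))
```

chaining a few filters like this over stdin is what keeps it fast enough for half a million lines, since nothing ever has to sit in a database.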

now i know why it is so painful to do web stats well, and why people actually pay someone else to do the hard work. figuring out what you should throw away, and what you shouldn't, is pretty difficult.

i should stop wasting my time on something as mundane as web stats :(
