21 Jul 2005

parsing apache logs in python

recently, i've been intrigued with getting more data from my apache logs than current apache log analysers allow. i'm currently using modlogan, while it is powerful and fast, it is a bit limiting on what i can do. for instance, i want to do queries on the log data instead of looking at a bunch of spam referrers in my logs.

i want to do queries like:

"tell me what user agents most referral spammers use"
"tell me the number of referral spammers who do not get a 403"
"in the last 7 days, how many people were referred from apple.com and rank them by their user agents"

other things i want to be able to do is have a list of top referrers, and being able to click on each one and see the break down of statistics from each referrer. it would make procrastination a lot more interesting.

so i've hacked up some neat modules over the weekend to eventually end up being a collection of components that allow me to make interesting statistics from my logs. right now they're in my subversion repository viewable at here. (NOTE: the code is really sucky!! but it does the job - i'll rewrite it when i figure out what exactly works and doesn't work!)

the system is a bit odd and relies on sqlite. i chose sqlite as a storage medium because it allows construction of sql queries which after being satisfied that is a good way to search true my apache log, i'll write the db backend to use pgsql (which should be relatively easy given that pysqlite conforms to the python DB-API PEP, and i can only assume there's another postgresql python binding that does the same!)

the backend simply looks at each log line, extracts all the vital components using an ugly looking regex (in apache.py), then runs through a series of regex patterns on specific components to "tag" the log line.

for instance, i go through the referrers to look for search engines and tag those lines as "ref_search" and also ones with spam referrers "ref_spam". other interesting and fun stats include looking at user agents and determining what OSes and browsers people are using. those are tagged like "os_series60" (YES! someone wasted their bandwidth by looking at my webpage on a mobile phone! i hope that was a well spent $0.60 for my massively useless weblog :) you can also define a bunch of custom tags that match any thing you find particularly interesting. for mine, i've just written some file types that i see mostly in my logs and also major referrers. i like to be able to construct a query to filter them all out and so i can see if there are any tiny blogs that for some reason or another linking to me :)

after tagging each line, the engine puts all this into an sqlite relational database. two weeks worth of logs equates to around 200K lines. i have around 317 unique tags, this includes browser make and version numbers, eg. cl_msie_5.5 or cl_safari_412. with just 317 tags against 200K lines, i end up having nearly 1.2M of associations (log <-> tag). searching through them using SQL on sqlite is not fun. it's actually pretty slow to get a match on keywords.

that maybe due to the fact i never took a database course, so i decided to go googling on some of the basics of doing efficient joins and narrowing down searching tag like systems. i've come across some sites that talk about tagging and theorising how things like flickr and delicious work efficiently (if at all!) but it seems a lot of them don't have much of an idea of how to make reasonable and scalable databases of tags searched efficiently and with arbitrarily complex logic. for instance, delicious only allows searching with intersections and not unions, possibly for that reason.

maybe i'm an sql dummy, but on my p4 2.4GHz machine, a search for a single keyword takes around 10 seconds. my sqlite database is over 100M big as well, although it is very compressible, mainly due to sqlite storing things in text (i presume.)

anyway more updates as i squeeze small amounts of time to hack on this. statistics is an oddly addictive thing!

You can reply to me about this on Twitter: