So I came into work the other day and the first thing one of our web admins says to me is, “Were we Slashdotted yesterday?” I had just been reviewing our web activity and didn't think that was the case. Still, I did a quick check on our Google Analytics account and, as expected, nothing was out of the ordinary.
The reason he asked the question was that our Apache log file that day was over 10 times the size of the file from the previous day. It sure looked like the server was getting hammered.
So, I decided to take a look and see what the problem was. I pulled down the Apache log and imported it into Strata. See the video below for a step-by-step look:
(And here's an embeddable YouTube version…)
Now, as an aside, if you've ever tried to look at a raw Apache log in Excel or Notepad, you'll see that it is space-delimited and the date/time format is not trivial to deal with. Not only that, but the sheer size of a log file makes it almost impossible to handle in a spreadsheet. The one I was dealing with was over 100,000 records long, and that was just one day.
Strata can easily handle the data size, but the format is enough to give any software fits. So, we wrote a quick Apache log parser extension that makes it really simple to just point the software to your Apache log and import it. The resulting table is nicely formatted and everything is ready to go (including those pesky date fields). You can get the extension here.
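For the curious, here is a rough idea of what parsing one of these lines involves. This is a minimal Python sketch, not the code behind the Strata extension: it assumes the standard Apache Common Log Format (with the optional Combined referer and user-agent fields), and the sample line, the `LOG_PATTERN` name, and the `parse_line` helper are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout: Apache Common Log Format, optionally followed by the
# two extra quoted fields (referer, user agent) of the Combined format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # The "pesky" timestamp looks like 10/Oct/2008:13:55:36 -0700
    # (%b assumes English month abbreviations)
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # Apache writes "-" when no response body was sent
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    # The request field is normally "METHOD path PROTOCOL"
    parts = rec["request"].split()
    rec["method"], rec["path"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return rec

# Hypothetical sample line (the address is from a documentation-only range)
sample = '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(sample)
print(record["ip"], record["time"].isoformat(), record["path"], record["status"])
```

A custom `LogFormat` directive or escaped quotes inside the request string would need a more forgiving pattern than this one.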
So, back to the issue at hand… after I imported it, I played around with the data to identify what was causing the problem. I grouped the IP addresses together to see if I could pinpoint a few culprits. And, indeed, I found two:
- An unknown bot
- Our own server
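If you don't have Strata handy, the same "group by IP address" step can be approximated in a few lines of Python. This is only a sketch of the idea, assuming the client address is the first space-delimited field of each line (as it is in the default Apache formats); the `top_clients` helper and the sample lines are hypothetical.

```python
from collections import Counter

def top_clients(lines, n=5):
    """Count requests per client address and return the n busiest clients."""
    # The client address is assumed to be the first space-delimited field
    counts = Counter(line.split(" ", 1)[0] for line in lines if line.strip())
    return counts.most_common(n)

# Hypothetical log lines; the addresses come from documentation-only ranges
sample_lines = [
    '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] "GET /a HTTP/1.1" 404 512',
    '203.0.113.7 - - [10/Oct/2008:13:55:37 -0700] "GET /a/a HTTP/1.1" 404 512',
    '198.51.100.2 - - [10/Oct/2008:13:55:38 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.7 - - [10/Oct/2008:13:55:39 -0700] "GET /a/a/a HTTP/1.1" 404 512',
]

for ip, hits in top_clients(sample_lines, n=2):
    print(ip, hits)
```

On a real file you would pass the open file object instead of a list, for example `top_clients(open("access.log"))`, where `access.log` stands in for wherever your server writes its log.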
After a little more research, I found out that the bot was searching for all kinds of non-existent URLs and was basically appending one path to another to get some really bizarre URLs.
I then took a look at the records from our own server and saw that for each of these non-existent URLs, we were serving up a “Not Found” page, thus doubling the trouble this bot was causing.
In the end, I had our web admin look into the problem. It turns out we were poorly formatting some of the URL paths on the site. Most bots can handle both absolute and relative paths, but some can't. The bots that can't handle relative paths end up going a little nuts as they spider the website. (I couldn't find a really nice, clean explanation of this issue via Google, but this thread is close enough for those who are interested.)
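To make the relative-path problem a bit more concrete, here is a small Python sketch. I don't know how this particular bot actually resolved links, so `resolve_naively` is purely a guess at the kind of bug involved, and the page URL and link are invented. The correct behavior comes from the standard library's `urljoin`, which resolves a relative link against the directory of the page it appears on.

```python
from urllib.parse import urljoin

def resolve_correctly(page_url, href):
    """Resolve a link the standard way, relative to the page's directory."""
    return urljoin(page_url, href)

def resolve_naively(page_url, href):
    """Hypothetical buggy crawler: glues the link onto the full page URL."""
    return page_url.rstrip("/") + "/" + href.lstrip("/")

# Invented example: a page containing a relative link
page = "http://www.example.com/products/index.html"
href = "support/faq.html"

good = resolve_correctly(page, href)
bad = resolve_naively(page, href)
# If the bogus URL still returns a page containing the same relative link,
# the crawler follows it again and the path keeps growing.
worse = resolve_naively(bad, href)

print(good)
print(bad)
print(worse)
```

Each pass through the naive resolver adds another copy of the path, which would explain URLs that look like one path appended to another.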
Anyway, it was nice to be able to just pull out Kirix Strata and, within a few minutes, figure out what the issue was. For those of you who are interested in your web logs, give the Apache Web Log import extension a spin and let us know what you think.