Extracting the Source of Apache Web Requests

I had a query come in recently from one of our users who told me our website was running slowly. We have quite a beefy cloud-based VM, so I knew it had enough oomph - so what was the cause? To begin my hunt, I headed over to the Apache logs, which would help me find the source of any web requests. You can (usually) find them at:

/var/log/apache2

At this point, it all depends on how you set up your sites in Apache, but for our purposes, let’s assume my file is called: access.log
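If you want a quick peek at what lives in that directory, a simple ls will do. On a stock Debian/Ubuntu install the listing often looks something like this, though the exact files depend on your setup and log rotation:

> ls /var/log/apache2
access.log  error.log  other_vhosts_access.log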

Log File Contents

If we take a look inside, this is the sort of thing you can expect to see:

127.0.0.1 - - [10/Oct/2018:15:01:20 -0700] "GET /file.jpg HTTP/2" 200 1222
    1     2 3              4                5      6        7      8   9

To make it easier to discuss, I have numbered the various columns; most are obvious, but let’s take a look at each in turn.

  1. The IP address. This one is the localhost, but it could be anything.
  2. The identity of the client - often unknown, so marked with a hyphen in our system.
  3. The user who made the request - again, often unknown for us.
  4. The time and date the request was made.
  5. The request method - most will be GETs.
  6. The resource referenced.
  7. The protocol used.
  8. The status code - in this example, 200 means OK.
  9. The size of the response in bytes.
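As an aside, the field-splitting trick we are about to use on the IP address works for any of these columns. awk splits each line on whitespace, which (thanks to the space inside the date) puts the status code in its ninth field, so a quick sketch for tallying statuses would look something like this - the output here is invented purely for illustration:

> awk '{ print $9 }' access.log | sort | uniq -c | sort -rn
    980 200
     15 404
      5 301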

For now, I was just interested in who was making lots of requests, because it might have been someone deliberately hitting the site (DoS-style) or someone being a bit mean and scraping the contents too anti-socially.

Extracting the IP Address

Let’s pull out the IP address which sits in column one, but I’ll only work with the first 6 entries to keep things small. Oh, needless to say, these are completely fake IP addresses!

> awk '{ print $1 }' access.log | head -6

123.12.11.133
34.245.11.22
123.12.11.133
233.2.2.33
233.2.2.33
123.12.11.133
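Incidentally, if you would rather not chain on a separate head process, awk can limit the lines itself; as far as the output goes, this is equivalent:

> awk 'NR <= 6 { print $1 }' access.log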

Counting Occurrences

OK, that’s a good start, but what would be more useful is if I could see how many times each IP address appeared. Let’s start using some in-built UNIX commands to achieve that.

> awk '{ print $1 }' access.log | head -6 | sort | uniq -c | sort -n 
1 34.245.11.22
2 233.2.2.33
3 123.12.11.133

Much better. To get the effect I wanted, I have sorted the IPs (sort), collapsed the duplicates while counting them (uniq -c), and then numerically ordered the results (sort -n) so that the biggest offenders are at the bottom.
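One detail that trips people up: uniq only collapses adjacent duplicates, which is why that first sort is not optional. Here is a tiny demonstration with some stand-in values:

> printf 'b\na\nb\n' | uniq -c
      1 b
      1 a
      1 b
> printf 'b\na\nb\n' | sort | uniq -c
      1 a
      2 b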

Looking Up Hostnames

That really helps - I now know that the machine at 123.12.11.133 is being completely outrageous and making 3 requests to my website. Goodness gravy! But who is that person or entity? I need to do some digging.

> dig +short -x 123.12.11.133

madeupname.com

Ah - so madeupname.com is the culprit. That’s much easier to talk about and gives me a clue about why they might be hammering my site with not one, not two but a ridiculous three requests.
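A side note on that command: the -x flag asks dig to perform a reverse (PTR) lookup, and plenty of addresses have no PTR record at all, in which case dig prints nothing. If you want a real address to test against, Google's public resolver had a record at the time of writing:

> dig +short -x 8.8.8.8
dns.google.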

Combining the Commands in a Script

Now let’s combine the two commands so that my output includes a count of requests, the IP address AND the hostname it refers to. For this, I am going to use a BASH script, so type the following into a file called ips.sh and save it.

 1  #! /bin/bash
 2
 3  IFS='
 4  '
 5  command=`awk '{ print $1 }' $1 | sort | uniq -c | sort -n | tail -15`
 6
 7  for line in $command
 8  do
 9      ip=`echo $line | awk '{ print $2 }'`
10      hostname=`dig +short -x $ip`
11      echo $line $hostname
12  done

Don’t forget to set the executable flag or you won’t be able to run it.

> chmod +x ips.sh

Looking Under the Hood

To explain how this works, I’m going to start at the top and work my way down. On lines 3 and 4, I initialise IFS, which is the internal field separator. By setting it to a newline like this, I am telling BASH not to chop things up based on spaces, which is the default, but to treat whole lines as tokens. That will make more sense in a minute.

Next, on line 5, we run our command to grab the IP addresses and place the result in a variable called command. I’ve made a couple of changes here, too. I’m now reading the log file name from the first argument of the script on the command line, which means we run it like this:

> ./ips.sh access.log

I am also no longer just reading the first 6 lines of the file - I am using ALL of it, which can amount to tens of thousands of lines. Once done, though, I only want the top 15 abusers - hence the tail -15 at the end.

Lines 7, 8 and 12 handle the looping over the results stored in command. Remember we set the IFS to use whole lines? Without that, we would be iterating over the space-delimited words on each line rather than the lines themselves, and ultimately, I needed to treat each line as one entity.

OK, onto the meat of the script. Line 9 uses awk again to yank out the IP address from column 2. Bear in mind, our first awk command produced output like this:

100 127.0.0.1

where 100 was the count of IP instances and the second part was the actual IP address; that’s why we take $2, the second column. Line 10 looks up the hostname using that IP address and dig. Lastly, line 11 puts it all back together and outputs it for your perusal. And there you go - one quick and easy script to work out who is hitting your website the most:

1 34.245.11.22 innocenttraveller.com
2 233.2.2.33 whitehouse.gov
3 123.12.11.133 madeupname.com
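One footnote before I wrap up: if the IFS business still feels abstract, here is a tiny sketch (with made-up strings) you can paste into a BASH prompt to see exactly what it changes:

> lines=$'1 one\n2 two'
> for token in $lines; do echo "[$token]"; done
[1]
[one]
[2]
[two]
> IFS=$'\n'
> for token in $lines; do echo "[$token]"; done
[1 one]
[2 two]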

Now all I need is another script called: stophittingmywebsite.sh.
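And in case you fancy a head start on that one, here is a deliberately blunt sketch - assuming a Linux box where you have root and iptables, and noting that a real version would want some safety rails before dropping traffic:

#! /bin/bash
# stophittingmywebsite.sh - hypothetical sketch: drop all traffic from one IP
# usage: sudo ./stophittingmywebsite.sh 123.12.11.133
iptables -A INPUT -s "$1" -j DROP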


Hi! Did you find this useful or interesting? I have an email list coming soon, but in the meantime, if you read anything you fancy chatting about, I would love to hear from you. You can contact me here or at stephen ‘at’ logicalmoon.com