Ruby Script to Search Apache Logs for High Frequency Clients

I wrote a quick Ruby script to scour my Apache access logs for IPs that are hitting my site too frequently (bad bots, for example). The command line arguments are simple:

$ ruby find-frequent-clients.rb \
--apache-access-log=/path/to/your/log \
--seconds=3600 \
--request-limit=7200 \
--log-time-zone=PST

That command finds any client IPs that have hit my web server more than 7,200 times in the last hour (3,600 seconds), i.e., more than twice per second on average. The output is a newline-separated list of IP addresses (optionally with a hit count if --show-count=1 is added). Here’s how it works:

File: find-frequent-clients.rb
require 'date'
require 'time'
# Process command line arguments.  Filter only args starting with --
args = {}
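ARGV.each do |arg|
  # Split on the first "=" only, so values that contain "=" survive intact
  name, value = arg.split("=", 2)
  next unless name && name.start_with?("--")
  # --apache-access-log becomes :apache_access_log
  args[name[2..-1].gsub("-", "_").to_sym] = value
end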

# Check that we have the bare essentials to proceed
raise "You must specify the full path to an Apache access log file with --apache-access-log" unless args[:apache_access_log]
raise "You must specify the maximum amount of recent seconds to consider with --seconds" unless args[:seconds]
raise "You must specify the maximum requests allowed per #{args[:seconds]} seconds with --request-limit" unless args[:request_limit]
raise "You must specify the time zone of the Apache logs with --log-time-zone e.g., EST" unless args[:log_time_zone]
raise "The Apache access log file specified does not exist or is not readable: #{args[:apache_access_log]}" unless FileTest.readable?(args[:apache_access_log])

# Open the file and read the lines in reverse; exit once time stamps are beyond our time threshold
log_array = []
log_snapshot = File.readlines(args[:apache_access_log])
start_time = Time.now.to_i
log_snapshot.reverse_each do |line|
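  line_array = line.split(" ")
  next if line_array.length < 4 # skip blank or malformed lines
  date_time = line_array[3][1..-1] # drop the leading "["
  date_time[11] = " " # replace the ":" between date and time so DateTime.parse accepts it
  date_time = Time.iso8601(DateTime.parse(date_time+" "+args[:log_time_zone]).to_s).to_i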
  if date_time > (start_time - args[:seconds].to_i)
    log_array << [line_array[0], date_time]
  else
    break
  end
end

# Use a hash to collect the counts of the IPs
log_hash = Hash.new(0)
log_array.each do |log|
  log_hash[log[0]]+=1
end

# Collect the offenders: IPs whose request count exceeds the limit
offenders = log_hash.select { |_ip, count| count > args[:request_limit].to_i }

# Output the offending IPs, one per line; optionally show the offending count
offenders.each do |ip, count|
  puts args[:show_count] ? "#{ip} => #{count}" : ip
end
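
To try the counting step on its own, here is a minimal sketch that swaps a real log for a few made-up [ip, timestamp] pairs. The IP addresses, timestamps, and the limit of 2 are all hypothetical values chosen for illustration.

```ruby
# Hypothetical [ip, epoch_seconds] pairs, shaped like the entries the script collects.
entries = [
  ['192.0.2.10', 1_000], ['192.0.2.10', 1_001], ['192.0.2.10', 1_002],
  ['198.51.100.7', 1_001]
]
request_limit = 2 # hypothetical threshold

# Hash.new(0) returns 0 for unseen keys, so += 1 works on the first hit.
counts = Hash.new(0)
entries.each { |ip, _time| counts[ip] += 1 }

# Keep only the IPs whose count exceeds the limit.
offenders = counts.select { |_ip, count| count > request_limit }
offenders.each { |ip, count| puts "#{ip} => #{count}" }
```

Only the first address is reported here, because its three hits exceed the limit while the second address has a single hit.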

Note: The script assumes each log line starts in the format: aa.bb.cc.dd - - [datetime]
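
To make that format assumption concrete, here is a rough sketch of pulling the IP and timestamp out of a single line. It assumes Apache's standard Common Log Format, where the bracketed timestamp carries a numeric UTC offset such as -0700. The sample line and the parse_log_line helper are hypothetical and are not part of the script above.

```ruby
require 'time'

# Returns [ip, epoch_seconds], or nil when the line does not match the assumed layout.
def parse_log_line(line)
  match = line.match(/\A(\S+) \S+ \S+ \[([^\]]+)\]/)
  return nil unless match

  ip = match[1]
  stamp = match[2]
  # %z reads the numeric offset, e.g. "-0700"
  [ip, Time.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z').to_i]
rescue ArgumentError
  nil # the timestamp was not in the expected format
end

# Hypothetical sample line, made up for illustration.
sample_line = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

ip, epoch = parse_log_line(sample_line)
puts "#{ip} at #{Time.at(epoch).utc}"
```

If your logs do include the offset, parsing it this way would make a separate time zone argument unnecessary. Lines that do not match return nil, so they can be skipped instead of raising.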

Installing Gitweb on Fedora Linux and Apache

My next natural step after getting my projects up and running with Git was to install a web interface. Gitweb was my choice because:

  • it’s available via yum with Fedora
  • it provides up-to-date diff information
  • it’s part of the overall Git package, so it’s tightly integrated

Installation was ultimately quite simple, but I found the install docs to be less than helpful for people like me who want immediate functionality and will get to the tweaks and details later.

Step 1: Install Gitweb

sudo yum install gitweb

This will install a few files at /var/www/git.  You shouldn’t need to do anything to them.

Step 2: Create /etc/gitweb.conf

You need a configuration file to tell Gitweb where to look for your projects.  You can change this folder to wherever your projects will be.  The file lives in /etc, so run the command as root.

$ echo "\$projectroot = '/srv/git/';" > /etc/gitweb.conf

Step 3: Edit Apache Configuration File

This configuration file assumes you are running your site as a virtual host.

/etc/httpd/conf.d/git.conf

<VirtualHost *:80>
    DocumentRoot /var/www/git
    ServerName git.yourproject.com

    <Directory /var/www/git>
        Allow from all
        AllowOverride all
        Order allow,deny
        Options ExecCGI

        <Files gitweb.cgi>
            SetHandler cgi-script
        </Files>
    </Directory>

    DirectoryIndex gitweb.cgi
    SetEnv  GITWEB_CONFIG  /etc/gitweb.conf
</VirtualHost>

Step 4: Tweak Your Repository’s Config File

Gitweb lists two key elements at the start of your project’s page: description and owner.  To have these display something appropriate, edit /srv/git/yourproject.git/description:

My Awesome Project

… and add this to /srv/git/yourproject/.git/config:

[gitweb]
        owner = "Mark McBride"

Step 5: Restart Apache

That’s it.  Just restart Apache and you should find Gitweb running at the domain you’ve specified.
