Ruby Script to Search Apache Logs for High Frequency Clients

I wrote a quick Ruby script to scour my Apache access logs for IPs that are hitting my site too frequently, such as bad bots. The command line arguments are simple:

$ ruby find-frequent-clients.rb \
--apache-access-log=/path/to/your/log \
--seconds=3600 \
--request-limit=7200 \
--log-time-zone=PST

That command finds any client IPs that have hit my web server more than 7200 times in the last hour (3600 seconds), i.e., more than twice per second on average. The output is a newline-separated list of IP addresses (optionally with a hit count if --show-count=1 is added). Here’s how it works:

File: find-frequent-clients.rb
require 'date'
require 'time'
# Process command line arguments.  Filter only args starting with --
args = {}
$*.each do |arg|
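  # Split on the first "=" only, so values that contain "=" stay intact
  key, value = arg.split("=", 2)
  if arg.start_with?("--")
    # "--apache-access-log" becomes :apache_access_log
    args[key[2..-1].gsub("-", "_").intern] = value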
  end
end

# Check that we have the bare essentials to proceed
raise "You must specify the full path to an Apache access log file with --apache-access-log" unless args[:apache_access_log]
raise "You must specify the maximum amount of recent seconds to consider with --seconds" unless args[:seconds]
raise "You must specify the maximum requests allowed per #{args[:seconds]} seconds with --request-limit" unless args[:request_limit]
raise "You must specify the time zone of the Apache logs with --log-time-zone e.g., EST" unless args[:log_time_zone]
raise "The Apache access log file specified does not exist or is not readable: #{args[:apache_access_log]}" unless FileTest.readable?(args[:apache_access_log])

# Open the file and read the lines in reverse; exit once time stamps are beyond our time threshold
log_array = []
log_snapshot = File.readlines(args[:apache_access_log])
start_time = Time.now.to_i
log_snapshot.reverse_each do |line|
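  line_array = line.split(" ")
  # Field 3 looks like e.g. "[10/Oct/2000:13:55:36"; drop the leading "["
  date_time = line_array[3][1..-1]
  # Swap the ":" between the date and the time for a space so DateTime.parse can read it
  date_time[11] = " "
  date_time = Time.iso8601(DateTime.parse(date_time+" "+args[:log_time_zone]).to_s).to_i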
  if date_time > (start_time - args[:seconds].to_i)
    log_array << [line_array[0], date_time]
  else
    break
  end
end

# Use a hash to collect the counts of the IPs
log_hash = Hash.new(0)
log_array.each do |log|
  log_hash[log[0]]+=1
end

# collect the offenders in an array
offenders = log_hash.select { |_ip, count| count > args[:request_limit].to_i }.to_a

# output the offending IPs, 1 per line; optionally show the offending count
offenders.each { |ip, count| puts(args[:show_count] ? "#{ip} => #{count}" : ip) }

Note: This assumes that your logs are in the format: aa.bb.cc.dd - - [datetime]
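
To make that format assumption concrete, here is a minimal sketch of how a single line in that layout breaks apart into an IP and a timestamp. The sample line and the parse_line helper are hypothetical, invented for illustration. Note that this sketch reads the UTC offset that Apache writes right after the timestamp, which is a different approach from the script above (the script relies on --log-time-zone instead), so treat it as one possible variation rather than a description of the script.

```ruby
require 'time'

# Hypothetical sample line in the assumed layout (invented for illustration).
SAMPLE_LINE = '192.0.2.10 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'

# Hypothetical helper: returns [ip, epoch_seconds] for one log line.
def parse_line(line)
  fields = line.split(" ")
  ip = fields[0]
  # fields[3] is "[10/Oct/2023:13:55:36" and fields[4] is "-0700]"
  stamp = fields[3].delete("[") + " " + fields[4].delete("]")
  [ip, Time.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").to_i]
end

ip, epoch = parse_line(SAMPLE_LINE)
puts "#{ip} #{Time.at(epoch).utc.iso8601}"
```

If your LogFormat puts the client IP or the timestamp in a different position, the field indexes (0, 3 and 4 here) would need to change to match.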

3 Reasons to Switch to Git from Subversion

Dozens of articles outline the detailed technical reasons Git is better than Subversion, but if you’re like me, you don’t necessarily care about minor speed differences, the elegance of back-end algorithms, or all of the hardcore features that you may only ever use once.  You want to see clear, major differences in your day-to-day interaction with software before you switch to something new.  After several weeks of trials, Git seems to offer major improvements over Subversion.  These are my reasons for jumping on the Git bandwagon.

Let’s start with a few assumptions for the scenarios we’ll walk through:

  • you’re one of many developers for a project
  • all changes going into production must first be peer-reviewed
  • you all use simple GUI text editors like TextMate or an equivalent
  • you have 4 features that you’re working on that are due soon

Let’s get to work.


Skip All Rails Filters

It took me a while to figure this out, but it’s quite simple.  If you want to skip all of the filters a Rails controller will run, simply put the following at the top of your controller:

skip_filter filter_chain #both documented in the Rails API

For example, if your application controller defines a filter to check if a user is logged in, it makes sense that this filter might run for all controllers, except in rare cases.  In my case, I have a dynamic image controller that doesn’t require all of the overhead that most controllers do.  For that controller, I use the above to skip all of the filters.