Archive for October, 2008

Ruby Script to Search Apache Logs for High Frequency Clients

Wednesday, October 29th, 2008

I wrote a quick Ruby script to scour my Apache access logs and look for IPs that are hitting my site too frequently, e.g., bad bots. The command line arguments are simple:

$ ruby find-frequent-clients.rb \
--apache-access-log=/path/to/your/log \
--seconds=3600 \
--request-limit=7200 \
--log-time-zone=PST

That command finds any client IPs that have hit my web server more than 7,200 times in the last hour (3,600 seconds), i.e., more than twice per second on average. The output is a line-separated list of IP addresses (optionally with a hit count if --show-count=1 is added). Here’s how it works:

File: find-frequent-clients.rb
  require 'date'
  require 'time'

  # Process command line arguments. Keep only args starting with "--",
  # so "--request-limit=7200" becomes args[:request_limit] = "7200".
  args = {}
  ARGV.each do |arg|
    key, value = arg.split("=", 2)
    if key && key[0..1] == "--"
      args[key[2..-1].gsub("-", "_").to_sym] = value
    end
  end

  # Check that we have the bare essentials to proceed
  raise "You must specify the full path to an Apache access log file with --apache-access-log" unless args[:apache_access_log]
  raise "You must specify the maximum amount of recent seconds to consider with --seconds" unless args[:seconds]
  raise "You must specify the maximum requests allowed per #{args[:seconds]} seconds with --request-limit" unless args[:request_limit]
  raise "You must specify the time zone of the Apache logs with --log-time-zone e.g., EST" unless args[:log_time_zone]
  raise "The Apache access log file specified does not exist or is not readable: #{args[:apache_access_log]}" unless FileTest.readable?(args[:apache_access_log])

  # Read the whole file, then walk the lines in reverse (newest first);
  # stop once time stamps fall outside our time threshold
  log_snapshot = File.readlines(args[:apache_access_log])
  log_array = []
  start_time = Time.now.to_i
  log_snapshot.reverse_each do |line|
    line_array = line.split(" ")
    # Field 3 is the "[date:time" part: drop the leading "[" and replace
    # the ":" between the date and the time with a space
    date_time = line_array[3][1..-1]
    date_time[11] = " "
    date_time = Time.iso8601(DateTime.parse(date_time + " " + args[:log_time_zone]).to_s).to_i
    if date_time > (start_time - args[:seconds].to_i)
      log_array << [line_array[0], date_time]
    else
      break
    end
  end

  # Use a hash to collect the counts of the IPs
  log_hash = Hash.new(0)
  log_array.each do |log|
    log_hash[log[0]] += 1
  end

  # Collect the offenders: IPs with more hits than the request limit
  offenders = log_hash.select { |_ip, count| count > args[:request_limit].to_i }

  # Output the offending IPs, 1 per line; optionally show the offending count
  offenders.each do |ip, count|
    puts args[:show_count] ? "#{ip} => #{count}" : ip
  end

Note: This makes the assumption that your logs are in the format: aa.bb.cc.dd - - [datetime]
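
To make that format assumption concrete, here is a minimal sketch of the parse-and-count idea run against a few made-up log lines. The names sample_lines, parse_line, and frequent_clients are hypothetical, and the sketch differs from the script above in two ways: it assumes the bracketed time stamp is followed by a UTC offset (as in Apache's common log format) and reads the offset from the line itself rather than from --log-time-zone, and it uses a fixed "now" so the result is repeatable.

```ruby
require 'time'

# Made-up sample lines in the assumed layout:
#   ip - - [dd/Mon/yyyy:HH:MM:SS +zone] "request" status bytes
sample_lines = [
  '10.0.0.1 - - [29/Oct/2008:10:15:30 -0700] "GET / HTTP/1.1" 200 512',
  '10.0.0.2 - - [29/Oct/2008:10:15:31 -0700] "GET /a HTTP/1.1" 200 128',
  '10.0.0.1 - - [29/Oct/2008:10:15:32 -0700] "GET /b HTTP/1.1" 200 256',
  '10.0.0.1 - - [29/Oct/2008:10:15:33 -0700] "GET /c HTTP/1.1" 200 256'
]

# Return [ip, epoch_seconds] for one log line (hypothetical helper).
def parse_line(line)
  fields = line.split(" ")
  # fields[3] is "[29/Oct/2008:10:15:30" and fields[4] is "-0700]".
  stamp = "#{fields[3].delete('[')} #{fields[4].delete(']')}"
  [fields[0], Time.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").to_i]
end

# Count hits per IP within the last `seconds` before `now` and return
# a hash of the IPs whose count exceeds `request_limit`.
def frequent_clients(lines, now, seconds, request_limit)
  counts = Hash.new(0)
  lines.each do |line|
    ip, epoch = parse_line(line)
    counts[ip] += 1 if epoch > now - seconds
  end
  counts.select { |_ip, count| count > request_limit }
end

# Fixed "now" (one second after the last sample line) for the example.
now = Time.utc(2008, 10, 29, 17, 15, 34).to_i
offenders = frequent_clients(sample_lines, now, 60, 2)
offenders.each { |ip, count| puts "#{ip} => #{count}" }
# prints "10.0.0.1 => 3"
```

With a 60 second window and a limit of 2, only 10.0.0.1 is reported because it made 3 requests. Unlike the script, this sketch scans every line instead of reading in reverse and stopping early, which is fine for a handful of lines but slower on a large log.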

3 Reasons to Switch to Git from Subversion

Saturday, October 18th, 2008

Dozens of articles outline the detailed technical reasons Git is better than Subversion, but if you’re like me, you don’t necessarily care about minor speed differences, the elegance of back-end algorithms, or all of the hardcore features that you may only ever use once.  You want to see clear, major differences in your day-to-day interaction with software before you switch to something new.  After several weeks of trials, Git seems to offer major improvements over Subversion.  These are my reasons for jumping on the Git bandwagon.

Let’s start with a few assumptions for the scenarios we’ll walk through:

  • you’re one of many developers for a project
  • all changes going into production must first be peer-reviewed
  • you all use simple GUI text editors like TextMate or an equivalent
  • you have 4 features that you’re working on that are due soon

Let’s get to work.

Endless, Easy, Non-File-System-Based, Local Branches

You’d like to work on each of your 4 features A, B, C, and D independently and somewhat in parallel, though B looks like a quick win.  Let’s compare the branching features offered by both Git and Subversion side-by-side as we get going:

Task 1. Get a copy of the project on your local machine.

  Git:
    git clone /srv/repos /local/copy

  Subversion:
    svn checkout /srv/repos /local/copy

Task 2. Create branches A-D to represent the features you’re working on.

  Git:
    git checkout -b A
    git checkout -b B
    git checkout -b C
    git checkout -b D

  Subversion:
    svn copy /srv/repos/trunk /srv/repos/branches/A
    svn checkout /srv/repos/branches/A /local/copy/branches/A

    svn copy /srv/repos/trunk /srv/repos/branches/B
    svn checkout /srv/repos/branches/B /local/copy/branches/B

    svn copy /srv/repos/trunk /srv/repos/branches/C
    svn checkout /srv/repos/branches/C /local/copy/branches/C

    svn copy /srv/repos/trunk /srv/repos/branches/D
    svn checkout /srv/repos/branches/D /local/copy/branches/D

Task 3. Feature B is very simple and you want to knock it out and get it into production ASAP.

  Git:
    git checkout B
    [work in editor]
    git commit -a

    [peer review]
    git format-patch
    git send-email [options will vary]
    [peer gives you "thumbs up"]

    git checkout master
    git merge B
    git push

  Subversion:
    [open text editor for branch B]
    [work in editor]
    svn commit

    [peer review]
    [send email to peer with branch name]
    [peer checks out your branch locally to review]
    [peer gives you "thumbs up"]

    cd /local/copy/trunk
    svn merge /local/copy/branches/B .
    svn commit

Task 4. Get rid of unnecessary branch B.

  Git:
    git branch -d B

  Subversion:
    svn delete /srv/repos/branches/B
    svn update


Note the key advantages Git offered in each step:

  1. Git creates a full repository with this command.  With Subversion, you’re just checking out the files in the repository.
  2. With each branch, no new files are created in the project file hierarchy on your system.  Since you have a full local repository, Git creates the files you need on the fly by processing the recorded changes.  With Subversion, you have to create every branch remotely on the server.  This can get messy depending on the size of your team.  If you decide to control branching to keep things clean, you forfeit the power branching offers.
  3. With Git, we only push our work to the server AFTER collaboration (more below).  With Subversion, it all hits the server.
  4. Again, no file system work.  Since we’re using a local repository, we let Git handle the details of removing the branch.  With Subversion, you still have the old copy until you update.  You either have to clean up manually, or “update” to clean up local and remote copies.

In addition, try to do this scenario on your laptop while not connected to the Internet.  With Git, there are no issues because the repository is local; with Subversion, you’re out of luck and your new branches will have to wait. The advantages of Git are clear even in this simple branching scenario.  Let’s continue with the non-trivial features A, C, and D that we’re working on.

Stashing Temporary Work

You start working on A and you’re about 100 lines of code into it when you get stumped on a math function.  The math whiz on your team is out for the day and you’d rather not continue until you consult him.  You’ve got some ideas for C, so you decide to shift gears and get started.

Task 1. Switch to branch A, write 100 lines of code.

  Git:
    git checkout A

  Subversion:
    [open text editor for branch A]

Task 2. Switch to branch C while waiting on a co-worker’s advice for A.

  Git:
    git stash
    git checkout C

  Subversion:
    [close text editor for branch A]
    [open text editor for branch C]

Task 3. Work on C for a while, get advice from co-worker and resume work on A.

  Git:
    git stash
    git checkout A
    git stash list
    git stash apply [stash name]

  Subversion:
    [close text editor for branch C]
    [open text editor for branch A]

At a glance, you might get the impression that Subversion is simpler, and you’re probably right.  However, this is one case where simple may not be what you’re looking for.  Let’s look at each step:

  1. The key thing to note in this and every step is how we switch between branches.  In Git, the repository handles this. With Subversion, you’re literally just working on a separate set of files.  Ultimately, it’s up to you to manage retrieving and editing these files.  If you’re using TextMate, you’ll probably save a TextMate project file every time you branch simply to give you quick access to the branch.  If you branch a lot, this quickly becomes annoying, time consuming, and unproductive.  With Git, when you check out a separate branch, it “magically” changes all of the files on your file system for you.  That means 1 project file is all you ever need.  Git handles the rest.
  2. Git will “float” uncommitted changes.  This means that if you simply did a “git checkout C,” you’d bring with you all of the uncommitted work you did for A.  However, you don’t want to commit A because it’s not in a good working state.  Instead, you “stash” your work.  A stash is like a work-in-progress commit.  Using it will tuck away your WIP changes without a formal commit, which allows you to change to C without “floating” any of your A changes.  The Subversion method is simpler, but you could potentially end up with several half-baked branches and no record of when you abandoned them.  Git’s stash allows you to list all stashes, and even write a message when you stash.  It is far more powerful in this scenario.
  3. Same as 2, but shows the “git stash list” feature.

So now we can work smoothly between multiple branches without worrying about the consequences of interruption.  Git thus far has shown immense strength in two key areas, but let’s revisit collaboration to seal the deal.

Collaboration Before Public Commits

It’s now weeks later and you’re working on D.  After some round table discussions with the team, you all agree that D may not be the best approach.  A co-worker starts working on his own branch E and a few days later wants to review it with you.

Task 1. Review co-worker’s suggested changes in his branch E.

  Git:
    [check your email for patch]
    [review patch]

  Subversion:
    svn checkout /srv/repos/branches/E /local/copy/branches/E
    [open text editor for branch E]
    svn log [to find changes]
    svn diff [to view changes]
    [review branch]

Task 2. Agree that E is better and destroy your branch D.

  Git:
    git branch -D D

  Subversion:
    svn delete /srv/repos/branches/D
    svn update

Git offers power by putting collaboration up front, before commits are public for all to see.  Consider each step:

  1. Git has a nice feature to create “patches.”  They are simply changes to code, very similar to a diff.  The idea is that you create patches from commits you’ve only made on your local copy of the repository.  When your co-worker sent you the patch for E, no one else on the team had to see his commit logs, branches, etc., in the public repository because they never existed there.  You are collaborating about E via emailed patches.  With Subversion, it’s all on the server, all the time.
  2. With Git, deleting an abandoned branch is simple and clean.  The work done in D will never be seen by the public, i.e., your team.  You’ve spared the team clutter both in the logs and on their file systems.  With Subversion, the cleanup is on you.  Should you forget to delete D, it has the potential to get used and that could be bad.  Conversely, someone else may have quietly checked out D and been working on something.  When you delete it from the public repository, their commit will surely fail.

Conclusion

There are literally hundreds of features for both Git and Subversion.  While you may have detailed reasons to choose one over the other, I think these 3 high level reasons are strongly convincing in favor of Git.  If you have differing opinions, I’d love to hear them.

Skip All Rails Filters

Tuesday, October 14th, 2008

It took me a while to figure this out, but it’s quite simple.  If you want to skip all of the filters a Rails controller will run, simply put the following at the top of your controller:

  skip_filter filter_chain # both documented in the Rails API

For example, if your application controller defines a filter to check if a user is logged in, it makes sense that this filter might run for all controllers, except in rare cases.  In my case, I have a dynamic image controller that doesn’t require all of the overhead that most controllers do.  For that controller, I use the above to skip all of the filters.