Replacing –, ’, “, etc., with UTF-8 Characters in Ruby on Rails

Recently I upgraded some older Rails applications to Rails 3.1 and Ruby 1.9.2 (from 2.3 and 1.8.7 respectively). One post-upgrade issue was that text content was littered with garbage like –, ’, “, etc. Here’s an actual example from a comment in one of the applications:

One of my “things to do before I’m 50†is

This should read:

One of my “things to do before I’m 50” is

It turns out these are just special characters that were improperly encoded for UTF-8. The fix is simple enough: loop through your content and replace where needed.
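To see where the garbage comes from, here is a minimal sketch that assumes the original mistake was UTF-8 bytes being read as Windows-1252 (the usual cause of this exact pattern, but an assumption about your data). The helper names garble and ungarble are made up for illustration.

```ruby
# encoding: utf-8

# Reproduce the problem: treat the UTF-8 bytes of a string as if they were
# Windows-1252 text, then convert that misreading back to UTF-8.
def garble(text)
  text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Undo it: convert back to Windows-1252 bytes and relabel them as UTF-8.
# Assumes every garbled character has a Windows-1252 mapping.
def ungarble(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

puts garble("I’m")       # prints: I’m
puts ungarble("I’m")   # prints: I’m
```

The round trip is not a drop-in replacement for a lookup table: the third byte of the curly close quote (0x9D) has no Windows-1252 character, so a string containing it may raise a conversion error instead of being repaired.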

If your database is big, this could take a long time unless you disable callbacks. The script below highlights both how to replace the characters using Ruby and how to disable your Rails callbacks to make this script run in seconds instead of hours (depending on the complexity of your callbacks).

# encoding: utf-8
replacements = []
replacements << ['â€¦', '…']           # ellipsis
replacements << ['–', '–']           # en dash
replacements << ['’', '’']           # curly apostrophe
replacements << ['“', '“']           # curly open quote
replacements << [/â€[[:cntrl:]]/, '”'] # curly close quote
klasses = [Comment, Article]           # replace with relevant classes

klasses.each do |klass|
  klass.find_each do |obj|             # loads records in batches rather than all at once
    original = obj.body
    replacements.each{ |set| obj.body = obj.body.gsub(set[0], set[1]) }
    unless (original == obj.body)
      #### Remove or Customize ####
      # This should reflect your models' callbacks.  It should be safe
      # since we're just doing a simple find/replace.
      Comment.skip_callback(:save, :after,  :do_after_save_tasks ) if obj.is_a?(Comment)
      Article.skip_callback(:save, :before, :do_before_save_tasks) if obj.is_a?(Article)
      #### End Remove or Customize ####
      obj.save!
    end
  end
end

You may have noticed that I used a regular expression for the curly close quote. This is because its garbled form ends in an invisible control character that is not easily copied and pasted into your code. Using [[:cntrl:]] is just an easier way to catch it.

CrashPlan for Large, Distributed, Cheap, Off-Site Backup

In the early 90s, my friend’s father took me to EDS where he worked at the time.  I remember him saying, “this is one of the largest data centers in the world.  They have over 3 terabytes of data in there.”  In the homemade box tucked away quietly in my hall closet is a 6x1TB RAID with another 1TB disk for the OS.  Add in the media center and 3 laptops and I’ve got a lot of data just waiting to be lost with a disk failure, theft, or an accidental rm -rf.

What I Need in a Backup Solution

As I thought about my data, I came up with a few criteria before I started scouring the net for a solution.

  1. No constraints on backup size – The data I want to back up exceeds 2TB and is growing.  I’ve used cool apps like DropBox that have arbitrary upper limits like 100GB.  However, even the coolest app won’t do me any good if I can’t back up everything I need. (To be fair, backup is just one tiny element of what DropBox does.  I highly recommend that app for the other things it does, like sync.)
  2. Highly configurable – That 2TB I mentioned lives amongst tons of other stuff that I keep as sort of a cache, but wouldn’t miss too much if it got deleted.  I need to be able to clearly specify what data I actually want backed up.  Moreover, I need a high degree of control over backup policies, security, etc.  I like solutions that make things simple, but in this case there also needs to be a way to get as complicated as I like.
  3. Distributed backups – Part of the reason I have that 6x1TB RAID array is for super-fast local backup.  Obviously that won’t do me any good if my house burns down, but if a laptop crashes it is way easier to grab 500GB from a local machine than it is to pull it across the net.  I want to be able to back up to a service as well as to many other computers that I specify, both in my house and on the Internet.
  4. Smart, low profile application – Modern OS’s like Mac OS X keep a log of what files have changed.  I don’t want a dumb service that does things like that on its own and consumes my computers’ resources.  I need something that will run in the background and not make any noise.
  5. Accessibility – I need a service that runs on any platform, specifically Mac OS X and Linux.  Moreover, I need to be able to access my backups from the web.
  6. Cheap – I want to pay for storage, not bandwidth.  Less than $10/mo is my general rule of thumb.

There are other minor points, but those are the non-negotiable items.


5TB LVM Volume with an LSI 9265-8i RAID Controller

This article outlines how to get a 5TB LVM volume created with an LSI 9265-8i RAID controller.

Background

RAID Array

I’ve been running software RAID for a while. Specifically, I’ve got an ASUS P6T Deluxe V2 motherboard with 6 SATA ports. Up until now, I’ve had 1 SATA connected to a single 1 TB drive with the Fedora OS on it, one to a SATA DVD/Blu-ray drive, and the other 4 to a 4x1TB software RAID 5. This has worked great. When I started to max that out, I had a decision to make. It seems I could either:

  • Continue with the small array and just continue to increase the disk size.  This is easiest, but given that 4 disks in RAID 5 give you a 25% loss of storage space (i.e., 3 used, 1 for parity), you have to buy bigger disks and the biggest ones usually cost the most.
  • Make the 1-time investment to get an 8-port RAID card and grow the array with disks that are large, but not necessarily the largest out there.

I decided the latter made more sense for me and went with the LSI 9265-8i based on various reviews.  My plan was to build a 6x1TB SATA array (5TB storage) with 2 available ports on which I could add 2 additional drives when/if needed.


Ruby Script to Search Apache Logs for High Frequency Clients

I wrote a quick Ruby script to scour through my Apache access logs and look for IPs that are hitting my site too frequently, e.g., bad bots. The command line arguments are simple:

$ ruby find-frequent-clients.rb \
--apache-access-log=/path/to/your/log \
--seconds=3600 \
--request-limit=7200 \
--log-time-zone=PST

That command finds any client IPs that have hit my web server more than 7200 times in the last hour (3600 seconds), i.e., more than twice per second on average. The output is a newline-separated list of IP addresses (optionally with a hit count if --show-count=1 is added). Here’s how it works:

File: find-frequent-clients.rb
require 'date'
require 'time'
# Process command line arguments.  Filter only args starting with --
args = {}
$*.each do |arg|
  spl=arg.split("=")
  if spl[0][0..1] == "--"
    args[spl[0][2..spl[0].length-1].gsub("-","_").intern]=spl[1]
  end
end

# Check that we have the bare essentials to proceed
raise "You must specify the full path to an Apache access log file with --apache-access-log" unless args[:apache_access_log]
raise "You must specify the maximum amount of recent seconds to consider with --seconds" unless args[:seconds]
raise "You must specify the maximum requests allowed per #{args[:seconds]} seconds with --request-limit" unless args[:request_limit]
raise "You must specify the time zone of the Apache logs with --log-time-zone e.g., EST" unless args[:log_time_zone]
raise "The Apache access log file specified does not exist or is not readable: #{args[:apache_access_log]}" unless FileTest.readable?(args[:apache_access_log])

# Open the file and read the lines in reverse; exit once time stamps are beyond our time threshold
file = File.open(args[:apache_access_log], "r")
log_array = []
log_snapshot = file.readlines
file.close
start_time = Time.now.to_i
log_snapshot.reverse_each do |line|
  line_array = line.split(" ")
  date_time = line_array[3][1..line_array[3].length-1]
  date_time[11] = " "
  date_time = Time.iso8601(DateTime.parse(date_time+" "+args[:log_time_zone]).to_s).to_i
  if date_time > (start_time - args[:seconds].to_i)
    log_array << [line_array[0], date_time]
  else
    break
  end
end

# Use a hash to collect the counts of the IPs
log_hash = Hash.new(0)
log_array.each do |log|
  log_hash[log[0]]+=1
end

# collect the offenders in an array
offenders = log_hash.to_a.collect{|h| h if h[1] > args[:request_limit].to_i}.compact

# output the offending IPs, 1 per line; optionally show the offending count
offenders.each{|o| puts o[0].to_s+"#{" => "+o[1].to_s if args[:show_count]}"}

Note: This makes the assumption that your logs are in the format: aa.bb.cc.dd - - [datetime]
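For comparison, here is a smaller sketch of the same idea that reads the UTC offset Apache writes into each line, rather than taking a time zone on the command line. It is a variation, not the script above: it assumes the common/combined log format, and parse_log_line and frequent_clients are made-up names.

```ruby
require 'time'

# Return [ip, epoch_seconds] for one access-log line such as:
#   203.0.113.5 - - [10/Oct/2000:13:55:36 -0800] "GET / HTTP/1.0" 200 2326
def parse_log_line(line)
  fields = line.split(" ")
  stamp = "#{fields[3]} #{fields[4]}"   # "[10/Oct/2000:13:55:36 -0800]"
  [fields[0], Time.strptime(stamp, "[%d/%b/%Y:%H:%M:%S %z]").to_i]
end

# Count hits per IP within the last `seconds`, walking the log backwards and
# stopping at the first line that is too old. Returns a hash of ip => count
# containing only the IPs above `request_limit`.
def frequent_clients(lines, now, seconds, request_limit)
  counts = Hash.new(0)
  lines.reverse_each do |line|
    ip, epoch = parse_log_line(line)
    break unless epoch > now - seconds
    counts[ip] += 1
  end
  counts.select { |_ip, count| count > request_limit }
end
```

Calling frequent_clients(File.readlines(path), Time.now.to_i, 3600, 7200) would correspond to the command shown earlier, with path standing in for your log file.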

3 Reasons to Switch to Git from Subversion

Dozens of articles outline the detailed technical reasons Git is better than Subversion, but if you’re like me, you don’t necessarily care about minor speed differences, the elegance of back-end algorithms, or all of the hardcore features that you may only ever use once.  You want to see clear, major differences in your day-to-day interaction with software before you switch to something new.  After several weeks of trials, Git seems to offer major improvements over Subversion.  These are my reasons for jumping on the Git bandwagon.

Let’s start with a few assumptions for the scenarios we’ll walk through:

  • you’re one of many developers for a project
  • all changes going into production must first be peer-reviewed
  • you all use simple GUI text editors like TextMate or an equivalent
  • you have 4 features that you’re working on that are due soon

Let’s get to work.


Multiple Remote Git Branches With Different Local Names

I pounded my head against the wall for a bit when trying to play out this scenario in Git:

  • Remote repository has two branches: master and some-long-complex-name
  • Locally, I have cloned master
  • I have set up my config to refer to the remote master as “origin”
  • I have checked out some-long-complex-name using the following:
$ git checkout --track -b simple-name origin/some-long-complex-name

The key thing to note is that my local branch has a different name than the remote branch, i.e., “simple-name” is my local branch that’s tracking the remote branch “some-long-complex-name”.  I’ve used pseudo-branch-names in this example, but in practice I like my local branch names to be 3 characters or so such that they’re easy to type, and I like the remote branches to have long names such that there is no ambiguity about what they are.

My initial .git/config looked like this after the aforementioned checkout command:

[remote "origin"]
        url = ssh://myserv/srv/git/proj.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
[branch "simple-name"]
        remote = origin
        merge = refs/heads/some-long-complex-name

Now, this isn’t too bad.  All pull related commands work, but push only works for the master branch.  What I wanted was git to push to “some-long-complex-name” whenever I ran “git push” from my local “simple-name” branch.

I changed .git/config to look like this:

[remote "origin"]
        url = ssh://myserv/srv/git/proj.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
[remote "simple_origin"]
        url = ssh://myserv/srv/git/proj.git
        fetch = +refs/heads/*:refs/remotes/origin/*
        push = simple-name:some-long-complex-name
[branch "simple-name"]
        remote = simple_origin
        merge = refs/heads/some-long-complex-name

Note the additional “remote” section and the “push” reference.  Now, when I’m in my simple-name branch, I can just type “git push” and this branch will push out to the remote branch named “some-long-complex-name”.

Thanks to those who offered help on the freenode IRC #git channel, namely RandalSchwartz.

Installing Gitweb on Fedora Linux and Apache

My next natural step after getting my projects up and running with Git was to install a web interface. Gitweb was my choice because:

  • it’s available via yum with Fedora
  • it provides up-to-date diff information
  • it’s part of the overall Git package, so it’s tightly integrated

Installation was ultimately quite simple, but I found the install docs to be less than helpful for people like me who want immediate functionality and will get to the tweaks and details later.

Step 1: Install Gitweb

sudo yum install gitweb

This will install a few files at /var/www/git.  You shouldn’t need to do anything to them.

Step 2: Create /etc/gitweb.conf

You need a configuration file to tell Gitweb where to look for your project.  You can change this folder to wherever your project will be.

$ echo "\$projectroot = '/srv/git/';" > /etc/gitweb.conf

Step 3: Edit Apache Configuration File

This configuration file assumes you are running your site as a virtual host.

/etc/httpd/conf.d/git.conf

<VirtualHost *:80>
    DocumentRoot /var/www/git
    ServerName git.yourproject.com

    <Directory /var/www/git>
        Allow from all
        AllowOverride all
        Order allow,deny
        Options ExecCGI

        <Files gitweb.cgi>
            SetHandler cgi-script
        </Files>
    </Directory>

    DirectoryIndex gitweb.cgi
    SetEnv  GITWEB_CONFIG  /etc/gitweb.conf
</VirtualHost>

Step 4: Tweak Your Repository’s Config File

Gitweb lists two key elements at the start of your project’s page: description and owner.  To have these display something appropriate, edit /srv/git/yourproject.git/description:

My Awesome Project

… and add this to /srv/git/yourproject/.git/config:

[gitweb]
        owner = "Mark McBride"

Step 5: Restart Apache

That’s it.  Just restart Apache and you should find Gitweb running at the domain you’ve specified.


Migrating a Subversion (svn) Project and Server to Git

I’m sold on Git. The branching feature alone was reason enough for me to move from Subversion. However, the decision to move was the easy part. Migrating my projects, while not too painful, wasn’t trivial. I found that my knowledge of svn was actually a disadvantage as it made it easy to assume things about Git that simply weren’t true. This is ultimately how I went about my migration.

First, some assumptions:

  • My projects are small. By small I mean that everyone working on code has shell access to the servers involved. It’s not open sourced, public, or any of that cool stuff. If you need to do something for large scale access, consider GitHub or gitosis.
  • I have a semi-centralized need. None of the people on my projects are co-located. There is a definite need for a place to post code for review that everyone can access.
  • I have a functioning, in-production Subversion repository and the transition must be seamless.
  • My apps are Rails applications deployed using Capistrano.
  • All of my servers/clients are running Fedora 9.
  • You understand the basics of Git.

That said, let’s dive in.

Step 1: Install Git

This is the easy step:

[local]$ sudo yum install git git-svn

Step 2: Convert your Central Subversion Repository to a Local Git Repository

This is the key step to wrap your mind around conceptually. With svn you would first make a central repository and import something into it to get started. We’re about to do quite the opposite. Remember that in Git every instance is a repository. When we grab the contents of your existing project, we will be building the new repository locally, not on your server (that will come later).

Build a text file list of your existing authors

In order to maintain some continuity with your existing svn logs, we need to peg the svn user names of people who have committed to your svn repository to git user names. This is quite easy to do. Just create a text file with lines that look like this:

markmcb = Mark A. McBride <mark@markmcb.com>
example = Example Person <person@example.com>
... etc.

Save this file. I’ll refer to it later as svn-to-git-authors. (If you have a lot of authors, check out Josh’s script to automate the creation of this file.)
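If you would rather generate a starting point yourself, the sketch below pulls the unique user names out of the output of svn log -q and emits one line per author. It is not the script mentioned above; authors_file_lines is a made-up name, and the full names and the example.com addresses it writes are placeholders you would edit by hand.

```ruby
# Build svn-to-git-authors lines from the text printed by `svn log -q`.
# Revision lines in that output look like:
#   r42 | markmcb | 2008-07-01 12:00:00 -0500 (Tue, 01 Jul 2008)
def authors_file_lines(svn_log_text, domain = "example.com")
  users = svn_log_text.each_line.map do |line|
    match = line.match(/\Ar\d+ \| (.+?) \| /)
    match && match[1]
  end
  # Placeholder name and address: replace with each person's real details.
  users.compact.uniq.sort.map { |user| "#{user} = #{user} <#{user}@#{domain}>" }
end

sample = <<LOG
------------------------------------------------------------------------
r2 | markmcb | 2008-07-02 09:30:00 -0500 (Wed, 02 Jul 2008)
------------------------------------------------------------------------
r1 | example | 2008-07-01 12:00:00 -0500 (Tue, 01 Jul 2008)
------------------------------------------------------------------------
LOG

puts authors_file_lines(sample)
```

Redirect the output into svn-to-git-authors and fix up the names and e-mail addresses before running the clone.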

Clone your Subversion database to a Git repository

This next step is so nice. With one command and the help of the file we just created, we’ll create an almost ready to use Git repository. The command is simple: git svn clone “what” “where”. Or something like:

[local]$ cd /path/of/your/liking
[local]$ git svn clone svn+ssh://yourserver.example.com/path/to/your/repository \
             ./myrepos.git --authors-file=svn-to-git-authors

Hit return and relax as the magic happens. Depending on the size of your repository, this could take some time.

Set some basic configuration options

There are hundreds of configuration options for Git, but I’m only going to touch on a few critical ones. Specifically, let’s tell our new Git repository who we are and set the stage for working with a remote, semi-centralized repository.

[user]
name = Mark A. McBride
email = mark@markmcb.com

The user settings are pretty straightforward. Just ensure they match what you had before in the authors file and it’ll be very easy for you to keep track of who has done what. In addition, you may see a section relating to svn. Once you no longer need to pull data from that repository (which, unless your repository is busy, is right now), you can delete this section.

From here you’re ready to get to work. You have a functional repository. However, if you plan to work with anyone other than yourself, you may need to interact with a public repository.

Step 3: Setup a Public Repository

The steps to establish a repository that you can access over ssh are pretty simple. Just ssh to the public server and do the following (depending on the folder you do this in, you may need to set up write permissions):

[publicsrv]$ cd /srv/git
[publicsrv]$ mkdir publicproject.git
[publicsrv]$ cd publicproject.git
[publicsrv]$ git --bare init

That’s all you need to do on the server. The critical thing to note is that bare reference. This tells Git that there is no working copy, i.e., no checked-out files for you to edit. The repository stores the full history, but nothing is checked out on the server (though anyone can clone this repository and get the files).

Point Your Repository to the Public One

Back on your local machine, you just need to run one command to make your repository aware of this newly created public version:

[local]$ git remote add origin ssh://publicsrv.example.com/srv/git/publicproject.git

Now you have a remote repository named origin from which your local repository can fetch its data. Look in your .git/config file for details. The last step is simply to push the files you have in your local copy to the server.

[local]$ git push origin master

After running this, anyone on your team with an ssh account can clone the repository with:

[local]$ git clone ssh://publicsrv.example.com/srv/git/publicproject.git

If you’re lazy like me, and just want to be able to type git push/pull instead of typing out the public server’s name each time, add the following to your .git/config file:

[branch "master"]
        remote = origin
        merge = refs/heads/master

And with that, you’re done with the repository migration.

Step 4: Final Rails Tweaks

Your Git work is done.  These last items are final notes to make your new repository play nice with your Rails app.

Tell Capistrano about Git

The very last thing you have to do is tell Capistrano to pull your Rails app out of a Git repository during deployment rather than from Subversion.  This is quite simple.  In your deploy.rb file, add this line:

set :scm, :git

Also, be sure to set your repository URL to the new location.

Ignore logs and temporary files

You may need to create some of the directories depending on how your svn repository was set up.  Insert empty .gitignore files in them to ensure Git doesn’t ignore them.

[local]$ mkdir tmp
[local]$ mkdir log
[local]$ mkdir vendor
[local]$ touch tmp/.gitignore log/.gitignore vendor/.gitignore

Add the following to .gitignore in your root folder to ignore standard Rails files that you don’t want in your repository:

.DS_Store
log/*.log
tmp/**/*
config/database.yml
db/*.sqlite3

That’s it.  Your Rails app is now ready to modify and deploy from Git.


Linux Server Load With Top

I always find it interesting when I believe something to be true for years and then one day someone says, “that’s incorrect.” Today was one of those days.

In Unix, Linux, OS X, or most any good operating system, you’ll find a standard command called top. It’s a simple program that, according to the manual, “provides a dynamic real-time view of a running system.” At the top of the top (heh) output is a header that looks like:

top - 15:43:10 up 14 days, 15:25,  4 users,  load average: 0.27, 0.12, 0.10
Tasks: 106 total,   2 running, 104 sleeping,   0 stopped,   0 zombie

Note the top right load average numbers. I had always thought that those numbers represented the percent load on the processor averaged over the last minute, 5 minutes, and 15 minutes respectively. Well, I was right about the time intervals, but wrong about the information. The load actually reflects system load such that a value of 1 means the processor on average had 1 process waiting to run, i.e., was loaded 100%. So, in the values above, the CPU was only 27% busy over the last minute, 12% in the last five, and 10% in the last 15. Anything over 1 means that there is an overload and a queue of pending processes.

This would explain why I saw values of 40 when OmniNerd was last Slashdotted. I thought the server was doing well with a 40% load. In reality it had a 4000% load and was dropping requests left and right.
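The arithmetic is easy to script. The sketch below pulls the three averages out of a header line like the one above and expresses one of them as a percentage of capacity. The method names are made up, and the cores parameter is an addition of mine: the load average is not divided by the number of CPUs, so on a multi-core machine a load of 1 is less than full capacity.

```ruby
# Extract the 1-, 5-, and 15-minute load averages from top's header line.
# Returns nil if the line has no "load average:" section.
def load_averages(header_line)
  match = header_line.match(/load average: ([\d.]+), ([\d.]+), ([\d.]+)/)
  match && match.captures.map(&:to_f)
end

# Express a load average as a percentage of what `cores` CPUs can run at once.
def load_percent(load, cores = 1)
  (load.to_f / cores * 100).round
end

header = "top - 15:43:10 up 14 days, 15:25,  4 users,  load average: 0.27, 0.12, 0.10"
one_minute = load_averages(header).first
puts load_percent(one_minute)   # prints: 27
puts load_percent(40)           # prints: 4000
```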

Oh well. Live and learn.