Replacing â€“, â€™, â€œ, etc., with UTF-8 Characters in Ruby on Rails

Recently I upgraded some older Rails applications to Rails 3.1 and Ruby 1.9.2 (from 2.3 and 1.8.7 respectively). One post-upgrade issue was that text content had a lot of garbage showing up like â€“, â€™, â€œ, etc. Here’s an actual example from a comment in one of the applications:

One of my â€œthings to do before Iâ€™m 50â€ is

This should read:

One of my “things to do before I’m 50” is

It turns out these are just special characters whose UTF-8 bytes were misread as a single-byte encoding (the â€ prefix is typical of UTF-8 text decoded as Windows-1252). The fix is simple enough: loop through your content and replace where needed.

If your database is big, this could take a long time unless you disable callbacks. The script below highlights both how to replace the characters using Ruby and how to disable your Rails callbacks to make this script run in seconds instead of hours (depending on the complexity of your callbacks).

replacements = []
replacements << ['â€¦', '…']         # ellipsis
replacements << ['â€“', '–']         # en dash
replacements << ['â€™', '’']         # curly apostrophe
replacements << ['â€œ', '“']         # curly open quote
replacements << [/â€[[:cntrl:]]/, '”'] # curly close quote
klasses = [Comment, Article]           # replace with relevant classes

klasses.each do |klass|
  klass.all.each do |obj|
    original = obj.body
    replacements.each{ |set| obj.body = obj.body.gsub(set[0], set[1]) }
    unless (original == obj.body)
      #### Remove or Customize ####
      # This should reflect your models' callbacks.  It should be safe
      # since we're just doing a simple find/replace.
      Comment.skip_callback(:save, :after,  :do_after_save_tasks ) if obj.is_a?(Comment)
      Article.skip_callback(:save, :before, :do_before_save_tasks) if obj.is_a?(Article)
      #### End Remove or Customize ####
      obj.save!
    end
  end
end

You may have noticed that I used a regular expression for the curly close quote. This is because its garbled form ends in an invisible control character that is not easily copied and pasted into your code. Using [[:cntrl:]] is just an easier way to catch it.
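If pasting the garbled characters into a source file is awkward, the same table can be written with Unicode escapes. This is only a sketch of the find/replace step, separate from the Rails loop above. It assumes the garbage came from UTF-8 bytes read as Windows-1252. The names MOJIBAKE_FIXES and repair_mojibake are made up for illustration.

```ruby
# Hypothetical names; the pairs mirror the replacements array in the script.
# Each garbled sequence starts with U+00E2 U+20AC (the "â€" prefix).
MOJIBAKE_FIXES = [
  ["\u00E2\u20AC\u00A6", "\u2026"],        # ellipsis
  ["\u00E2\u20AC\u201C", "\u2013"],        # en dash
  ["\u00E2\u20AC\u2122", "\u2019"],        # curly apostrophe
  ["\u00E2\u20AC\u0153", "\u201C"],        # curly open quote
  [/\u00E2\u20AC[[:cntrl:]]/, "\u201D"]    # curly close quote (ends in a control character)
]

# Apply every replacement in turn and return the repaired copy of the text.
def repair_mojibake(text)
  MOJIBAKE_FIXES.inject(text) { |fixed, (bad, good)| fixed.gsub(bad, good) }
end

garbled = "One of my \u00E2\u20AC\u0153things to do before I\u00E2\u20AC\u2122m 50\u00E2\u20AC\u009D is"
puts repair_mojibake(garbled)
```

Text that contains none of the garbled sequences comes back unchanged, so the helper is safe to run over rows that are already clean.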

Integer Compression in Ruby (Base-10 to Base-62)

A few days ago I was thinking about all those link shortening sites and wondered how easy it would be to compress a base-10 number like 1,234,567,890 to something much smaller like 1LY7VK. Here’s what I came up with:

class IntegerCompressor

  CompressionCharacterSet = %w(0 1 2 3 4 5 6 7 8 9
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  a b c d e f g h i j k l m n o p q r s t u v w x y z)

  def self.to_base
    CompressionCharacterSet.length
  end

  def self.compress(number_to_convert)
    digits_needed = Math.log(number_to_convert, IntegerCompressor.to_base).floor + 1
    compressed_number_string = ''
    previous_remainder = number_to_convert
    (digits_needed-1).downto(0) do |power|
      r=previous_remainder.divmod(IntegerCompressor.to_base**power)
      compressed_number_string << CompressionCharacterSet[r[0]]
      previous_remainder = r[1]
    end
    compressed_number_string
  end

  def self.decompress(compressed_number)
    power = 0
    base_10_integer = 0
    compressed_number.to_s.reverse.each_char do |digit|
      base_10_integer += ((IntegerCompressor.to_base**power)*CompressionCharacterSet.index(digit))
      power+=1
    end
    base_10_integer
  end

end

It seems to work:

$ irb
>> require '/path/to/file/integer_compressor.rb'
=> true
>> IntegerCompressor.compress 1234567890
=> "1LY7VK"
>> IntegerCompressor.decompress "1LY7VK"
=> 1234567890

If anyone has a more elegant solution, I'd be curious to see it.

(Thanks to Tamara for the log refresher.)
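One possible alternative, offered as a sketch rather than a replacement: repeated divmod avoids the logarithm entirely. That sidesteps two edge cases in the version above. Zero has no logarithm, so compress(0) raises, and a floating-point log can land a hair under a whole number near exact powers of the base. The module name Base62 and its method names are hypothetical.

```ruby
# Hypothetical module; uses the same character set and ordering as IntegerCompressor.
module Base62
  DIGITS = [*'0'..'9', *'A'..'Z', *'a'..'z'].freeze
  BASE   = DIGITS.length # 62

  # Peel off the least significant base-62 digit until nothing is left.
  def self.encode(number)
    unless number.is_a?(Integer) && number >= 0
      raise ArgumentError, 'expected a non-negative integer'
    end
    return DIGITS[0] if number.zero?

    encoded = ''
    while number > 0
      number, remainder = number.divmod(BASE)
      encoded = DIGITS[remainder] + encoded
    end
    encoded
  end

  # Fold the digits back together, most significant first.
  def self.decode(string)
    string.each_char.inject(0) { |total, char| total * BASE + DIGITS.index(char) }
  end
end

puts Base62.encode(1234567890)
puts Base62.decode('1LY7VK')
```

For bases up to 36, Ruby's built-in Integer#to_s(base) and String#to_i(base) already do this; base 62 needs a custom character set because it distinguishes upper and lower case.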

Ruby on Rails Diff Text to HTML <ins> and <del>

This code is perfect if you have 2 text objects in your Rails application and you want to compare their differences in one of your HTML views. It’s 99% pure Ruby too, so if you alter the first line, you can use it for other purposes.

Only one thing to note: you must have diff installed. I’m using: diff (GNU diffutils) 2.8.1.

#set up some variables to reference later
temporary_directory = File.join(Rails.root, "tmp")
max_lines = 9999999 #needs to be larger than the most lines you'll consider
diff_header_length = 3

# text_old and text_new should be the values of the string objects to compare
# these are just example strings to show it works
text_old      = "line1\ndeleted line2\nline3\n\nline4\nline5"
text_new      = "line1\ninserted line2\nline3\n\nline4\nline5"

# since we're using diff on the file system, we'll save the text we want to compare
# and then run diff against the two files
file_old_name = File.join(temporary_directory,"file_old"+rand(1000000).to_s)
file_new_name = File.join(temporary_directory,"file_new"+rand(1000000).to_s)
file_old      = File.new(file_old_name, "w+")
file_new      = File.new(file_new_name, "w+")
file_old.write(text_old+"\n")
file_new.write(text_new+"\n")
file_old.close
file_new.close

# diff will provide a string showing insertions and deletions.  We will
# split this string out by newlines if there are differences, and mark it up
# accordingly with html
lines = %x(diff --unified=#{max_lines} #{file_old_name} #{file_new_name})
if lines.empty?
  lines = text_new.split(/\n/)
else
  lines = lines.split(/\n/)[diff_header_length..max_lines].
  collect do |i|
    if i.empty?
      ""
    else
      case i[0,1]
      when "+"; then "<ins>"+i[1..i.length-1]+"</ins>"
      when "-"; then "<del>"+i[1..i.length-1]+"</del>"
      else; i[1..i.length-1]
      end
    end
  end
end

#clean up the temporary diff files we created
File.delete(file_new_name)
File.delete(file_old_name)

#return marked up text
lines.join("\n")

If you fire up RAILS_ROOT/script/console and paste that code in, it will return a nicely marked up string like this:

line1
<del>deleted line2</del>
<ins>inserted line2</ins>
line3

line4
line5

Use CSS to make your ins and del tags render however you like.
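One thing the script does not do is escape the text before wrapping it in tags, which matters if the compared strings come from users. Here is a rough sketch of the markup step on its own, with escaping added. It takes the raw output of diff --unified as a string, so it runs without touching the file system. The helper name diff_to_html is made up, and the default of three header lines matches diff_header_length above.

```ruby
require 'erb'

# Hypothetical helper: convert unified diff output into <ins>/<del> markup.
# The first header_length lines (the ---, +++ and @@ lines) are skipped.
def diff_to_html(diff_output, header_length = 3)
  body = diff_output.split("\n")[header_length..-1] || []
  body.map do |line|
    # Drop the one-character diff marker, then escape the rest for HTML.
    text = ERB::Util.html_escape(line[1..-1].to_s)
    case line[0, 1]
    when '+' then "<ins>#{text}</ins>"
    when '-' then "<del>#{text}</del>"
    else text
    end
  end.join("\n")
end

sample = "--- file_old\n+++ file_new\n@@ -1,3 +1,3 @@\n line1\n-deleted <b>\n+inserted line2\n line3"
puts diff_to_html(sample)
```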

Ruby Script to Search Apache Logs for High Frequency Clients

I wrote a quick Ruby script to scour through my Apache access logs and look for IPs that are hitting my site too frequently, e.g., bad bots, etc. The command line arguments are simple:

$ ruby find-frequent-clients.rb \
--apache-access-log=/path/to/your/log \
--seconds=3600 \
--request-limit=7200 \
--log-time-zone=PST

That command finds any client IP that has hit my web server more than 7,200 times in the last hour (3,600 seconds), i.e., more than twice per second on average. The output is a newline-separated list of IP addresses (optionally with a hit count if --show-count=1 is added). Here’s how it works:

File: find-frequent-clients.rb
require 'date'
require 'time'
# Process command line arguments.  Filter only args starting with --
args = {}
$*.each do |arg|
  spl=arg.split("=")
  if spl[0][0..1] == "--"
    args[spl[0][2..spl[0].length-1].gsub("-","_").intern]=spl[1]
  end
end

# Check that we have the bare essentials to proceed
raise "You must specify the full path to an Apache access log file with --apache-access-log" unless args[:apache_access_log]
raise "You must specify the maximum amount of recent seconds to consider with --seconds" unless args[:seconds]
raise "You must specify the maximum requests allowed per #{args[:seconds]} seconds with --request-limit" unless args[:request_limit]
raise "You must specify the time zone of the Apache logs with --log-time-zone e.g., EST" unless args[:log_time_zone]
raise "The Apache access log file specified does not exist or is not readable: #{args[:apache_access_log]}" unless FileTest.readable?(args[:apache_access_log])

# Open the file and read the lines in reverse; exit once time stamps are beyond our time threshold
file = File.open(args[:apache_access_log], "r")
log_array = []
log_snapshot = file.readlines
file.close
start_time = Time.now.to_i
log_snapshot.reverse_each do |line|
  line_array = line.split(" ")
  date_time = line_array[3][1..line_array[3].length-1]
  date_time[11] = " "
  date_time = Time.iso8601(DateTime.parse(date_time+" "+args[:log_time_zone]).to_s).to_i
  if date_time > (start_time - args[:seconds].to_i)
    log_array << [line_array[0], date_time]
  else
    break
  end
end

# Use a hash to collect the counts of the IPs
log_hash = Hash.new(0)
log_array.each do |log|
  log_hash[log[0]]+=1
end

# collect the offenders in an array
offenders = log_hash.to_a.collect{|h| h if h[1] > args[:request_limit].to_i}.compact

# output the offending IPs, 1 per line; optionally show the offending count
offenders.each{|o| puts o[0].to_s+"#{" => "+o[1].to_s if args[:show_count]}"}

Note: This makes the assumption that your logs are in the format: aa.bb.cc.dd - - [datetime]
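The counting step can also be sketched as a small function that works on an array of log lines. This variant is an assumption-laden sketch, not the script above: it expects the standard Apache timestamp with a numeric offset, e.g. [10/Oct/2000:13:55:36 -0700], and reads the zone from the log line itself instead of a --log-time-zone argument. The name frequent_clients and its parameters are hypothetical.

```ruby
require 'time'

# Hypothetical helper: return { ip => hits } for clients with more than
# `limit` requests in the last `window` seconds. Lines are read newest first,
# and reading stops at the first timestamp outside the window.
def frequent_clients(lines, window, limit, now = Time.now)
  counts = Hash.new(0)
  lines.reverse_each do |line|
    ip, _ident, _user, stamp, zone = line.split(' ', 6)
    time = Time.strptime("#{stamp} #{zone}", '[%d/%b/%Y:%H:%M:%S %z]')
    break if time.to_i <= now.to_i - window
    counts[ip] += 1
  end
  counts.select { |_ip, hits| hits > limit }
end

# Made-up sample lines in the format described in the note above.
sample_log = [
  '1.1.1.1 - - [10/Oct/2000:12:00:00 -0700] "GET / HTTP/1.0" 200 1',
  '2.2.2.2 - - [10/Oct/2000:13:55:00 -0700] "GET / HTTP/1.0" 200 1',
  '2.2.2.2 - - [10/Oct/2000:13:56:00 -0700] "GET / HTTP/1.0" 200 1',
  '3.3.3.3 - - [10/Oct/2000:13:57:00 -0700] "GET / HTTP/1.0" 200 1'
]
offenders = frequent_clients(sample_log, 600, 1, Time.utc(2000, 10, 10, 21, 0, 0))
offenders.each { |ip, hits| puts "#{ip} => #{hits}" }
```

Passing the current time as a parameter makes the function easy to check against a fixed sample like the one here.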

Skip All Rails Filters

It took me a while to figure this out, but it’s quite simple.  If you want to skip all of the filters a Rails controller will run, simply put the following at the top of your controller:

skip_filter filter_chain #both documented in the Rails API

For example, if your application controller defines a filter to check if a user is logged in, it makes sense that this filter might run for all controllers, except in rare cases.  In my case, I have a dynamic image controller that doesn’t require all of the overhead that most controllers do.  For that controller, I use the above to skip all of the filters.

Ruby On Rails RSS Reader

We moved our Athlo blog to a WordPress app to separate it completely from the main app. One interaction I still wanted between the two was for the most recent blog entries to show on the Athlo site. I thought that RSS would offer an easy solution, so I started looking around to find out if I’d need a Rails plugin or something like that.

The solution was far simpler. And pure Ruby (man I love this language!).

require 'rss'
require 'open-uri' # lets open() fetch URLs; on Ruby 3.0+ call URI.open instead
rss = RSS::Parser.parse(open('http://blog.athlo.com/feed/').read, false).items[0..MaxRSSItems-1]

That’s it. That simple call gives you an array of items from the RSS feed. In my specific example, I’ve used a range to limit the results to the value of MaxRSSItems.

No plugins required. No Rails required. Ruby RSS will do what you need to read feeds. (That should be in a poem.)
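To see what comes back without hitting the network, here is a small sketch that parses a feed held in a string. The sample feed, the constant SAMPLE_FEED, and the helper recent_items are all made up for illustration. On newer Ruby versions the rss library ships as a bundled gem, so it may need to be listed in a Gemfile.

```ruby
require 'rss'

# Made-up sample feed; in practice this string would be the body of the HTTP response.
SAMPLE_FEED = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Sample feed for illustration</description>
    <item><title>Third post</title><link>http://example.com/3</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>First post</title><link>http://example.com/1</link></item>
  </channel>
</rss>
XML

# Hypothetical helper: return [title, link] pairs for the first max_items entries.
def recent_items(xml, max_items)
  feed = RSS::Parser.parse(xml, false) # false skips strict validation, as in the post
  feed.items.first(max_items).map { |item| [item.title, item.link] }
end

recent_items(SAMPLE_FEED, 2).each { |title, link| puts "#{title}: #{link}" }
```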