When importing data into a database you may want your Active Record models to help. Perhaps you want to apply the validation rules in your model classes, or you just want to keep the code in your Rails project's rake tasks. You'd like to use Active Record for the job, but the data outside the database has a lot of records, and creating each record individually may take too much time.

If you have on the order of hundreds of millions or billions of records to import, especially if you need to import the data more than once, you are best off using the facilities provided by your database software. For instance, MySQL lets you load CSV formatted data directly into tables with 'load data infile'. If the data is not ready to use, you can clean it up ahead of time or use the database to remove or fix invalid records.
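To make that route concrete, here is a sketch of a MySQL 'load data infile' statement; the table name, file path, and CSV layout are hypothetical, and you would hand the finished string to ActiveRecord::Base.connection.execute or run it in the mysql client:

```ruby
# Sketch only: build a LOAD DATA INFILE statement for a hypothetical
# ssa_names table fed from /tmp/ssa_names.csv. The clauses describe a
# comma-separated file with one header line to skip.
sql = <<-SQL
  LOAD DATA LOCAL INFILE '/tmp/ssa_names.csv'
  INTO TABLE ssa_names
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\\n'
  IGNORE 1 LINES
SQL
```

This keeps all the work inside the database server, which is why it is so much faster than row-at-a-time inserts.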

If you have a moderate sized set of records to import, you should consider using Active Record Extensions (ar-extensions). Run 'gem install ar-extensions' to get the latest version.

You can find plenty of information on ar-extensions elsewhere, including here and in the project documentation. I will focus on the import() method ar-extensions provides to insert and validate lots of records quickly, compared to calling 'create' over and over on your model classes.

I have a simple class to wrap the import method and handle details of the import process. The first example shows how to use this class to import a CSV file into a table. The next example will be more interesting; it shows how to import fixed length record files into a database table. The CSV files are not too hard to deal with, and you could use the vendor supplied tools to import them. The fixed length record files are harder to work with unless you craft your own solution. Here is the Importer class:

class Importer
  RowGroupSize = 25_000

  def initialize(model, validate, fields = nil)
    @validate = validate
    @model = model
    @rows = []
    @fields = fields
    @input_records_type = @fields ? :array : :active_record
    @total_rows_imported = 0
  end

  def <<(record)
    @rows << record
    if @rows.size >= RowGroupSize
      import_row_group(@rows)
      @rows = []
    end
  end

  def import_row_group(rows)
    return if rows.empty?
    if @input_records_type == :array
      @model.import(@fields, rows, :validate => @validate)
    else       # Rows from FasterCSV
      @model.import(rows.map { |r| @model.new(r.to_hash) }, :validate => @validate)
    end
    @total_rows_imported += rows.size
  end

  def close
    import_row_group(@rows)
  end

end
The above class is pretty straightforward. The most interesting points are:
  • You pass to the initializer the model class backed by the table you want to fill (not an instance of the model class).
  • You pass either simple arrays or rows from FasterCSV to the '<<' method. You can get arrays from rows read with the regular CSV class, or from records stored as Structs via Struct#values().
  • You can call the initializer without the field names, in which case the importer will use Active Record to determine the names of the columns in the target table. The advantage is that you don't need to know the field names or pass them to Importer.new() (when you do pass them, they must be in the same order as the data you send via '<<'). The downside is that the importer then creates an Active Record instance for every row you send in, slowing down the import significantly: an import taking 20 seconds with the field names supplied took 35 seconds without them.
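To make the array case concrete, here is how Struct-backed rows flatten into the arrays the '<<' method expects (the Name struct and its members are hypothetical, matching the ssa_names columns used later):

```ruby
# Hypothetical struct whose members mirror the ssa_names columns.
Name = Struct.new(:rank, :sex, :name, :occurrences, :decade)

row = Name.new(1, 2, "MARY", 7065, 1880)
row.values  # => [1, 2, "MARY", 7065, 1880]
# Feeding the importer would then look like: importer << row.values
```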

The Importer class also handles batching of the data. When it has collected enough rows through '<<' it calls import(). You cannot simply send very large groups of rows to the database server; the server has limits on the number of rows it can handle at once in an insert.
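Stripped of the Active Record details, the grouping amounts to no more than this standalone sketch (a tiny group size is used here for illustration; Importer uses 25,000):

```ruby
# Split rows into fixed-size groups, the way Importer flushes batches
# of RowGroupSize rows. A real import would call Model.import on each
# group instead of collecting it.
def in_groups(rows, group_size)
  groups = []
  rows.each_slice(group_size) { |g| groups << g }
  groups
end

in_groups((1..8).to_a, 3)  # => [[1, 2, 3], [4, 5, 6], [7, 8]]
```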

Examples:

Let’s say we have two tables of names we wish to populate. We have migrations for both:


class CreateSSANames < ActiveRecord::Migration
  def self.up
    create_table :ssa_names, :options => 'engine=MyISAM' do |t|
      t.integer :rank, :default => 0
      t.integer :sex, :null => false
      t.string :name, :limit => 50, :null => false
      t.integer :occurrences, :default => 0
      t.integer :decade, :null => false
    end
  end

  def self.down
    drop_table :ssa_names
  end
end

class CreateCensusNames < ActiveRecord::Migration
  def self.up
    create_table :census_names, :options => "engine=MyISAM" do |t|
      t.string :surname, :limit => 50
      t.integer :nativity, :null => false
      t.string :given_name, :limit => 50
    end
  end

  def self.down
    drop_table :census_names
  end
end

(We’ll ignore indices for now.)

And we have models backed by the above tables:


class SSAName < ActiveRecord::Base
  validates_presence_of :name
end

class CensusName < ActiveRecord::Base
  validates_presence_of :surname
end

So, to import a simple CSV file of SSA names into the ‘ssa_names’ table we could write the following rake task:


require 'csv'
require 'ar-extensions'

def load_from_csv(model, filename)
  rows = CSV::parse(File.open(filename, "r") { |f| f.read })
  fields = rows.shift
  importer = Importer.new(model, false, fields)
  rows.each { |row| importer << row }
  importer.close
end

desc "Import SSA names from file"
task :load_ssa_names => [:environment] do |t|
  filename = ENV['file']
  load_from_csv SSAName, filename
end

Since the task reads the file name from the environment, you run it as 'rake load_ssa_names file=ssa_names.csv'.

Or using FasterCSV you could rewrite the load_from_csv method as:

require 'fastercsv'

def load_with_fastercsv(model, filename)
  options = {
    :headers => true,
    :header_converters => :symbol,
    :col_sep => ','
  }

  importer = Importer.new(model, false)
  FasterCSV.foreach(filename, options) do |row|
    importer << row
  end
  importer.close
end

The more interesting data import problem to solve is how to easily import data that is not in as convenient a form as CSV, such as fixed length record files. To continue with our names example, let's say we have names in a fixed length format file that we want to import into the 'census_names' table. The records look like this:

HEMPHILL        1LINDA H
WH??TINE        1WILLIE L
? SHUCKLEFORD   1SALLIE
SLONDENONISE    1JOHN
SLONDENONISE    1EGGIE
SLONDENONISE    1HARRY J
SLONDENONISE    1FANNIE B
DURAMUS         1JAMES C
DURAMUS         1BERTHA
DURAMUS         1LOUISE
TAYLOR          1ORSBORN
TAYLOR          1EMMA
TAYLOR          1? ELIZA
TAYLOR          1BESSIE
TAYLOR          1ORSBORN
TAYLOR          1ESTELLE
TAYLOR          1ANDERSON
PATTERSON       1SADIE
PATTERSON       1ETHEL
PATTERSON       1OLLIE
PARKER          1BAKER J

Note that this is a contrived example; it’s very easy to split up into the right fields. But consider the case where the input has hundreds of columns and you can only identify the meaning of half of the columns. That is a quite plausible situation.
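For a simple layout like this one, you could split the records yourself with String#unpack. Here is a sketch; the field widths are guesses from the sample above (a 16-character surname, a one-digit nativity code, and the given name running to the end of the line), and the parse_census_line helper is made up for illustration:

```ruby
# Parse one fixed-length census record into its fields. The 'A'
# directive extracts a fixed-width field and strips trailing spaces;
# 'A*' takes the remainder of the string.
def parse_census_line(line)
  surname, nativity, given_name = line.chomp.unpack("A16A1A*")
  { :surname => surname, :nativity => nativity.to_i, :given_name => given_name }
end

parse_census_line("HEMPHILL        1LINDA H")
# => {:surname=>"HEMPHILL", :nativity=>1, :given_name=>"LINDA H"}
```

Rows parsed this way come out as hashes or arrays you can feed straight to the Importer, but hand-maintaining the widths gets painful as the number of fields grows.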

In the next section I'll show how to use the 'hflr' gem to parse this file, as well as files with a more complex structure.