Last time I discussed the FLR format and showed how to use the ‘hflr’ gem to read in FLR format and produce Ruby structs.
In this post I’ll demonstrate how to combine ‘hflr’ with a simple importer class to load a database with FLR data. Then I’ll show you how to get some real-world FLR data to import. Finally, I’ll discuss specific issues that come up when importing this data and explore the IPUMS microdata just a bit.
Efficiently Import Data
One could simply write straightforward ActiveRecord code to import small numbers of records into a database.
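A minimal sketch of that naive approach, assuming a ‘Customer’ ActiveRecord model and the ‘flr_file’ reader from the previous post (the field names are just the ones used in the later snippets), might look like this:

# Naive import: one ActiveRecord object and one INSERT statement per record.
flr_file.each_record do |record|
  Customer.create!(name: record.name, street: record.street, zip: record.zip)
end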
However, performance will degrade unacceptably as the size of the import grows to a few thousand records.
What’s needed is a way to avoid (1) instantiating every record as an ActiveRecord object, and (2) issuing a separate insert statement to the database for each of those records. The ‘activerecord-import’ gem groups the inserts and, optionally, avoids instantiating ActiveRecord objects altogether. In essence, ‘activerecord-import’ lets you pass in a large set of data as arrays, hashes or ActiveRecord objects and do one insert for the lot of them. Doing this speeds up imports by around fifty times, though the exact gain depends on the database type, the hardware, and whether validations are enabled (they can be turned off). Passing data to ‘import’ as arrays or hashes is substantially faster than passing ActiveRecord objects.
Using activerecord-import, your code would look like this:
records_to_import = []
batch_size = 20_000 # arbitrary but reasonable batch size
fields = [:name, :street, :zip]

flr_file.each_record do |record|
  records_to_import << record.to_a
  if records_to_import.size == batch_size
    Customer.import(fields, records_to_import)
    records_to_import = []
  end
end

# Import any remaining records (fewer than batch_size)
Customer.import(fields, records_to_import)
The ‘import’ method can take an array of arrays together with a ‘fields’ argument giving the order of the fields in each inner array. Alternatively, ‘import’ may take an array of hashes, which removes the need for the ‘fields’ argument since each hash pairs field names with their values. You may instead pass in already-instantiated ActiveRecord models, but this will be slower than the first two options.
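In hash form, the loop above would buffer hashes instead of arrays; a sketch, using the same field names:

# Each hash names its own columns, so no ‘fields’ argument is needed.
records_to_import << { name: record.name, street: record.street, zip: record.zip }
# ...and when the batch is full:
Customer.import(records_to_import)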
We can package up the buffering logic shown in the code snippets above, along with a way to manage the various options for passing data to ‘import’. You can find the full source in the file ‘importer.rb’ in the flrdata example code on github.com.
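The real ‘importer.rb’ handles more options, but a minimal sketch of the buffering idea (the class and method names here are illustrative, not the actual source) looks something like this:

# Illustrative sketch only; see ‘importer.rb’ in the example repository.
class BatchImporter
  def initialize(model, fields, batch_size = 20_000)
    @model = model
    @fields = fields
    @batch_size = batch_size
    @buffer = []
  end

  # Buffer one row (an array of values in ‘fields’ order); flush when full.
  def add(row)
    @buffer << row
    flush if @buffer.size == @batch_size
  end

  # Send whatever is buffered to the database in a single insert.
  def flush
    @model.import(@fields, @buffer) unless @buffer.empty?
    @buffer = []
  end
end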
Example Data
In the Git repository there’s an ‘input_data’ directory with a sample data file ‘usa_00001.dat’. This is a slice of a much larger data file created with the IPUMS-USA data extraction service. The dataset will let us study the housing boom, bust and recovery. The data comes from the American Community Survey, covering the years 2001 to 2013. We have both household records and person records in this dataset. We can create some interesting tables using just the household records, but some tabulations will require the people associated with those households.
The example tabulations that will follow later on were done with the full dataset. The slice of data provided in the example source is only for testing the import code.
To learn more about IPUMS and to recreate the full 3.3 GB dataset used in our example, see usa.ipums.org. You may browse the system and begin requesting data without signing in, but to get your data you must create a free account.
The ‘codebook.txt’ and ‘usa_0001.sps’ files were created by the data extract system and were downloaded with the data file. The ‘codebook.txt’ file is a human-readable description of the data file, and the .sps file is a script for the SPSS statistical analysis software, to help you read the data into SPSS. The extract system also produces files for SAS and Stata. For extracts with only one record type, the extraction service also offers CSV and the native binary formats of all three statistics applications.
To figure out what to pass to the FLRFile class initializer, you’d look at the code book or possibly the SPSS file supplied with the dataset:
record type "H".
data list /
RECTYPE 1-1 (a)
YEAR 2-5
DATANUM 6-7
SERIAL 8-15
HHWT 16-25 (2)
HHTYPE 26-26
STATEICP 27-28
METAREA 29-31
METAREAD 32-35
CITY 36-39
CITYPOP 40-44
GQ 45-45
OWNERSHP 46-46
OWNERSHPD 47-48
MORTGAGE 49-49
MORTGAG2 50-50
ACREHOUS 51-51
MORTAMT1 52-56
MORTAMT2 57-60
TAXINCL 61-61
INSINCL 62-62
PROPINSR 63-66
OWNCOST 67-71
RENT 72-75
RENTGRS 76-79
CONDOFEE 80-83
HHINCOME 84-90
VALUEH 91-97
.
record type "P".
data list /
RECTYPE 1-1 (a)
YEAR 2-5
DATANUM 6-7
SERIAL 8-15
PERNUM 16-19
PERWT 20-29 (2)
RELATE 30-31
RELATED 32-35
SEX 36-36
AGE 37-39
MARST 40-40
RACE 41-41
RACED 42-44
HISPAN 45-45
HISPAND 46-48
BPL 49-51
BPLD 52-56
YRIMMIG 57-60
SPEAKENG 61-61
RACESING 62-62
RACESINGD 63-64
INCTOT 65-71
INCINVST 72-77
.
end file type.
This is fairly convenient, and if you regularly needed to import IPUMS data you might write a simple script to parse this sort of code book. Whatever the source of your FLR data you’ll need a description with comparable information.
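For instance, a quick-and-dirty parser for the ‘data list’ sections shown above (a sketch, not part of the example repository) might look like this:

# Sketch: pull variable names and column ranges out of an SPSS-style codebook.
layouts = {}
current = nil
File.foreach("usa_0001.sps") do |line|
  case line
  when /record type "(\w)"/
    current = layouts[$1] = []
  when /^\s*(\w+)\s+(\d+)-(\d+)/
    current << [$1.downcase.to_sym, $2.to_i, $3.to_i] if current
  end
end
# layouts["H"] #=> [[:rectype, 1, 1], [:year, 2, 5], [:datanum, 6, 7], ...]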
In our example, we’re loading data into a database, so we need to define database tables with corresponding columns. See the ‘schema.rb’ for the translation of the above SPSS code into an Active Record migration script like one found in a Rails project. Again, you’d probably write a migration generator script if importing extracts was a frequent task.
As you can see from the ‘schema.rb’ migrations, we’re making two tables corresponding to the two record types in the data file ‘usa_00001.dat’: household and person. We’ll use Rails conventions and call these tables ‘households’ and ‘people’.
Since the data come with keys to link the two tables together, we’ll make sure to index the columns holding those keys. The ‘households’ table has a column ‘serial’ which will get the value of the SERIAL variable, a unique household serial number. Within one dataset SERIAL is unique; since we’re extracting data across thirteen datasets, we’ll also need ‘ACSYR’ (the year) to differentiate records. The ‘people’ table has a ‘serialp’ column that will hold the SERIALP variable; SERIALP holds the serial number of the household the person record belongs to. Again, we need ACSYR (which is on the ‘people’ table as well) to differentiate between datasets within the ‘people’ table. The tables already have the Rails-standard ‘id’ columns as unique keys. A nice feature for the importer would be to add a ‘household_id’ column to the people table and copy in the value of ‘id’ from the correct household record, so we wouldn’t need composite keys in our joins. (I’ve done this in production code but it’s rather specialized and would get in the way of the example code.)
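As a rough sketch (column lists abbreviated; see ‘schema.rb’ in the example code for the full definitions), the relevant parts might look like this:

# Sketch only; the real ‘schema.rb’ defines every variable from the codebook.
ActiveRecord::Schema.define do
  create_table :households, force: true do |t|
    t.integer :acsyr
    t.integer :serial
    t.integer :hhwt
    t.integer :stateicp
    t.integer :ownershp
    # ...one column per household variable
  end
  add_index :households, [:serial, :acsyr]

  create_table :people, force: true do |t|
    t.integer :acsyr
    t.integer :serialp
    t.integer :perwt
    t.integer :sex
    t.integer :age
    # ...one column per person variable
  end
  add_index :people, [:serialp, :acsyr]
end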
To read the data we use the ‘hflr’ gem as in the above example code and in the previous post. We need to provide a layout to ‘hflr’. To see the HFLR layout data, look in ‘extract_layout.rb’.
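Whatever its exact shape (check the ‘hflr’ documentation and ‘extract_layout.rb’ for the format the gem actually expects; the structure below is only illustrative), the layout boils down to mapping each record type to its variable names and column positions:

# Illustrative only: the information a layout must carry, taken from the SPSS
# ‘data list’ above. See ‘extract_layout.rb’ for the format hflr expects.
LAYOUT = {
  "H" => [[:rectype, 1..1], [:year, 2..5], [:serial, 8..15], [:hhwt, 16..25]],
  "P" => [[:rectype, 1..1], [:year, 2..5], [:serial, 8..15], [:pernum, 16..19], [:perwt, 20..29]]
}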
To run the example, just do
ruby import.rb
You may use JRuby instead. The required gems are given at the top of the ‘import.rb’ file.
The import process should take a few seconds on the data included with the example code. Before importing data, the script executes the migrations in the ‘schema.rb’ file, rebuilding the database schema from scratch. The full dataset referenced in the ‘codebook.txt’ file has approximately thirty-one million person records and fourteen million household records. It will take between one and five hours to import depending on your storage hardware, CPU and Ruby version.
Once the import has finished we should do a few simple queries to verify that all records got imported. To make effective use of this data you’d need to read documentation at usa.ipums.org. You can begin exploring the data just with the variable category labels included in the codebook, however. For now, you just need to know that each household record has a weight variable and each person record has a person weight variable. The value of these variables, in the ‘households.HHWT’ and ‘people.PERWT’ columns respectively, indicates the number of people or households the record represents from the larger population, as this dataset is only a sample of the entire United States population. Divide the weight values by 100 (the two least significant digits are fractional amounts) to get a useful whole number version of the weight. For instance, the value 10100 really means “this record represents 101.00 records like it.”
So, to see how many people the U.S. Census estimated were in the country in 2013, you would do:
sqlite> select sum(PERWT/100) from people where acsyr =2013;
316128839
sqlite>
And we can check the population growth:
sqlite> select sum(perwt/100), acsyr from people group by acsyr;
277075792|2001
280717370|2002
283051379|2003
285674993|2004
288398819|2005
299398485|2006
301621159|2007
304059728|2008
307006556|2009
309349689|2010
311591919|2011
313914040|2012
316128839|2013
sqlite>
From this result we know not only that the right number of person records were imported but that they have the correct weights, which suggests the rest of the data in those records is probably correct as well. Of course we’d have to check every column to be absolutely sure. As a final check we can join the people with their household data to get geography and housing information for the state of Arizona (STATEICP = 61) for every year in the data. The OWNERSHP variable distinguishes between people living in rented households (2) and owner-occupied households (1); OWNERSHP = 0 covers other living situations like prisons, military bases, dormitories, etc.
sqlite> select people.acsyr,ownershp,sum(perwt/100)
>from households,people
>where people.serialp=households.serial and stateicp=61 and people.acsyr=households.acsyr
>group by people.acsyr,ownershp;
2001|1|3564953
2001|2|1618471
2002|1|3724464
2002|2|1625033
2003|1|3851262
2003|2|1638885
2004|1|3948678
2004|2|1689667
2005|1|4043947
2005|2|1762319
2006|0|109501
2006|1|4185780
2006|2|1871037
2007|0|109370
2007|1|4305747
2007|2|1923638
2008|0|119162
2008|1|4327201
2008|2|2053817
2009|0|119010
2009|1|4317435
2009|2|2159333
2010|0|139384
2010|1|4086287
2010|2|2188066
2011|0|143926
2011|1|4028111
2011|2|2310468
2012|0|149267
2012|1|3991930
2012|2|2412058
2013|0|147624
2013|1|4006706
2013|2|2472294
This all looks quite believable. It’s interesting to notice how the ratio of people living in rentals to those living in owner-occupied households changed over the boom and bust cycle of the housing market. There are a few more tables included in the example source repository that explore the topic further.
More Efficient Analysis
The example pushes Sqlite3 to its limits; if you recreated my extract in full you’d need to wait an hour or two just to import it. Some of the queries require five or more minutes to run, even though the necessary indexes are present. (This shows why analyzing hierarchical data in a relational database can be challenging.)
The first option, clearly, is to use the same import script with a full-fledged database server on the back end. MySQL or Postgres will not import much more quickly, but the queries will execute faster in many cases, especially if your server has a lot of memory and is tuned correctly. I’ve provided ‘pivot()’ and ‘pivoted_html_table()’ functions because Sqlite3 doesn’t support pivoting result sets; you won’t need those helpers with a full-featured database.
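The real helpers are in the example repository; the basic idea is roughly the following sketch (not the actual source), which turns (row key, column key, value) tuples into a row-per-key, column-per-key table:

# Sketch of the pivoting idea used for the tabulations in this post.
def pivot(rows)
  columns = rows.map { |_, col, _| col }.uniq.sort
  table = Hash.new { |h, k| h[k] = {} }
  rows.each { |row_key, col, value| table[row_key][col] = value }
  [columns, table]
end

# e.g. pivot([[2006, 0, 109501], [2006, 1, 4185780], ...]) gives one row per
# year with a column per OWNERSHP value.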
If you want to run frequent ad-hoc queries on this data, you might find that even MySQL, Postgres or other row-oriented databases are painfully slow, especially when joining the two tables. Column-store databases allow very fast tabulations across large tables when the tabulation involves a limited number of columns. You could use a script similar to the example to load a column-oriented database such as InfiniDB; you can download InfiniDB from MariaDB, or try Infobright.
There’s a plug-in for Postgres which may not perform quite as well as the above two options but integrates with a standard Postgres install and is super convenient. See Citus Data for a quick introduction and installation instructions.
Alternatives to Databases
Consider looking into the IPython / IRuby projects. IPython provides tools for analyzing and visualizing data locally; IRuby is built on the same libraries but lets you script your work in Ruby.
Get the IRuby project source and documentation from IRuby on github.com. You can also install IRuby with
gem install iruby
You should look into the Jupyter project as well (the name comes from Julia, Python and R). Many other languages are now supported besides those three. Get it via the IPython page.
When using IRuby or Jupyter you’d find the ‘hflr’ gem useful for formatting your hierarchical fixed-length data into sets of CSV files for easy consumption. You may run into size limitations, but not even with the full-sized extract from my example code.
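As a rough sketch (the ‘flr_file’ reader is as in the import example; whether the yielded structs expose a ‘rectype’ member is an assumption based on the layout above), splitting the extract into one CSV per record type could look like this:

require "csv"

# Sketch: write household and person records to separate CSV files for
# loading into IRuby / Jupyter; the record-type test is an assumption.
household_csv = CSV.open("households.csv", "w")
person_csv = CSV.open("people.csv", "w")

flr_file.each_record do |record|
  if record.rectype == "H"
    household_csv << record.to_a
  else
    person_csv << record.to_a
  end
end

household_csv.close
person_csv.close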
Have fun.