DCP – Data Conversion Program

(Part 1 of 3 in a series)

DCP creates the MPC version of the microdata distributed publicly for the following data projects:

DCP Data

  • IPUMS-USA: U.S. historical census data samples from 1850 to the present; 102 datasets.
  • IPUMS-International: Census samples from 89 countries; 243 datasets at last count.
  • CPS, the Current Population Survey (a monthly U.S. BLS economic survey): 1962 to present; 306 datasets complete, with many more to come.
  • NAPP, the North Atlantic Population Project: Full historical censuses; 27 datasets so far.
  • IHIS, the U.S. National Health Interview Survey: 1963 to present, with a huge number of questions; 50 datasets so far.
  • DHS, the Demographic and Health Surveys: An international health survey with lots of questions and many record types; 39 datasets, with many more to come.
  • ATUS, the American Time Use Survey: 2003 to present; 10 datasets. Integrates with CPS and has many record types.
  • AHTUS, the American Historical Time Use Survey: From the 1960s to 2001. Work on it has only just started at MPC.
  • MTUS, the Multinational Time Use Study: International time use data (coming soon).
  • Terra Populus: IPUMS-International datasets are sent to Terra Populus.
  • U.S. Census Internal: The 1960 full-count census, used inside a Restricted Data Center.
  • SESTAT: A survey of people in science and engineering (coming soon).

What’s Special About MPC Microdata?

Until the IPUMS (Integrated Public Use Microdata Series) project, social science researchers needed to hire programmers or write code themselves to use more than one machine-readable U.S. census sample at once. Research that spanned even two decades, let alone longer time series studies, was therefore expensive. And even when the work was done it wasn’t necessarily reproducible by other researchers, since each would develop their own methods. IPUMS dramatically lowered the cost of time series research on the U.S. population and made it easier in general to use census microdata in research.

IPUMS did two things: (1) produced comparable “harmonized” U.S. public use census data, and (2) produced new machine-readable samples of decades not available prior to IPUMS.

Following the success of IPUMS, the Minnesota Population Center (first called the Historical Census Project) applied similar methods to international census data that was already machine readable. At nearly the same time, MPC took on the harmonization of CPS and NAPP data. A few years later the other data projects were added.

The Data Conversion Program (DCP) harmonizes all the microdata in all the projects listed above. The harmonized data has these advantages:

  • Coding schemes for common variables are consistent across all datasets within a project (such as CPS).
  • Many new common variables are introduced to the datasets for added comparability across projects.
  • New “constructed” variables, such as relationship pointers, are added to the data (see the sketch after this list).
  • Missing values due to historical data entry or transcription errors are given imputed values to improve data quality.
  • Data is checked for consistency problems and corrected when possible.
  • There is one set of documentation for all datasets in a project.
  • Weight variables are provided to adjust results to match the whole population.
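
To give the flavor of a constructed variable, here is a minimal Ruby sketch of a pointer variable, loosely in the spirit of a mother-location pointer. The record layout and the matching rule are invented for illustration; the real DCP rules are far more careful.

def mother_pointers(persons)
  persons.map do |person|
    # Naive illustrative rule: point each person at the first woman in the
    # household at least 15 years older; 0 means no mother was located.
    mother = persons.find do |candidate|
      candidate[:sex] == "F" && candidate[:age] >= person[:age] + 15
    end
    mother ? mother[:pernum] : 0
  end
end

household = [
  { pernum: 1, sex: "M", age: 40 },
  { pernum: 2, sex: "F", age: 38 },
  { pernum: 3, sex: "M", age: 10 },
]
p mother_pointers(household)   # => [0, 0, 2]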

What DCP Does

DCP takes two inputs:

  1. A standardized-format input data file.
  2. Metadata describing the output format and coding scheme for the project, and the input format and coding scheme for the input data.

Input Data

The input data file must consist of blocks of data. Typically these blocks are households: a household record followed by records for all the people in the household. Blocks could instead be surveyed people, each followed by the activities that person did that day. The main point is that the data is either flat or strictly hierarchical. Each record must have a record type identifier.

Currently DCP expects the input data to be fixed-width records, as this was historically the standard format for census data, though CSV for non-hierarchical data would be trivial to support instead. Typically the input data is sorted geographically, but there’s no requirement for this; it simply improves data quality. The record type identifier is conventionally the first character on each line in the file, but this isn’t part of the standard.

Here’s a sample from the unit tests. Imagine this kind of data, but with lines hundreds or thousands of characters wide and files millions of lines long.

H0020
P0351JONES       99
A110801
H0020
P0351JONES       99
P0351JONES       99
P0351JONES       99
A110801
H0020
P0351JONES       99
A110801
P0351JONES       99
P0351JONES       99
H0020
H0020
P0351JONES       99
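
To make the block structure concrete, here is a minimal Ruby sketch (not DCP itself) that groups records like those above into blocks, assuming the record type is the first character and that an ‘H’ record opens a new block:

def blocks(lines)
  result = []
  lines.each do |line|
    rectype = line[0]
    # Start a new block on each household record (or for a leading orphan).
    result << [] if rectype == "H" || result.empty?
    result.last << line
  end
  result
end

data = File.readlines("sample.dat", chomp: true)
blocks(data).each_with_index do |block, i|
  puts "block #{i + 1}: #{block.map { |l| l[0] }.join}"
end

A sketch like this reads everything into memory; at millions of lines, real processing would stream one block at a time.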

Metadata

The metadata maps the input coding scheme to the harmonized output coding scheme for every variable where this is a simple mapping. The metadata contains all categorical values of a project’s input and output variables, even in cases where the conversion from input to output codes must be done in a complex way not expressible as a table.

The metadata also contains the information necessary to extract data from the input file (column start and width, since it is a fixed-width file), along with the corresponding column information for the harmonized data, which is used to produce the harmonized output. The default output file format is hierarchical, fixed-width, 8-bit ISO-8859-1 text. The data conversion program can also output data as CSV, one file per record type.

Metadata is too bulky to reproduce here. Conceptually, there is a many-to-many relationship between census variables and the datasets they belong to. That join is stored in the “translation tables”. The list of variables is in “variables.xml”, the format DCP consumes; it originates from a spreadsheet.

There is one translation table per variable, with information about how to recode that variable for each dataset; this is where the join between variables and datasets comes from.
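
As an illustration only (the real translation tables are XML with a much richer structure), a translation table boils down to a per-dataset code mapping, something like this Ruby rendering of a hypothetical marital-status variable with invented codes:

# Invented codes for a hypothetical marital-status variable; the actual
# translation tables are far richer than a flat mapping.
MARST_TABLE = {
  "us1910a" => { "1" => "6", "2" => "1", "3" => "4" },
  "us1920a" => { "0" => "6", "1" => "1", "2" => "5" },
}

def recode(dataset, input_code)
  # Fall back to "9" (an invented "unknown" code) for unmapped inputs.
  MARST_TABLE.fetch(dataset).fetch(input_code, "9")
end

p recode("us1910a", "2")   # => "1"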

There is also a list of datasets, called “samples.xml”, along with a samples spreadsheet.

This information is stored in a relational database for use by web applications.

DCP produces these outputs:

  • Harmonized data, one dataset produced per process (the main product).
  • Files with frequency tabulations for selected categorical and some continuous variables (illustrated after this list).
  • Files of customized cross tabulations of the data.
  • Optionally, CSV files and DDL for loading the CSV data into a database.
  • A log file listing non-fatal warnings and detailed error messages, as well as running time and performance statistics.
  • A layout of the original input data (for convenience).
  • A layout of the output file (it adheres to a known standard, but is convenient to have here).
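
The frequency files are conceptually simple: counts of each code for a variable. Here is a minimal Ruby sketch over one fixed-width column (the file name and column position are invented for illustration):

def frequencies(path, start, width)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    counts[line[start, width]] += 1
  end
  counts
end

frequencies("us1910a.dat", 54, 2).sort.each do |code, n|
  puts "#{code}  #{n}"
end

A real tabulation would also filter on record type first, since in hierarchical data a column only lines up within one record type.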

How to Use

We have installed DCP in the standard public area: /pkg/ipums/programming. The file ‘dcp’ located there is a Ruby script that invokes the DCP binary and then calls a Ruby utility to produce SPSS, SAS, and Stata files for reading in the DCP harmonized output, so that our researchers can immediately check the data as they develop new datasets.

So, after logging into any compute node (frack, frell, shiny, angus, etc.), and assuming you have /pkg/ipums/programming in your path, you do:

shiny%> dcp datasetName projectName

for example:

shiny%> dcp cr1973a ipumsi

DCP looks for the ‘dcp.conf’ file in /pkg/ipums/programming. That file contains mappings from each supported project to its corresponding metadata in the project work area. Each data project has a shared work area, like so:

/pkg/ipums/usa
/pkg/ipums/ipumsi
/pkg/ipums/cps
....

The latest working version of the metadata is in /pkg/ipums/<project>/metadata.

You can substitute a different copy of metadata simply by changing the path in the ‘dcp.conf’ file or by adding a new configuration to the file.
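
For illustration only (the actual dcp.conf syntax may differ), each entry is conceptually a mapping from a project name to its metadata path:

usa:    /pkg/ipums/usa/metadata
ipumsi: /pkg/ipums/ipumsi/metadata
cps:    /pkg/ipums/cps/metadata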

The input data for DCP is also in the project work areas. Its location is given in the /pkg/ipums/<project>/metadata/control_files/constants_alt.xml file, as are all other configurable paths. If you move or copy metadata to a new location you must change those paths accordingly. DCP reads those paths, but not all other utilities do; the locations of metadata can be expected to follow conventions, and we ought to phase out the path configuration.

DCP output goes into /pkg/ipums/<project>/output_data/current. Under 'current' we have the datasets themselves (the *.dat files) and a 'frequencies' subdirectory full of directories, one for each dataset in the project. There is also a 'log' directory with one log file per dataset, from the last run of DCP on each of those datasets.

For a quick check of the data you can diff all the frequency files under ‘current/frequencies/<dataset>/' against a previously saved collection to see what has changed.
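
For example (the saved-copy location here is hypothetical):

shiny%> diff -r /pkg/ipums/usa/output_data/current/frequencies /pkg/ipums/usa/output_data/frequencies_saved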

Run ‘dcp’ with no arguments for brief help and a full list of options.

For datasets requiring missing data imputation you need to supply a third argument: the path to a data allocation script. These live in ‘/pkg/ipums/<project>/metadata/scripts/'.

shiny%> dcp us1910a usa /pkg/ipums/usa/metadata/scripts/us1910a_usa.das

The *.das files contain detailed allocation table definitions that are interpreted at startup. They also determine the content of the saved donor files.