Design Considerations

  • Must build and run in the future
  • Run in the RDC with no internet and few tools
  • Build using open source tools
  • Fast execution
  • Catch input problems quickly
  • Catch data editing rule errors ASAP
  • Apply editing and allocation methods that can be understood and documented clearly
  • Handle very large input data
  • Handle very small input data with minimum overhead

Design Overview

At a high level DCP has two parts: The core application and the data manipulation (editing) rules. The core application applies the data manipulation rules to input data to produce harmonized data as output.

As described in the previous section, DCP performs a number of distinct manipulations on census or census-like microdata. These operations aren’t independent; they must be done in order. All of the following steps can be thought of as part of the data manipulation rules applied to the data.

| extract raw data | -> | recode and clean source data | ->
| recode source data to harmonized scheme | -> | check basic consistency | ->

| edit complex variables | <-> | allocate missing cases | <-> | construct synthetic variables |

Simple metadata (translation tables) control the recoding steps, while editing rules control the editing and constructed-variable creation. Often a change to a translation table has cascading effects (naturally, or there would be no point in making the change), so all the above steps must be performed when developing new datasets. Separating the steps into individual applications would have bad performance implications, so it’s most practical to keep all these functions fairly tightly integrated in one application while providing toggles for the post-recoding steps. A clear data change log / audit trail would be very useful as well.
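
Conceptually, each recoding step is just a table lookup. Here is a minimal sketch, assuming a translation table is nothing more than a code-to-code mapping (TransTable and recodeOne are invented names for illustration, not actual DCP classes):

    #include <string>
    #include <unordered_map>

    // Invented sketch: the real translation tables also carry labels, universes,
    // and per-dataset columns, but the core operation is a code-to-code lookup.
    using TransTable = std::unordered_map<std::string, std::string>;

    // Recode one raw value, falling back to a designated "missing" code when
    // the input value has no entry in the table.
    std::string recodeOne(const TransTable &table, const std::string &rawValue,
                          const std::string &missingCode)
    {
        auto it = table.find(rawValue);
        return it != table.end() ? it->second : missingCode;
    }

In DCP this kind of lookup happens twice per data item: once to recode the raw value into the source coding scheme and once to recode the source value into the harmonized scheme.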

The editing rules, while technically part of DCP, are not really part of the application proper but rather domain-specific rules that could be expressed in any general-purpose programming language. While 50% to 80% of the rules are simple enough to put into a rules engine, the remaining ones are too complex. Currently they are written in a sort of internal DSL embedded in C++.

Replacing the C++ rules with a scripting language such as Lua would have some good and some bad outcomes. The good outcome would be a cleaner separation of application and business logic: the application could be replaced, and as long as the replacement interpreted the editing rules, that valuable work would be preserved. Building the application would also be separated from adding new rules, though that step isn’t especially burdensome. The bad outcome is that we would lose compilation of the rules, which turns out to be a great QA step in rule development that we currently get for free. We want to fail immediately instead of at the last possible moment, when problems can get ignored or missed. Also, coding mistakes in interpreted languages are discovered at runtime (barring simple syntax errors), and they are less localized to the part of the code where the real problem originates, which can be confusing for inexperienced programmers.
I am mixing the concept of interpreted languages with that of dynamically typed languages, but the two normally go together. We would have few difficulties with an interpreted language such as Blaise (a Pascal variant used by Statistics Netherlands), but we are getting into the realm of exotic scripting languages at this point.

For a more in-depth discussion see the Conversion Steps document in the docs directory of the DCP source code.

Parts of the Application

| Setup code |

| Converter (main app) |

| Data Model |
	| Data Block Reader and Writer |
	
| Metadata model |
	| Microdata Variable Info |
	| Sample Info |
	| Project-wide Info |
	
| Rules API | 	
	| Rules for Each project |
	
| Missing data imputation |
	| Allocation Table storage |
	| Allocation table parser |
	| Missing data API |
	
| Other output |
	| Frequencies writer / tabulator |
	| Stats files generation |
	| DDL generator |
	| Binary output |
	| CSV output |
	| Logging performance stats |
	
| Metadata Parsing |
	| XML Parser |
	| Trans table parser |
	| Variables Info parser |
	| Problem checking |

Originally the application was written in C-style C++, with a few namespaces and classes used as code organization mechanisms. Over the years more classes have been added and the old code has been refactored.

The “Setup code” handles parsing command-line arguments, reading config files, and launching the metadata parsers.

The main Converter class determines the set of rules to be invoked, creates them and handles the data input, conversion and output.

The “Data model” refers to the way the input and output data are modeled and used by the rules API and the HouseholdReader and HouseholdWriter classes. The model is just sets of arrays of data; in the classic version of DCP they represent only household and person records, so the model is quite simple.
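
As a rough, invented illustration (these struct and member names are not the actual DCP classes), the classic household/person model amounts to something like:

    #include <string>
    #include <vector>

    // Invented sketch of the classic two-record-type data model: one set of
    // household variable values plus one row of person variable values per
    // person in the household, each indexed by variable number.
    struct HouseholdBlock {
        std::vector<long> householdValues;           // one slot per household variable
        std::vector<std::vector<long>> personValues; // [person][variable]
        std::string rawInput;                        // the raw input lines for the block
    };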

The “Metadata Model” is the set of classes that model the project metadata. These all have corresponding parsers in the parsing code. The VariableMetadata class contains pointers to all the other types of metadata. It’s used to check metadata consistency and as a means of passing a single pointer around for all the metadata.
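
In outline (member names invented for illustration; the real ones are in the DCP source), VariableMetadata acts as a single handle for all the metadata:

    // Sketch only: forward declarations stand in for the real metadata classes.
    class MicrodataVariableInfo;
    class SampleInfo;
    class ProjectInfo;

    class VariableMetadata {
    public:
        MicrodataVariableInfo *variables; // per-variable metadata
        SampleInfo            *samples;   // per-sample metadata
        ProjectInfo           *project;   // project-wide settings

        // Cross-check the pieces against each other before conversion starts.
        bool consistent() const;
    };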

The BasicRules class (the rules API) contains all the rules API methods and many related helpers. All project-specific rules are derived from this class.
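
In outline, a project’s rules class looks something like this (UsaRules is a made-up name, and the stub stands in for the real BasicRules in the source):

    // Stub standing in for the real BasicRules base class, which provides the
    // rules API: rc(), universe(), missing(), allocate(), donate(), setValue(), ...
    class BasicRules { };

    // A project rules class adds one method per project variable; see the
    // editing rule examples later in this document.
    class UsaRules : public BasicRules {
    public:
        int trailer(int ln);
        int language(int ln);
        // ... one method per variable in the project ...
    };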

DataBlockReader and Writer are interfaces that are implemented by HouseholdReader and HouseholdWriter. The Reader class also handles wrangling non-standard input formats, most of which no longer exist.

The HouseholdWriter class handles outputting the hierarchical format as well as CSV. The binary output is handled by a separate library.

Editing Rule Organization

The editing rules fall into two broad categories:

  • Rules where the variable to be changed is known. The rule determines what, if any, change to make.
  • Rules where the variable to be changed is determined by the rule. The outcome is to mark the chosen variable’s value in the data as “missing”.

The first type are variable specific logical edits and variable construction. The second type are generally consistency checks.

Consistency checks must precede variable-specific edits. They produce two possible outcomes: logical edits to the data, or a determination that data are bad / missing. Variable-specific edits will, among other things, impute missing cases, some of which may have been identified in the consistency checks.
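
As a purely illustrative sketch (the Person struct and category codes below are invented, not DCP’s data model or rules API), an AGE/MARST consistency check might look like:

    #include <vector>

    // Illustrative only: stand-in data model and codes, not DCP's real classes.
    struct Person { int age; int marst; };

    const int MARRIED = 1;       // hypothetical category code for "married"
    const int MISSING_MARST = 9; // hypothetical "missing" code for MARST

    // A consistency check decides which variable to disbelieve: a child under
    // 12 reported as married keeps AGE, and MARST is marked missing so the
    // variable-specific MARST edit can allocate a value for it later.
    void checkAgeMarst(std::vector<Person> &people)
    {
        for (Person &p : people)
            if (p.age < 12 && p.marst == MARRIED)
                p.marst = MISSING_MARST;
    }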

For each block of (household) data we do:

Recode data -> data repair -> consistency checks -> variable-specific edits -> final cleanup and quality checks

For example, with a core set of variables (MARST, SEX, AGE, RELATE, OCC, GQ), the pseudo-code for the process would go like this:

recode-everything()

consistency_check()
	check AGE, RELATE
	check AGE, MARST
	check SEX, RELATE, OCC
	check RELATE, GQ
	
edit_in_order()        
	edit GQ
	edit MARST
	edit AGE
	edit RELATE
	edit OCC

	construct MONLOC, POPLOC, SPLOC

	construct variables requiring MOMLOC, POPLOC, SPLOC

The recode-everything() function is mythical; logically you can think of it as the first stage in converting a data block, but the recoding process is actually handled just-in-time.

See the section on the Rules API for more on how the edits and checks are implemented.

Time Complexity

After startup time, execution time is linear in the number of variables times the number of records in the output data. Startup time can be significant, and can be very large if not designed carefully. In the DCP logs we calculate the number of “data items” converted per second, which is the measure of overall speed for a run of DCP.

Since time is O(n), n is the interesting thing:

n = C * (v_a * r_a + v_b * r_b + … + v_n * r_n)

a, b, … n are the different record types in the hierarchical blocks of data. Variables can only belong to one record type. v_a is the number of variables for record type a and r_a is the number of records of type a.

The constant C factored out is the thing we can control to some extent. It is the overhead needed to process one case of one variable.
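
As a worked example with made-up numbers: a sample with 60 household variables, 1,000,000 household records, 120 person variables, and 2,500,000 person records gives n = C * (60 * 1,000,000 + 120 * 2,500,000) = C * 360,000,000 data items. Halving C halves the runtime just as surely as halving the number of records, which is why the per-item overhead listed below matters.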

Some variables take a lot more time to compute than others but it averages out.

I/O takes a share of the time, but in a single process it is no more than 10% of total runtime when using a fixed disk.

There’s a certain minimal overhead for processing each case of each variable (a “data item”). This is something like:

  • A hash lookup of the variable by name: This can be optimized or avoided
  • Extract raw data: Some sort of substr() operation, possibly optimized
  • Recode input value into source data: A hash lookup plus possible formatting
  • Recode source value into integrated value: Hash lookup plus possible formatting
  • Any additional editing on value: Mostly simple math and looping
  • Donating values (when enabled): A hash lookup on each predictor
  • Allocating a missing case (when present): A hash lookup on each predictor
  • Final formatting into output buffer: Justification and replace() string operations

These steps can be sped up by avoiding passing large raw or string data by value. Most of the data items are just four- or eight-byte numbers, though. There are some unnecessary string allocations in the HouseholdWriter formatting code which, if avoided, could improve performance a bit. Missing data allocation takes as much time as the rest of the steps. It’s only used heavily in the IPUMS-USA project on the historical samples (1850-1930).
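
As a minimal illustration of the kind of change meant here (extractField is a made-up function, not actual DCP code):

    #include <cstddef>
    #include <string>

    // Passing the raw record by value copies the whole string on every call.
    long extractFieldSlow(std::string rawRecord, std::size_t start, std::size_t width)
    {
        return std::stol(rawRecord.substr(start, width));
    }

    // Passing by const reference avoids that copy; the substr() call still
    // allocates a temporary, so parsing in place would be a further step.
    long extractFieldFast(const std::string &rawRecord, std::size_t start, std::size_t width)
    {
        return std::stol(rawRecord.substr(start, width));
    }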

The hash lookups can be done with unordered hash maps. The map<> template is an ordered map where the keys are kept in sorted order, so lookup time is about lg(n) for n keys. Use unordered_map<> whenever it provides enough features, to get the best performance. In some cases there are lots of keys in these hashes: the number of keys will be the number of variables on a particular record type or the number of categories in a variable, and both can run to the thousands.
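
A minimal comparison of the two lookup styles (illustrative only, not DCP code):

    #include <map>
    #include <string>
    #include <unordered_map>

    // std::map keeps its keys sorted, so each lookup costs O(lg n) comparisons;
    // std::unordered_map hashes the key for O(1) average lookups, which matters
    // when the lookup happens once per data item and the table holds thousands
    // of variable names or category codes.
    int columnOfOrdered(const std::map<std::string, int> &cols, const std::string &name)
    {
        auto it = cols.find(name);
        return it != cols.end() ? it->second : -1;
    }

    int columnOfUnordered(const std::unordered_map<std::string, int> &cols, const std::string &name)
    {
        auto it = cols.find(name);
        return it != cols.end() ? it->second : -1;
    }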

The Data Editing API

The types of edits required vary a lot from one type of data to another. The USA data require the most edits, partly because they are the best studied, so more is known about how to improve them. Other historical census data would be subject to similar improvement if time were available.

Here’s an overview of what the edits accomplish: Introduction to data editing and allocation

Here’s an English summary of what the USA rules set does: Edit and Allocation Procedures for the Minnesota Samples

Editing rules live in classes derived from BasicRules, which provides lots of methods for working with the data models.

Each rules class contains one method per variable in a project, so that variables may reference each other and the code will be easy to read.

Here’s a template for an editing rule:

return-type variable-name(int record-number){
	quick-return-statement

	universe-statement

	// [separate with a switch statement if applying to more than one dataset]
	logical-edits

	missing-data-allocation-block

	post-edit-consistency-check  // [rarely needed]

	preserve value and return
}

Some actual example code for a theoretical variable:

    long examplevar(int ln){
        // Return the cached result if this variable was already computed
        if (calculated[ln][ipumsusa_examplevar])
            return result[ln][ipumsusa_examplevar];

        // Start from the recoded input value
        long a = rc(ipumsusa_examplevar, ln);

        universe(age(ln) < 18);
        if (inUniverse){
            // Logical edit: a particular occupation forces a specific value
            if (occ1950(ln) == 101){
                a = 12;
                setValue(ln, ipumsusa_examplevar, a);
            }
            if (doAllocation[ipumsusa_examplevar]){
                if (missing(ipumsusa_examplevar, a)){
                    a = allocate(ipumsusa_examplevar, ln);
                }else{
                    donate(ipumsusa_examplevar, ln, a);
                }
                setFinalResult(ln, ipumsusa_examplevar, a);
            }
        }else{
            a = 99;  // out-of-universe (N/A) code
        }
        result[ln][ipumsusa_examplevar] = a;
        return a;
    }

Here’s an example from the IPUMS-USA project. The TRAILER variable needs no logical edits. The code is just a framework for managing the allocation of missing cases and donating good cases:

    int trailer(int ln) {
        int a=rc(ipumsusa_trailer,ln);

        if (dataSet==US1960s)
        {
            universe(rc(ipumsusa_trailer,ln) != 0);

            if (inUniverse)
            {
                if (doAllocation[ipumsusa_trailer])
                {
                    if (missing(ipumsusa_trailer,a))
                    {
                        a = allocate(ipumsusa_trailer,ln);

                        if (a==MISSING)
                        {
                            a=allocate(ipumsusa_failtrailer,ln);
                        }
                    }
                    else
                    {
                        if (vacinput(0)==1 || vacinput(0)==3)
                        {
                            donate(ipumsusa_trailer,ln,a);
                            donate(ipumsusa_failtrailer,ln,a);
                        }
                        else
                        {
                            donate(ipumsusa_failtrailer,ln,a);
                        }
                    }
                }
            }
        }
        return a;
    }

The code for LANGUAGE (abridged):

    int language(int ln)
    {
    	// Get recoded version of LANGUAGE
        int a=rc(ipumsusa_language,ln);

        // if they are born in the US and their parents are born in the US their language is English
        if (available[ipumsusa_bpl] && available[ipumsusa_mbpl] && available[ipumsusa_fbpl])
            if (bpl(ln)<=9900 && mbpl(ln)<=9900 && fbpl(ln)<=9900)
              setValue(ln, ipumsusa_language, 100);
            
        switch(dataSet){
        // ... universe only edits ...
        case PUMS1910:
        case HISP1910:
        case US1910:
            universe(age(ln)>=10);
            break;
        case PR:
        {
            if (dataSet == US1920b)
                universe(age(ln)>=10);
            if (dataSet == US1910h)
                universe(age(ln)>4);
        }
        break;
        default:
            universe(age(ln)>=5);
        }


        if (doAllocation[ipumsusa_language])
        {
            // universe language:
            // Previously identified as missing by our code.
            if (inUniverse && a==0) a=MISSING;
            if (!inUniverse) a=0;

            if (inUniverse)
                if (missing(ipumsusa_language,a))
                {
                    if (available[ipumsusa_mtongue] && !missing(ipumsusa_mtongue,mtongue(ln)))
                    {
                        a=mtongue(ln);
                        imputed[ln][ipumsusa_language]=3;
                    }

                    if (available[ipumsusa_fmtongue])
                        if (mbpl(ln)<=9900)
                        {
                            a=fmtongue(ln);
                            imputed[ln][ipumsusa_language]=3;
                        }

                    // No "tongue" variables available:
                    a=allocate(ipumsusa_language,ln);
                } // if missing
                else
                    donate(ipumsusa_language,ln,a);
        } // allocation

        return a;
    }
        	

DCP Data Editing Guide