A couple of recent developments have prompted a refactoring of the DCP and the addition of new features.

  • New data projects at MPC require DCP to harmonize datasets with more complex structure than simply household-person. The original model, while “sheer elegance in its simplicity,” doesn’t scale.

  • With a few large datasets in several different projects, including internal U.S. census data, we’d like the option to convert data in parallel within one dataset for faster turnaround during data development.

Complex Data

The ATUS, IHIS and DHS projects all have data organized into hierarchies with more than just people belonging to households. This has implications not only for the data model and reading and writing data but for the data editing API as well.

The new data reader parses the input data in a simple recursive descent fashion, producing records linked together exactly as they are linked logically in the data. For the ATUS project this looks like:

HOUSEHOLDS     -> household PERSON-LIST

PERSON-LIST    -> person ACTIVITY-LIST
                  person ACTIVITY-LIST
                  ...
                  last-person ACTIVITY-LIST

ACTIVITY-LIST  -> activity WITH-WHOM-LIST
                  activity WITH-WHOM-LIST
                  ...
                  last-activity WITH-WHOM-LIST

WITH-WHOM-LIST -> who
                  who
                  ...
                  who

You end up with a tree for each household, which is easy to reason over. Each node of the tree holds one record from the input data. DCP transforms that record into an output record, then emits the tree of transformed records, applying a formatting pass as the final step.
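In code, each node only needs to own its input record, its transformed output, and its children. Here is a minimal sketch under assumed names (RecordNode, transformRecord, and formatRecord are illustrative, not the actual DCP types):

    // Minimal sketch of a household tree; names are illustrative only.
    #include <memory>
    #include <ostream>
    #include <string>
    #include <vector>

    std::string transformRecord(const std::string& input);   // hypothetical: apply edit rules to one record
    std::string formatRecord(const std::string& output);     // hypothetical: final formatting pass

    struct RecordNode {
        std::string inputRecord;                              // one record from the input data
        std::string outputRecord;                             // filled in by the transform
        std::vector<std::unique_ptr<RecordNode>> children;    // persons, activities, whos, ...
    };

    // Transform every record in the household tree in place.
    void transformTree(RecordNode& node) {
        node.outputRecord = transformRecord(node.inputRecord);
        for (auto& child : node.children)
            transformTree(*child);
    }

    // Emit the transformed tree depth-first, formatting as the final step.
    void emitTree(const RecordNode& node, std::ostream& out) {
        out << formatRecord(node.outputRecord) << '\n';
        for (const auto& child : node.children)
            emitTree(*child, out);
    }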

New Data Editing API

At minimum, the new, more complex record structure requires a way for data editing rules to access that data. Abstracting the data access is a basic starting point. The old model simply held arrays of data, like

household = [[hh_var1, hh_var2, hh_var3 ...],
             [person_var1, person_var2, person_var3 ...],
             ...
             [ ... last person ...]]

There are parallel arrays for a working data buffer, editing flags, and raw data. Using arrays like this is very fast. Record 0 in the household array holds the data from the household record, and records 1 through n hold data for people 1 through n. That is convenient because all location values in the census and IPUMS variables use 1-based person numbering, so everything “just works.” As nice as this scheme is, it doesn’t extend to more than two record types.
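For illustration only (these are not the actual DCP structures), the old scheme amounts to something like:

    // Sketch of the old two-level model: parallel vectors for working data,
    // editing flags, and raw values; household record at index 0, people at 1..n.
    #include <vector>

    std::vector<std::vector<int>> household;   // working data buffer
    std::vector<std::vector<int>> flags;       // editing flags
    std::vector<std::vector<int>> raw;         // unedited input values

    // 1-based person numbering lines up with census/IPUMS location values
    // because household[0] is the household record itself.
    int personValue(int personNumber, int varIndex) {
        return household[personNumber][varIndex];
    }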

The existing data editing API suffers from insufficient modularization of the code (data editing rules) written against it. The initial design improved on the FORTRAN system, which put all edits for a given dataset into one file; under that system a great deal of identical code was repeated across datasets, and updating it correctly in every dataset module was error-prone. The model that immediately succeeded the FORTRAN version puts all edits for a given variable into one C function, arranging those functions into a single namespace per project and then a class. That served us well enough when the project only contained around 15-20 census samples, but as datasets were added each function grew and the whole collection of code got huge. The code management issue is mitigated by a code generation system, so few human errors make their way into the code compilation phase, but the process is more complex than it ought to be.

The new API will organize code on a variable-by-variable basis, deriving each variable’s editor from an ‘Editor’ class. Editors will have access to instances of each other; each one has permission to edit its own data and to read any other editor’s data directly.

Using C++ functors we can access other class instances as if they were functions, while maintaining state and retaining access to the less frequently used methods on those objects.

    namespace ProjectRules {

    class Marst : public VariableEditor {
    public:
        static const int MARRIED_SPOUSE_ABSENT;           // category code, supplied by code generation

        Marst(RecordPtr record) : VariableEditor(record) {}

        int edit(int recordNumber) {
            int marst_value = recoded(recordNumber);      // start from the recoded value

            // relate() reads the Relate editor's value for this person
            if (relate(recordNumber) == Relate::INMATE)
                marst_value = Marst::MARRIED_SPOUSE_ABSENT;

            marst_value = allocate_if_missing(marst_value);

            setData(marst_value, recordNumber);
            return marst_value;
        }

        bool universe(int recordNumber) {
            return age(recordNumber) > 15;
        }

        int allocate_if_missing(int marst_value) {
            if (missing(marst_value))
                marst_value = allocate();                 // take a value from a donor
            else
                donate(marst_value);                      // offer this value as a donor

            return marst_value;
        }
    };

    } // namespace ProjectRules

As you can see, category codes get added to the class as static values, possibly through code generation.
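The functor access could come from the base class itself: an operator() that returns the variable’s current value for a record, so that relate(recordNumber) or age(recordNumber) reads another editor’s data. A sketch, assuming names the document doesn’t actually spell out:

    // Hypothetical shape of the base class; the real VariableEditor may differ.
    struct Record;                 // placeholder for DCP's record type
    using RecordPtr = Record*;     // handle passed to each editor

    class VariableEditor {
    public:
        explicit VariableEditor(RecordPtr record) : record_(record) {}
        virtual ~VariableEditor() = default;

        // Functor call: lets other editors read this variable like a function,
        // e.g. relate(recordNumber) or age(recordNumber).
        int operator()(int recordNumber) const { return getData(recordNumber); }

        virtual int getData(int recordNumber) const = 0;        // anyone may read
        virtual void setData(int value, int recordNumber) = 0;  // only the owning editor writes

    protected:
        RecordPtr record_;
    };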

Parallel Execution

Since most time-consuming uses of DCP involve running as many datasets simultaneously as possible, parallel execution isn’t the most important speed improvement to make. Often a data project manager wants to produce all new datasets for a project in order to incorporate new variables across datasets. Reproducing all this data is limited by the number of available cores in a cluster. Assuming the jobs are queued efficiently, all compute resources get used and it doesn’t matter exactly what order the data gets converted in. Simply making every process as fast as possible would be most beneficial; that, though, is a straightforward code optimization problem.

A second use case is developing one particular dataset. When that dataset is especially large, repeated iterations slow the data development process, which requires a lot of human judgment and frequent regeneration of the entire dataset. This is where a parallel execution option would be helpful.

Currently DCP supports a simple parallel processing model via a customized ‘make’ file. It’s not ideal because the calls to DCP must be constructed by hand in advance, splitting the input data into ranges of households. Also, the input data isn’t indexed, so DCP must scan through the data to find a starting point (though this is relatively fast compared to converting data).

Even if we implemented a very elegant version of this parallel processing feature it would be limited to one machine. In many cases a 16-core machine would still be a great improvement, but it would be nice to take advantage of a distributed platform like Hadoop.
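Within a single machine, the in-process version could be as simple as carving the household range into chunks and converting each chunk on its own thread. A rough sketch, with convertHouseholdRange standing in for whatever the per-range conversion entry point turns out to be:

    // Rough sketch of one-machine parallelism over household ranges.
    #include <algorithm>
    #include <thread>
    #include <vector>

    void convertHouseholdRange(long first, long last) {
        // hypothetical: run the normal conversion over households [first, last)
    }

    void convertInParallel(long householdCount, unsigned workers) {
        std::vector<std::thread> threads;
        long chunk = (householdCount + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w) {
            long first = static_cast<long>(w) * chunk;
            long last = std::min(householdCount, first + chunk);
            if (first >= last) break;
            threads.emplace_back(convertHouseholdRange, first, last);
        }
        for (auto& t : threads) t.join();   // wait for all ranges to finish
    }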

A complication in the parallel execution feature is that while simple data conversion is totally parallel and stateless, the missing data allocation feature is not: DCP maintains state throughout its execution, holding the last-seen donors matching the predictors on each allocation table. One of the points in the DCP refactoring plan addresses this problem. We can make the allocation feature stateless by introducing a donor gathering phase (which would only need to be re-run when allocation table definitions or input data change). We could build the stateless allocation feature on a sophisticated key-value store or a database.
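As a sketch of the donor-gathering idea, with an in-memory map standing in for the eventual key-value store or database and illustrative names throughout:

    // Phase 1 gathers donors keyed by their predictor values; phase 2 can then
    // allocate from the store without depending on conversion order.
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Key: concatenated predictor values for one allocation table.
    // Value: donor values observed for that predictor combination.
    using DonorStore = std::unordered_map<std::string, std::vector<int>>;

    // Donor gathering phase: re-run only when allocation table definitions
    // or input data change.
    void gatherDonor(DonorStore& store, const std::string& predictorKey, int donorValue) {
        store[predictorKey].push_back(donorValue);
    }

    // Stateless allocation: any worker can look up a donor for a recipient's
    // predictor key, regardless of what it has converted so far.
    int allocateFromStore(const DonorStore& store, const std::string& predictorKey, int fallback) {
        auto it = store.find(predictorKey);
        if (it == store.end() || it->second.empty())
            return fallback;            // no donor gathered for these predictors
        return it->second.front();      // simplistic choice; a real version might randomize
    }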

Further Notes:

Redmine task for the major enhancements