DCP-Multi

Summary of new features and some code examples.

Details below the fold.

New Capabilities

  1. Handle any number of record types in one input data source
  2. Allows easy logical edits on multiple record types at once; rules can reason over a hierarchy of records
  3. Robust data recoding and editing runtime core with consistent error handling
  4. Extensible recoding and editing API
  5. No manual changes needed to DCP-Multi when new datasets and variables get added
  6. Can attach “parasitic” data creation, such as custom cross-tabulations, to the main output

Improvements

  1. About 80% faster than DCP
  2. Requires less memory than DCP
  3. Easier to produce alternative output data formats
  4. Easy to install in a streaming architecture with standard I/O instead of files
  5. Data editing code is organized on a per-variable basis, with an alternate per-task basis where useful

Software Architecture

Some goals of the design:

  • Run quickly on small datasets
  • Run as fast as possible overall
  • Produce best possible error messages for rapid testing of user code
  • API to separate user code from application code; easy to read user code
  • Can run in an isolated environment on one host if necessary
  • Can run as part of a parallel job on one host or a cluster of hosts in a generic fashion
  • Don’t depend on an external infrastructure for parallelism, but can run inside one such as Hadoop or similar
  • Can be configured to run as a pipeline (instead of in monolithic mode), for example: reader | recoder | editor | formatter | reducer
  • Can be built in the future with a standard version of the language (C++11 / C++14)
  • Remove the single-process restriction that missing data imputation imposed on DCP, so that missing data can be imputed in parallel
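
The pipeline goal above can be pictured as a chain of stages, each taking a unit of data and handing the result to the next stage. The following is only a minimal sketch of that idea: `Household`, `Stage`, `namedStage` and `runPipeline` are hypothetical names invented for illustration and are not DCP-Multi's actual classes.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unit of data: one household as a list of raw record lines.
struct Household {
    std::vector<std::string> records;
    std::vector<std::string> log;  // names of the stages that have run so far
};

// A stage takes a household and returns the transformed household.
using Stage = std::function<Household(Household)>;

// Build a stage that only notes that it ran. A real stage would recode,
// edit, or format the records at this point.
Stage namedStage(const std::string& name) {
    return [name](Household h) {
        h.log.push_back(name);
        return h;
    };
}

// Monolithic mode: run every stage in order inside one process.
Household runPipeline(const std::vector<Stage>& stages, Household h) {
    for (const Stage& stage : stages) {
        h = stage(std::move(h));
    }
    return h;
}
```

In pipeline mode each stage would presumably be its own process connected through standard I/O; the same stage functions could then be reused with a thin read/write wrapper around them.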

Parts of the application can be built independently. Code is organized around distinct concerns of the application:

  • editing API: Interface between the editing/recoding runtime core and user created editing rules
  • metadata: Store / model metadata, and produce various DCP-produced metadata (frequencies, YML, etc.)
  • data: Model units of data (households), model record type relationships, track flags for recoding, editing, imputation
  • IO: Handle input and output of data units
  • imputation: Store donor cases and allocate to missing cases, parse allocation scripts
  • classic: Code required only for the classic DCP
  • libs: Miscellaneous support code – string utils, XML, CSV, SQLite, stack traces
  • database: Abstraction for reading and writing metadata to a database
  • conversion: Main code for DCP-Multi, coordinates all the pieces
  • parsing: Format specific code like XML and CSV parsing and config file loading
  • test: Unit and integration tests
  • rules: The user created code, broken out by data product. Both classic and DCP-Multi rules live here.
  • supporting utilities: Input file splitter for simple parallelism, input file indexer for better parallelism
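
To make the role of the data module more concrete, here is a small sketch of how a unit of data with several record types might be modeled as a tree (household, persons, activities). The `RecordNode` type and both helper functions are assumptions made for illustration; the real data module is not shown in this document and may be organized quite differently.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical generic record: a record type tag plus the records one level down.
struct RecordNode {
    std::string type;                                    // e.g. "H", "P", "A"
    std::vector<std::shared_ptr<RecordNode>> children;   // child records
};
using RecordNodePtr = std::shared_ptr<RecordNode>;

// Create an empty record of the given type.
RecordNodePtr makeRecord(const std::string& type) {
    auto r = std::make_shared<RecordNode>();
    r->type = type;
    return r;
}

// Count the records of one type at or below root by walking the hierarchy.
std::size_t countOfType(const RecordNodePtr& root, const std::string& type) {
    std::size_t n = (root->type == type) ? 1 : 0;
    for (const RecordNodePtr& child : root->children) {
        n += countOfType(child, type);
    }
    return n;
}
```

Because the tree is generic, adding a further record type below activities would not require changing the traversal code.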

The classic module depends on shared code: parsing, metadata, rules, libs. No other modules depend in any way on classic. The classic module builds the dcp binary which converts IPUMS-USA, CPS, IPUMSI and NAPP.

The conversion module builds the main program for DCP-Multi; the binary is called “dcp_multi”. It converts DHS, ATUS, AHTUS, IHIS and SESTAT. A version of IPUMS-USA is also supported.

The build system uses rake rather than make. Run

rake -T

to get a list of tasks; most are named after modules, and running a task builds that module. You can build a single rules module for quick feedback on user-created code:

rake rules_ahtus

To build DCP-Classic:

rake dcp_classic

To install it:

rake install_dcp_classic

To build DCP-Multi:

rake dcp_multi

Sample User Code

Simple data formatting


class Act_start : public Editor {
public:
    Act_start(VarPtr varInfo):Editor(varInfo) {}

    void edit() {

        // Left-pad the string with zeros because to_string() would otherwise
        // return just "0" when CLOCKST is 0.
        long clockst_long = CLOCKST() == MISSING ? 0 : CLOCKST();

        string clockst = to_string(clockst_long);
        clockst = rightJustify(clockst, 4, '0');

        const string hours = clockst.substr(0, 2);
        const string minutes = clockst.substr(2, 2);
        const string act_start = hours + ":" + minutes + ":" + "00";
        setData(act_start);
    }

};
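
The formatting logic in Act_start can be tried outside the DCP-Multi runtime. The sketch below assumes that rightJustify() pads on the left with the fill character up to the given width; `rightJustifySketch` and `formatClock` are hypothetical stand-ins written for this example, not the library's own helpers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for rightJustify(): pad s on the left with fill
// until it is at least width characters long.
std::string rightJustifySketch(const std::string& s, std::size_t width, char fill) {
    if (s.size() >= width) {
        return s;
    }
    return std::string(width - s.size(), fill) + s;
}

// Convert a non-negative HHMM clock value (for example 930) to "HH:MM:00".
std::string formatClock(long hhmm) {
    const std::string padded = rightJustifySketch(std::to_string(hhmm), 4, '0');
    return padded.substr(0, 2) + ":" + padded.substr(2, 2) + ":00";
}
```

With this sketch, a CLOCKST of 930 becomes "09:30:00" and a value of 0 becomes "00:00:00". Negative values and values with more than four digits are not handled.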

Slightly more complex editing. Construct the F_CLOCKSTOP variable from existing CLOCKST data:


class F_clockstop : public Editor {
public:
    F_clockstop(VarPtr varInfo):Editor(varInfo) {}

    void edit() {
        int value = getEdited();
        // The editing rule assigns to clockstop the start time of the next episode.
        // For the last episode, historical samples get 12am and ATUS samples get 4am.
        // Use 2800 hour scale:
        long final_f_clockstop = datasetYear() < 2003 ? 2400 : 400;

        vector<ActivityRecordPtr> activityRecords = person()->activities();

        // Ensure we go through the activities in order:
        for (size_t rec = 0; rec < activityRecords.size(); rec++) {
            ActivityRecordPtr activityRecord = activityRecords.at(rec);

            // Last activity record?
            if (rec + 1 == activityRecords.size()) {
                setData(activityRecord, final_f_clockstop);
            } else {
                ActivityRecordPtr nextActivityRecord = activityRecords.at(rec + 1);
                long clockst = CLOCKST(nextActivityRecord);
                setData(activityRecord, clockst);
            }
        }
    }
};
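
Stripped of the record API, the rule in F_clockstop is "each episode stops when the next one starts, and the last episode gets a fixed stop time". A minimal sketch on plain vectors, where `stopTimes` is a hypothetical helper written for this example:

```cpp
#include <cstddef>
#include <vector>

// Given episode start times in diary order, return the stop time of each
// episode: the next episode's start, or finalStop for the last episode.
std::vector<long> stopTimes(const std::vector<long>& starts, long finalStop) {
    std::vector<long> stops(starts.size());
    for (std::size_t i = 0; i < starts.size(); ++i) {
        const bool last = (i + 1 == starts.size());
        stops[i] = last ? finalStop : starts[i + 1];
    }
    return stops;
}
```

Comparing `i + 1` with the size, rather than `i` with `size() - 1`, avoids unsigned wrap-around when the list of episodes is empty.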

Possible approach to family pointers:


#pragma once
#include <functional>
#include <algorithm>
#include <memory>
#include "Relate.h"
#include "Sex.h"



class Momloc:public Editor {
public:
    Momloc(VarPtr varInfo):Editor(varInfo) {}

    void edit() {
        setData(createWithImplicitLoop());

    }

    // Return MOMLOC value for current person only
    int createWithImplicitLoop() {
        int momloc = 0; // People are numbered 1-n
        switch(RELATE()) {
        case Relate::HEAD:
            momloc=findForHead();
            break;
            //case Relate::SPOUSE:momloc = findForSpouse();break;
        case Relate::CHILD:
        case Relate::CHILD_OF_HEAD:
            momloc = findForChild();
            break;
        case Relate::CHILD_OF_PARTNER:
            momloc = findForChildOfPartner();
            break;
            //case Relate::GRANDCHILD:momloc = findForGrandchild();break;

        }
        return momloc;
    }

    int findForHead() {

        // Look for PARENT who is female

        return pernumMatching([&](const RecordPtr person) {
            return Relate::PARENT == RELATE(person) && Sex::FEMALE == SEX(person);
        });

    }

    // A CHILD is the child of the head, so the mom is either the female
    // householder or the spouse of the householder. The unmarried partner could
    // also be the mom, but there is a separate child code for that case.
    int findForChild() {
        return pernumMatching([&](RecordPtr record) {
            return SEX(record) == Sex::FEMALE && (Relate::HEAD == RELATE(record)  || Relate::SPOUSE == RELATE(record));
        });
    }

    int findForChildOfPartner() {
        int mom = 0;
        // A different way to do something similar to findForChild():
        for(RecordPtr person:people()) {
            if (Relate::UNMARRIED_PARTNER == RELATE(person) &&
                    Sex::FEMALE == SEX(person)) {
                mom = PERNUM(person);
                break;
            }
        }
        return mom;
    }

    int pernumMatching(RecordMatches func) {
        RecordPtr person = findPersonMatching(func);
        if (person == nullptr) return 0;
        else return PERNUM(person);
    }

};
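
The Momloc example relies on findPersonMatching(), which is not shown here. One plausible shape for such a helper is a thin wrapper around std::find_if. The sketch below is an assumption about how it could work, using the hypothetical types `PersonSketch` and `PersonMatches` rather than the real RecordPtr and RecordMatches.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical minimal person record, for illustration only.
struct PersonSketch {
    int pernum;
    int relate;
    int sex;
};
using PersonSketchPtr = std::shared_ptr<PersonSketch>;
using PersonMatches = std::function<bool(const PersonSketchPtr&)>;

// Return the first person satisfying the predicate, or nullptr when none does.
PersonSketchPtr findFirstMatching(const std::vector<PersonSketchPtr>& people,
                                  const PersonMatches& matches) {
    auto it = std::find_if(people.begin(), people.end(), matches);
    if (it == people.end()) {
        return nullptr;
    }
    return *it;
}

// PERNUM of the first match, or 0 when nobody matches (people are numbered 1-n).
int pernumOfFirstMatching(const std::vector<PersonSketchPtr>& people,
                          const PersonMatches& matches) {
    PersonSketchPtr person = findFirstMatching(people, matches);
    return person ? person->pernum : 0;
}
```

Passing the matching rule as a lambda keeps each findFor...() function down to the condition itself, which is the readability benefit the example above is aiming for.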

A different approach:


#include <functional>
#include <algorithm>
#include <memory>
#include "Relate.h"
#include "Sex.h"



class Poploc:public Editor {
#include "../helpers/FamilyRelationship.h"

public:

    Poploc(VarPtr varInfo):Editor(varInfo) {}
    void edit() {
        // Examine each person in combination with the current person.
        // First possible match wins.
		for(RecordPtr potentialFather:people()) {
            if (possibleFather(potentialFather)) {
                setData(PERNUM(potentialFather));
                break;
            }
        }
    }

    bool possibleFather(RecordPtr parent) {
        int age_diff =  AGE(parent) - AGE();

        return (age_diff > 17 &&
                SEX(parent) == Sex::MALE  &&
                validFatherRelationship(RELATE(parent),RELATE()));
    }

    bool validFatherRelationship(int adult, int child) {
        // This is a list (vector) of pairs
        vector<pair<int,int>> relate_pairs = {
            //{Relate::HEAD, Relate::CHILD},
            //{Relate::HEAD,Relate::CHILD_OF_HEAD},
            //{Relate::UNMARRIED_PARTNER,Relate::CHILD_OF_HEAD},
            //{Relate::PARENT,Relate::HEAD},
            //{Relate::CHILD,Relate::GRANDCHILD}
        };

        return  relationshipPairMatches(adult, child, relate_pairs);
    }

};
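
The helper relationshipPairMatches() comes from FamilyRelationship.h and is not reproduced in this document. Assuming it simply checks whether the (adult, child) pair appears in the list of allowed pairs, a stand-in could look like the following; `pairMatchesSketch` is a hypothetical name and the real helper may behave differently.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for relationshipPairMatches(): true when the
// (adult, child) pair of RELATE codes appears in the allowed list.
bool pairMatchesSketch(int adult, int child,
                       const std::vector<std::pair<int, int>>& allowed) {
    return std::find(allowed.begin(), allowed.end(),
                     std::make_pair(adult, child)) != allowed.end();
}
```

Under this assumption an empty list never matches, so with every pair commented out as in the example above, no person would be accepted as a father until some pairs are enabled.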