DCP-Multi
Summary of new features and some code examples.
Details below the fold.
New Capabilities
- Handle any number of record types in one input data source
- Allows for easy logical edits on multiple record types at once; can reason over a hierarchy of records
- Robust data recoding and editing runtime core with consistent error handling
- Extensible recoding and editing API
- No manual changes needed to DCP-Multi when new datasets and variables get added
- Can attach a “parasitic” data-creation step, such as custom cross-tabulations, to the main output
Improvements:
- About 80% faster than DCP
- Requires less memory than DCP
- Easier to produce alternative output data formats
- Easy to install in a streaming architecture with standard I/O instead of files
- Data editing code is organized on a per-variable basis, with an alternate per-task basis where useful
Software Architecture
Some goals of the design:
- Run quickly on small datasets
- Run as fast as possible overall
- Produce best possible error messages for rapid testing of user code
- API to separate user code from application code; easy to read user code
- Can run in an isolated environment on one host if necessary
- Can run as part of a parallel job on one host or a cluster of hosts in a generic fashion
- Don’t depend on an external infrastructure for parallelism, but can run inside one such as Hadoop or similar
- Can be configured to run as a pipeline (instead of in monolithic mode), for example: reader, recoder, editor, formatter, reducer
- Can be built in the future with a standard version of the language (C++11/14)
- Remove the single-process restriction that missing-data imputation imposed on DCP, so that missing data can be imputed in parallel
Parts of the application can be built independently. Code is organized around distinct concerns of the application:
- editing API: Interface between the editing/recoding runtime core and user created editing rules
- metadata: Store / model metadata, and produce various DCP-produced metadata (frequencies, YML, etc.)
- data: Model units of data (households), model record type relationships, track flags for recoding, editing, imputation
- IO: Handle input and output of data units
- imputation: Store donor cases and allocate to missing cases, parse allocation scripts
- classic: Code required only for the classic DCP
- libs: Miscellaneous support code – string utils, XML, CSV, Sqlite, stack traces
- database: Abstraction for reading and writing metadata to a database
- conversion: Main code for DCP-Multi, coordinates all the pieces
- parsing: Format specific code like XML and CSV parsing and config file loading
- test: Unit and integration tests
- rules: The user created code, broken out by data product. Both classic and DCP-Multi rules live here.
- supporting utilities: Input file splitter for simple parallelism, input file indexer for better parallelism
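The input file splitter listed above is only useful for parallelism if it never separates the records of one household. The sketch below illustrates that idea under loud assumptions: the function names are hypothetical, and the rule that a household starts at a line beginning with 'H' is invented for the example and is not taken from DCP-Multi's input formats.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption for this sketch: a household starts at a line whose first
// character is 'H'; the lines that follow belong to it until the next 'H'.
bool startsHousehold(const std::string& line) {
    return !line.empty() && line[0] == 'H';
}

// Distribute whole households round-robin over `chunkCount` chunks.
// Lines that appear before the first 'H' line go to chunk 0.
std::vector<std::vector<std::string>> splitByHousehold(
        const std::vector<std::string>& lines, std::size_t chunkCount) {
    std::vector<std::vector<std::string>> chunks(chunkCount == 0 ? 1 : chunkCount);
    std::size_t current = 0;
    bool seenHousehold = false;
    for (const std::string& line : lines) {
        if (startsHousehold(line)) {
            // Move to the next chunk only at a household boundary.
            if (seenHousehold) current = (current + 1) % chunks.size();
            seenHousehold = true;
        }
        chunks[current].push_back(line);
    }
    return chunks;
}
```

Each chunk can then be handed to a separate process. Round-robin is the simplest policy; it does not try to balance chunks by size.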
The classic module depends on shared code: parsing, metadata, rules, libs. No other modules depend in any way on classic. The classic module builds the dcp binary which converts IPUMS-USA, CPS, IPUMSI and NAPP.
The conversion module builds the main program for DCP-Multi; the binary is called “dcp_multi”. It converts DHS, ATUS, AHTUS, IHIS, SESTAT. There’s also a version of IPUMS-USA supported.
The build system uses rake rather than make. Run
rake -T
to list the available tasks; most are named after modules, and running a task builds that module. You can build a rules module on its own for quick feedback on user-created code:
rake rules_ahtus
To build DCP-Classic:
rake dcp_classic
To install it:
rake install_dcp_classic
To build DCP-Multi:
rake dcp_multi
Sample User Code
Simple data formatting
class Act_start : public Editor {
public:
    Act_start(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
        // Treat a missing start time as 0.
        long clockst_long = CLOCKST() == MISSING ? 0 : CLOCKST();
        // Pad to four digits with leading zeros; to_string() alone would
        // return just "0" when clockst is 0.
        string clockst = to_string(clockst_long);
        clockst = rightJustify(clockst, 4, '0');
        const string hours = clockst.substr(0, 2);
        const string minutes = clockst.substr(2, 2);
        const string act_start = hours + ":" + minutes + ":" + "00";
        setData(act_start);
    }
};
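The formatting step in the editor above can be tried outside the DCP-Multi runtime. This is a sketch only: `rightJustify` here is a hypothetical stand-in for the helper of the same name used above, whose real implementation is not shown in this document, and `formatClockStart` is a name made up for the example. It assumes an HHMM value between 0 and 2359 and leaves out the MISSING check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for rightJustify(): pad `text` on the left with
// `fill` until it is `width` characters long.
std::string rightJustify(const std::string& text, std::size_t width, char fill) {
    if (text.size() >= width) return text;
    return std::string(width - text.size(), fill) + text;
}

// Convert an HHMM clock value such as 930 into "09:30:00".
std::string formatClockStart(long clockst) {
    const std::string padded = rightJustify(std::to_string(clockst), 4, '0');
    return padded.substr(0, 2) + ":" + padded.substr(2, 2) + ":00";
}
```

For example, `formatClockStart(0)` yields "00:00:00" and `formatClockStart(930)` yields "09:30:00".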
Slightly more complex editing. Construct the F_CLOCKSTOP variable from existing CLOCKST data:
class F_clockstop : public Editor {
public:
    F_clockstop(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
        int value = getEdited();
        // The editing rule assigns to clockstop the start time of the next episode.
        // For the last episode, historical samples will have 12am and ATUS samples 4am.
        // Use 2800 hour scale:
        long final_f_clockstop = datasetYear() < 2003 ? 2400 : 400;
        vector<ActivityRecordPtr> activityRecords = person()->activities();
        // Ensure we go through the activities in order.
        // size_t avoids a signed/unsigned comparison with size().
        for (size_t rec = 0; rec < activityRecords.size(); rec++) {
            ActivityRecordPtr activityRecord = activityRecords.at(rec);
            // Last activity record?
            if (rec + 1 == activityRecords.size()) {
                setData(activityRecord, final_f_clockstop);
            } else {
                ActivityRecordPtr nextActivityRecord = activityRecords.at(rec + 1);
                long clockst = CLOCKST(nextActivityRecord);
                setData(activityRecord, clockst);
            }
        }
    }
};
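The core rule in the editor above (each episode stops when the next one starts, and the last episode gets a fixed stop time) can be isolated from the record API. The sketch below is an assumption-laden simplification: `clockStops` is a hypothetical name, and plain `long` start times stand in for activity records.

```cpp
#include <cstddef>
#include <vector>

// Given activity start times in episode order, return each episode's stop
// time: the start time of the next episode, or `finalStop` for the last one.
std::vector<long> clockStops(const std::vector<long>& starts, long finalStop) {
    std::vector<long> stops;
    stops.reserve(starts.size());
    for (std::size_t i = 0; i < starts.size(); ++i) {
        const bool isLast = (i + 1 == starts.size());
        stops.push_back(isLast ? finalStop : starts[i + 1]);
    }
    return stops;
}
```

With start times 400, 730, 1200 and a final stop of 400, the result is 730, 1200, 400. An empty input produces an empty result.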
Possible approach to family pointers:
#pragma once
#include <functional>
#include <algorithm>
#include <memory>
#include "Relate.h"
#include "Sex.h"
class Momloc : public Editor {
public:
    Momloc(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
        setData(createWithImplicitLoop());
    }
    // Return MOMLOC value for current person only
    int createWithImplicitLoop() {
        int momloc = 0; // People are numbered 1-n, so 0 means "no mother found"
        switch (RELATE()) {
        case Relate::HEAD:
            momloc = findForHead();
            break;
        //case Relate::SPOUSE:momloc = findForSpouse();break;
        case Relate::CHILD:
        case Relate::CHILD_OF_HEAD:
            momloc = findForChild();
            break;
        case Relate::CHILD_OF_PARTNER:
            momloc = findForChildOfPartner();
            break;
        //case Relate::GRANDCHILD:momloc = findForGrandchild();break;
        }
        return momloc;
    }
    int findForHead() {
        // Look for PARENT who is female
        return pernumMatching([&](const RecordPtr person) {
            return Relate::PARENT == RELATE(person) && Sex::FEMALE == SEX(person);
        });
    }
    // A CHILD is the child of the head, so the mom is either the female
    // householder or the female spouse of the householder. The mom could also
    // be the unmarried partner, but there is a separate child code for that.
    int findForChild() {
        return pernumMatching([&](RecordPtr record) {
            return SEX(record) == Sex::FEMALE &&
                   (Relate::HEAD == RELATE(record) || Relate::SPOUSE == RELATE(record));
        });
    }
    int findForChildOfPartner() {
        int mom = 0;
        // A different way to do something similar to findForChild():
        for (RecordPtr person : people()) {
            if (Relate::UNMARRIED_PARTNER == RELATE(person) &&
                Sex::FEMALE == SEX(person)) {
                mom = PERNUM(person);
                break;
            }
        }
        return mom;
    }
    int pernumMatching(RecordMatches func) {
        RecordPtr person = findPersonMatching(func);
        if (person == nullptr) return 0;
        else return PERNUM(person);
    }
};
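The class above relies on `findPersonMatching`, which is not shown here. One plausible way to build such a predicate search is `std::find_if`. The sketch below is not the runtime's implementation: the `Person` struct, the type aliases, and the free-function signatures (which take the household explicitly) are all hypothetical simplifications.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <vector>

// Simplified, hypothetical person record for this sketch only.
struct Person {
    int pernum;  // position in the household, numbered from 1
    int relate;  // relationship code
    int sex;     // sex code
};
using PersonPtr = std::shared_ptr<Person>;
using PersonMatches = std::function<bool(const PersonPtr&)>;

// Return the first person satisfying `matches`, or nullptr if none does.
PersonPtr findPersonMatching(const std::vector<PersonPtr>& people,
                             const PersonMatches& matches) {
    auto it = std::find_if(people.begin(), people.end(), matches);
    if (it == people.end()) return nullptr;
    return *it;
}

// Return the matching person's number, or 0 when nobody matches.
int pernumMatching(const std::vector<PersonPtr>& people,
                   const PersonMatches& matches) {
    const PersonPtr person = findPersonMatching(people, matches);
    return person ? person->pernum : 0;
}
```

Returning 0 for "no match" works because person numbers start at 1, which mirrors the convention in the class above.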
A different approach:
#include <functional>
#include <algorithm>
#include <memory>
#include "Relate.h"
#include "Sex.h"
class Poploc : public Editor {
#include "../helpers/FamilyRelationship.h"
public:
    Poploc(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
        // Examine each person in combination with the current person.
        // First possible match wins.
        for (RecordPtr potentialFather : people()) {
            if (possibleFather(potentialFather)) {
                setData(PERNUM(potentialFather));
                break;
            }
        }
    }
    bool possibleFather(RecordPtr parent) {
        int age_diff = AGE(parent) - AGE();
        return (age_diff > 17 &&
                SEX(parent) == Sex::MALE &&
                validFatherRelationship(RELATE(parent), RELATE()));
    }
    bool validFatherRelationship(int adult, int child) {
        // This is a list (vector) of pairs
        vector<pair<int,int>> relate_pairs = {
            //{Relate::HEAD, Relate::CHILD},
            //{Relate::HEAD,Relate::CHILD_OF_HEAD},
            //{Relate::UNMARRIED_PARTNER,Relate::CHILD_OF_HEAD},
            //{Relate::PARENT,Relate::HEAD},
            //{Relate::CHILD,Relate::GRANDCHILD}
        };
        return relationshipPairMatches(adult, child, relate_pairs);
    }
};
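The lookup performed by `relationshipPairMatches` is not shown in this document. A minimal sketch of what such a helper could look like follows; the signature and the body are assumptions, and the real helper may differ.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical pair-lookup helper: true when (adult, child) appears in the
// list of allowed relationship pairs.
bool relationshipPairMatches(int adult, int child,
                             const std::vector<std::pair<int, int>>& allowedPairs) {
    return std::find(allowedPairs.begin(), allowedPairs.end(),
                     std::make_pair(adult, child)) != allowedPairs.end();
}
```

In this sketch an empty list never matches, so a rule table with every pair commented out would reject all candidates. Keeping the allowed combinations in a data table rather than in a switch statement makes them easy to review and extend.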