DCP-Multi has (I’m declaring now) emerged from experimental status and it’s time to formalize and document what we’ve created.
While developing and supporting the DCP-Multi program I’ve realized there are a few topics all users of DCP-Multi should know about:
- Data types in C++ and data types in DCP-Multi and our metadata
- Conventions used in the DCP-Multi data editing API
- C++ 11 features and idioms used in the editing code
- How to check for syntax mistakes, how to read errors, warnings and notes
- How to debug runtime errors generated by DCP-Multi
- Using audit flags
- Building and running for a single data project
Details below the fold.
How to Make Data
You make a new integrated data file for a dataset by running DCP-Multi. Usually you do this because (1) metadata has changed (new codes, new variables, new datasets), (2) input data has changed, or (3) new data editing rules were added.
For (1) and (2) you no longer need to rebuild the DCP-Multi binary: it will pick up new variables, samples, categories and recodings in metadata automatically. For (3), the process of adding data editing rules now requires fewer steps and prevents mistakes, and the API through which you interact with the editing back-end results in more readable code.
Running DCP-Multi
You run with similar options and arguments to the classic DCP application:
dcp_multi [options] dataset project [optional args]
Options get printed if you run ‘dcp_multi’ with no arguments. Optional arguments following the required dataset and project can be an allocation script file path and/or a segment specification (this is new).
New Feature: Segments
If you want to run a large or slow dataset in parallel you can divide it up into segments. You could do the following:
dcp_multi ih2012 ihis segment=1:5 &
dcp_multi ih2012 ihis segment=2:5 &
dcp_multi ih2012 ihis segment=3:5 &
dcp_multi ih2012 ihis segment=4:5 &
dcp_multi ih2012 ihis segment=5:5 &
Each of these instances of the program would operate on 1/5 of the data. The output goes into the standard location; no file concatenation is required at the end.
Doing the segmentation efficiently requires an index file for the input data; this lives in /pkg/ipums/
You can create an index with -X:
dcp_multi -X ih2012 ihis
There is a “dcpSeg.run” script to simplify running DCP-Multi in segmented mode:
dcpSeg.run ih2012 ihis 25
This runs the IH2012 dataset in twenty-five parallel parts, so in principle it will complete in 1/25 the time of running without any segments. The script handles book-keeping tasks like waiting for all segments to complete and summing up all variable frequencies from all segments into a single set which will be needed when deploying data.
CSV Output
If you want the data in CSV format rather than fixed width, use the -x option (lower-case ‘x’). You will get one file per record type, each with a header row containing variable names. CSV output does not currently work with segments.
Database Cache for Metadata
Certain projects have a lot of metadata which gets loaded on each startup of DCP-Multi. After loading, the metadata gets saved to a local cache, either in the current working directory (when you use -l (local)) or in the standard location. To load metadata from the cache instead of the database, use -c:
dcp_multi -c ih2012 ihis
To create a cache of metadata for a specific dataset without running DCP-Multi, use the -m option.
Interactive Mode
To enter interactive mode, use -I. Instructions will follow and you can type “help” at the prompt.
Verbose Output
To get a progress bar and messages about metadata loading status, use -v. Note that some progress bars will not render correctly if the “pfreq” and “hfreq” values in the “samples” metadata aren’t correct. These values ought to be the number of household and person records in each sample.
Combining Options
You can combine options, like “dcp_multi -lcx ih2012 ihis”.
Options List
- D Donate to missing data server or create local donor file. (mdserver configured in dcp-conf.ini.)
- S Connect to missing data server for allocating missing cases.
- I Start interactive mode.
- X Create an index file.
- a Log all data edits with audit numbers.
- c Load metadata from local cache project_dataset.sqlite.
- f Specify non-default input data filename.
- l Place all output files including data in the current directory.
- m Only create local cache of metadata, do not make any other output.
- v Verbose operation with startup status and progress meter.
- x Output data in CSV files, one file per record type. Makes a *.ddl file with create table instructions for loading CSV data into a database.
DCP-Multi does not yet support the other options listed. The -r (“reproduce frequencies”) and -e (“produce efficient layout”) options will get added soonish.
Building DCP-Multi
Build DCP-Multi if you need to add data editing rules or you need a local copy of the executable for some reason. Otherwise an up-to-date copy should reside in /pkg/ipums/programming/ and you should use it.
See the directions in the README file in the source code. Essentially you invoke a rake task named after the target: “dcp_multi”, “dcp_classic” etc.
rake dcp_multi
DCP-Multi is written in C++11 and compiles with either GCC 4.8.x or Clang on Mac OS 10.7 or later.
Rake
The “rake” Ruby program is a task runner framework named after the UNIX “make” utility; the “r” rather than “m” stands for Ruby Make. Instructions in Rake are written in Ruby, so complex build tasks can be scripted easily.
Many UNIX systems will already have a version of ‘rake’, and nearly all should have a system Ruby. No special version of Rake is required. You may also use Rake from JRuby rather than standard Ruby. If you have JRuby but no Rake (invoking ‘rake’ produces a “command not found” error), then you would do
gem install rake
With system Ruby you need permission to install a gem:
sudo gem install rake
So sometimes JRuby is simpler, since you can install it wherever you like, assuming you have Java on your system, which you should.
To see all tasks, do
rake -T
You get a list of all tasks, most of which are modules of the DCP-Multi application that can be built separately. The “dcp_classic” and “dcp_multi” tasks build the classic and new versions of the application, respectively. You can pass environment variables on the command line along with the task name, which I’ll show next.
Building the Application
The simplest way to build the whole application is
rake dcp_multi
This will produce a file named “dcp_multi” in the “../code/bin” directory in your work area.
You can build a single-project executable to save time and skip over other projects’ rules, in case either (1) a syntax error in another project’s rules prevents compilation, or (2) code generation requires permissions to a metadata area you’re not allowed to read. For instance, not everyone has permission to read the /pkg/ipums/dhs/metadata area. The ‘offline_metadata’ directory in the source code provides a backup in this scenario, but it may not be up-to-date.
Build a single project executable with:
rake dcp_multi project=<project-name>
This task builds an executable file and puts it in “../code/bin”. The file will get named after the project so as not to overwrite the “dcp_multi” executable you may also have in ../code/bin.
rake dcp_multi project=ihis
results in “ihis_dcp”.
When running “ihis_dcp” all command line options and arguments apply, but if you try to run another project you get an error stating that the project is not supported; “ihis_dcp -l ih2012” works, “ihis_dcp -l at2003 atus” does not.
If you wish to build for a specific project to check for mistakes in editing rules you changed or added, you can skip compilation and just check syntax which goes fairly quickly:
rake dcp_multi project=ihis syntax=yes
You will not get any executable file in /bin.
Notes, Errors and Warnings
When building DCP-Multi the C++ compiler will emit three types of information: notes, errors, and warnings.
Even one error will halt compilation and prevent you from building a working version of DCP-Multi. However the compiler often will issue multiple errors before it stops. Address the first error first; subsequently printed errors may get resolved by fixing the first one.
Warnings get printed when the compiler finds code that indicates an unsafe practice of some sort. Warnings do not halt compilation unless a specific flag for that warning gets passed to the compiler, so you often want to see as many warnings at once as possible. Warnings are not all right to ignore: 95% of them ought to be cleared up before continuing. In most cases the warning text will print the offending line and give an explanation of the problem. Occasionally the warning is bogus, either because the thing triggering it isn’t actually a problem or because of a compiler bug (rare).
Notes are suggestions of how to simplify your code. Sometimes they are annoying as you can’t act on the suggestion. They are worth considering though.
Changing and Adding Data Editing Rules
Code Organization
All user code lives in the “../code/rules/” directory of the DCP source code tree. User-edited code lives mostly in the per-project “variables_to_edit” subdirectories (for example “../code/rules/atus/variables_to_edit/”).
Steps to Adding an Editing Rule
Create a file, named after the variable you want to edit, in the project’s “variables_to_edit” directory. The variable must exist in metadata, either in the variables list in the control files or in the internal variables list.
In that file, make a class named after the variable, with the following boilerplate constructor:
class Newvar : public Editor {
public:
    Newvar(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
    }
};
Now build the project with “rake dcp_multi project=<project-name>”.
After testing the output make sure the file “Newvar.h” gets added to source control.
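As a hypothetical illustration of a filled-in rule (the variable name and the top-code of 99 are invented; the NEWVAR() accessor and setData() follow the patterns shown later in this document):
class Newvar : public Editor {
public:
    Newvar(VarPtr varInfo) : Editor(varInfo) {}
    void edit() {
        // Value produced for NEWVAR by its translation table for the current record
        if (NEWVAR() > 99) {
            setData(99);  // replace anything above the made-up top-code
        }
    }
};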
Order of Editing and Data Flow
Data Flow
Input Data -> DCP-Multi -> output data.
When producing output data, DCP-Multi also provides a description of that data, breaking it up into “variables”. If the data were flat, these variables would be interchangeable with the concept of “columns” in a database table or CSV file. CSV files are just one format for DCP-Multi output; by default it produces a hierarchical text output where multiple record types get mixed together in a useful way, so strictly speaking columns don’t exist, but a codebook giving the layout and the variables found in it is provided.
Every variable in output data (called an “integrated” variable) must have sources for its data. A source may be (1) results of computations, including computation on values of other variables, (2) source variables, or (3) data taken directly from the input data used to create the dataset, the unharmonized data.
“Source variables” are variables identified in the input data and are by definition unique to that dataset. Their “source” can be other source variables or raw character offsets on a record type (columns 7..11 on record type ‘P’, for instance).
True “integrated” variables – those that exist across many datasets in the same data product – all must use source variables or computation as their source.
Source variable data values and integrated data values may get “translated” or “recoded” into a cleaned-up and/or integrated coding scheme. Users supply tables of these recoding pairs (input -> output), entering them into spreadsheets known as “translation tables.” The tables contain all the possible input values across all datasets in a data product and map those inputs to one unified set of output codes, the ones that will get put into the output “integrated” data.
For a given integrated variable, data flow of one value would look like:
- Input data gets copied from (start..end on one record) == “INP”
- The source data item takes “INP” as input and produces “OUTP” via its translation table.
- The integrated variable takes “OUTP” from the source variable as input to its translation table and produces “INTEG”.
- “INTEG” gets adjusted for cases needing adjustment more complex than a table lookup (the “EDIT” step).
- The “EDIT” data gets formatted for inclusion in the output data.
(INP -> OUTP) -> (OUTP -> INTEG) -> (INTEG -> EDIT) -> FORMAT
Order of Editing
All recoding (translation table look-ups) happens before any data gets edited within a record. This is so that variables other than the one in question will have data available in an “integrated” form, which allows editing logic to apply to all or most datasets in a data product. Often the editing logic for one variable needs data from other variables to do its job.
Editing rules may also rely on the results of editing other variables. For this to work a strict order must be observed. This order is user defined; find it in the main project class under “../code/rules/” (for example “../code/rules/ihis/Ihis.cpp”).
Setting up Editing Rules
Editing rules belong in classes in the per-project “variables_to_edit” directories under “../code/rules/”.
Within these classes user defined “rule” code goes into the edit() method and methods called from edit().
Each variable class in “variables_to_edit” must provide an edit() method. This gets invoked from within a loop over all records of the type the variable belongs to.
For example, for RELATE, DCP-Multi calls the edit() method once for each person record. Therefore the code in edit() in the Relate class should be understood to execute in the context of the “current” person. Any person-type variables you refer to also get their data from the “current” person. Household-type variables get their value from the “current” household, of which there can be only one.
Example:
if (GQ() < 2 && AGE() < 16 && MARST() != Marst::NeverMarried &&
    RELATE() != Relate::Child) { /* ... do something ... */ }
If you refer to a child-typed variable, like “DURATION” on the activity record, from an editing rule of a person record (this could happen in ATUS), you’ll get a runtime error, as DCP-Multi has no way of choosing among the multiple activity records. You can always pass a specific record pointer to that variable’s accessor method, though (covered elsewhere); see the sketch below.
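For instance (a sketch; “activityRecord” is a hypothetical record pointer you would have obtained through one of the record-access methods covered elsewhere):
auto minutes = DURATION(activityRecord);  // an explicit record pointer avoids the ambiguity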
If you want to loop over all the people explicitly for some reason, just do so and set the data value. The edit() method will not get invoked for records where the variable already had its data set.
Overall editing order: variables get edited in the order found in the main project class. Look in “../code/rules/ihis/Ihis.cpp” for an example. You call applyEdits() on each variable’s accessor object, like:
AGE.applyEdits();
MARST.applyEdits();
RELATE.applyEdits();
....
Internal Variables
You can add variables for internal use only. These do not show up in the output data files, but in the editing code they appear as if they were integrated variables. Add a row to the “../metadata/control_files/internal_variables.csv” file and DCP-Multi will automatically pick up the variable when you run.
Lookup Tables
Lookup table metadata lives in the “../metadata/lookup/” directory.
If you use group names you need a “groups.txt” file to define what those groups are.
The look-up tables have two or more columns. The first column is the “input” column, containing the values or keys the look-up table will accept. The next column or columns hold the values the first column’s values map to. The column headings after the left-most column name the variables for which the lookup table will be available within data editing code, and they also declare the type of the output values. The input column must also have a type declaration.
For instance you could have a table like this:
input:integer,MBPL:integer,FBPL:integer
00057,9900,0000
527,10000,10000
60,10000,10000
66,10500,10500
69,10750,10750
73,11000,11000
78,11500,11500
.....
Then in the variable editing code in “../code/rules/atus/variables_to_edit/Mbpl.h” you might do
class Mbpl : public Editor {
public:
Mbpl(VarPtr varInfo) : Editor(varInfo) {}
void edit() {
auto input_bpl_code = MBPL();
auto bpl = lookup(input_bpl_code);
setData(bpl);
}
};
This code functions like a translation table: the result of “auto input_bpl_code = MBPL();” is some code that isn’t in the code scheme we want to output, and we can presume the translation table here simply passes the source variable value through. We use a look-up table because the number of codes in the output columns makes working with the translation table difficult, not because we’re doing anything conceptually different from a regular translation table. The main thing to notice is that there is only one set of input codes and multiple output code sets (though in this example the output codes are mostly the same). Translation tables are really meant for the case where input codes vary from one input dataset to another and need to get mapped onto one set of output codes; a look-up table is rather the reverse situation.
A second use for the look-up table is “attaching” other data to the input dataset. You could construct a look-up table where the input column contains record numbers (serial numbers) for a record type such as person. You might then assign values of a set of variables to the records identified in the input column, which could be all the records if you wanted. For instance, you might need to add latitude and longitude values to each household record:
input:integer, lat:float, lon:float
110001,45.0051,30.001
.....
Then you could create “LAT” and “LON” in the integrated variables set for the data project, and in the editing code do
class Lat : public Editor {
public:
Lat(VarPtr varInfo) : Editor(varInfo) {}
void edit() {
setData(lookupFloat(SERIAL()));
// Or if Lat is a string rather than a float
// setData(lookup(SERIAL()));
}
};
By doing this you have avoided actually modifying the input data.
Another compelling use for a look-up table could be assigning city populations based on city code. An alternative form of the lookup() method takes two arguments: The lookup input code and the name of the output column. By default the output column name is assumed to be the name of the current variable. However if you pass in the name it gets used regardless of the current variable name.
You could have a look-up table of city codes and populations:
input:integer, us1960:integer, us1970:integer, us1980:integer, us1990:integer
1220, 36333, 42911, 44181, 50112
3001, 925, 851, 803, 503
....
Then in your code for CITYPOP:
class Citypop : public Editor {
public:
Citypop(VarPtr varInfo) : Editor(varInfo) {}
void edit() {
auto city_code = CITY();
const string dataset_name = datasetName();
setData(lookupNumber(city_code, dataset_name));
}
};
This code would work as long as whatever data-set you run has an entry in the CITYPOP lookup table.
Look-Up Groups
The names of files in the ../metadata/lookup/ directory must follow one of these patterns:
<data-set>_anyname.csv
all_anyname.csv
group_<group-name>_anyname.csv
If the first word in the filename for a look-up table names a data-set in the project, then the table gets loaded only when running that data-set. If the first word in the name is “all” the table file gets loaded for every data-set. If the first word is “group” then the second word must be a name of a “group” named in the “groups.txt” file. That file must also reside in the ../metadata/lookup/ directory.
In the previous CITYPOP example you would need to name the look-up table file something like “all_city-populations.csv” so that it was available for all data-sets.
Or let’s say no city population information is available before 1960 (totally not true) and you don’t want to write extra logic to handle that condition in the editing code; you could create a group. The groups file could look like:
[pop-available]
us1960
us1970
us1980
us1990
[pop-not-available]
us1950
us1940
us1930
us1920
us1910
Then name the look-up table file “group_pop-available_populations.csv” and it will get loaded only for those datasets where it makes sense to use the table.
Data Types
Every variable in the IPUMS metadata has a data type. The type defaults to “integer”: if the metadata has no type information for the variable, “integer” gets used. The “data_type” column in the variables control file / list can be blank or “double”. There is also a “string” column which can be “1” or blank. This is a sort of broken specification; the “data_type” column ought to fully describe the variable’s type, and we will move to this as soon as possible. The type information is mutually exclusive in that you cannot specify a string value of “1” (true) and also a data_type of “double”. Note that the “decim” column specifies fixed-width decimal numbers for output formatting purposes only; it does not actually change the data type of the variable.
DCP-Multi has three conceptual data types for data variables: string, integer and floating point. Since data editing code is written in C++, you express these with actual C++ types: ‘string’ (C++ strings); ‘int64_t’, a 64-bit signed integer synonymous with ‘long’ on our machines; and ‘double’, double-precision floating point. Use those types to be safe.
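For example, a minimal sketch using accessors that appear elsewhere in this document:
int64_t age   = AGE();                      // integer variables
double weight = SPECIALWEIGHT.getDouble();  // floating point variables
string name   = NAMEFRST.getString();       // string variables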
Input Data Types
By saying that every variable in the IPUMS metadata has a type I really mean that the type of the output of that variable is specified and enforced at compile time as well as run-time.
Input data types have no specification. The input to a variable consists of data from a source variable, raw data in an input file, or data resulting from a computation. Input for variables gets handled either as input to a translation process or with programming in a data editing rule.
In a translation table the kind of the source is specified, and – if it makes sense for the kind of source – named, and (optionally) input data values are listed. The string literals representing the possible input codes allow DCP-Multi to infer the input type, but this is actually quite difficult to do in a human-friendly way. Originally the input codes were meant to be literal character sequences found in the raw input data, but it became convenient to write numbers without leading “0” characters or spaces, since the width of the input character sequences is specified in metadata and the leading characters were redundant.
So, most input codes can be read in and internally converted to integer values, in principle.
That system is all fine until you encounter some junk values you want to translate and yet want to specify most input codes as a human would write a number:
12
3
00XY
00BB
The intent of the first two entries is that the input code will actually look like “0012” or “0003”. The last two entries are to be taken literally, as they could not be converted from a decimal representation – that’s the deciding factor for whether input codes get formatted internally or not. The goal of the formatting is that the input code will match what’s expected in the raw input data.
For this internal formatting to work properly, the input data has to get reformatted systematically so that blanks are “0” when the data should represent a number, or “B” when the blank represents non-numeric data. Some input data follows the leading-zero convention; other data has blanks (space characters) to hold the number’s place, so we have to make it consistent before integration or do multiple lookups at the time of recoding.
Internally, all data structures holding translations are maps with string keys and different types for output values (mostly map<string, long>, but also map<string, double> and map<string, string>).
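In other words, the internal shapes look roughly like this (the names here are illustrative only):
map<string, long>   integer_translations;
map<string, double> double_translations;
map<string, string> string_translations;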
Source Variable Types
Like all other variables source variables have data types. Understanding implications of choosing good types for source variables will help you understand how to get the desired data and formatting in the integrated variables the source variables feed.
While every source variable has a specified output data type, any time that data is used by an integrated variable as a source, the output data must get converted back into string form in order to be used by the integrated variable’s translation table. This approach is the simplest way to handle the many cases where source variables simply pass through raw input data without recoding it, making the integrated variable responsible for recoding junk values in data that otherwise ought to be all integers. In these cases the output type of the source variable is “string”.
Ideally source variables would be responsible for cleaning / recoding junk values but some data projects have such a vast number of source variables that the time to construct these recodings just is not available from the staff.
In cases where the source variable type is “integer” we ought to – as a good optimization – store the keys of the integrated variable’s translation table as integers and skip the string conversion from the source variable’s output.
Working With Source Data in Editing Rules
The editing API provides a function called getSourceData() which will return the source data for any variable. The result type is always “string”. This value ought to match precisely an entry / key in the current variable’s translation table as those are always strings as well.
In cases where the source data came from an integer-typed source variable, you can call the variable’s data access method directly to work with the data arithmetically:
// Get the numeric value of the source data for AGE as it will appear in output
// (if source variables are part of the output data). This is the source variable
// for AGE, and this source variable is an integer type:
long input_age = US1950A_1051();
if (input_age > 99) {
    // ... do some stuff because age was not top-coded and that fact is interesting ...
}

// Get the string which the above source variable passes to the AGE translation table
string age_source_str = AGE.getSourceData();
Also remember that you can call getSourceData() directly in an editing rule and it will return the source of the translation table for the current variable. This can be very useful when the source of the translation table is the concatenation of two or more source variables, for instance:
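A hedged sketch (the raw value “9999” and the recode to 0 are invented for illustration):
void edit() {
    // getSourceData() with no variable prefix refers to the current variable's source string
    string raw = getSourceData();
    if (raw == "9999") {   // hypothetical junk code not present in the translation table
        setData(0);
    }
}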
Type Conversion and Casting
It’s important to understand the difference between type-casting and type conversion. Type-casting does not change data; it just changes the ‘tag’ the C++ compiler attaches to the data. Type conversion takes data representing one type and transforms it into data of another type.
For instance, to_string() takes a number and produces a string that represents that number: to_string(25) == “25”. This is a conversion. Type-casts are unfortunate because you might think they are converting something, when most of the time they give you junk.
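A small sketch of the difference (the variable names are made up):
long code = 25;
string s = to_string(code);        // conversion: s == "25"
char c = static_cast<char>(code);  // cast: c holds the character with value 25
                                   // (a control character), not the text "25"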
[a lot more to say here]
API
The editing API is essentially just a set of C++ functions available in any data editing rule’s edit() method. See the “DCP-Multi editing API documentation” page on the wiki.
Conventions in DCP-Multi and C++ Standard Usage
In editing rules, variable names get written in upper-case. API functions and other C++ functions get written in lower-case or camel-case for the most part, except that many C++ standard functions use snake-case (“to_string()” for instance).
if (RELATE() == Relate::Child) { ...}
In this code RELATE accesses the value of the RELATE variable which is part of the data in the output for the current dataset and project.
RELATE has many methods attached to it, in fact all the methods in the data editing API. You access different data types in different ways:
Default is integer:
RELATE() // current case
RELATE(2) // person 2
RELATE(personRecord) // RELATE from the personRecord pointer
For strings:
NAMEFRST.getString();
NAMEFRST.getString(2);
For floating point:
SPECIALWEIGHT.getDouble();
SPECIALWEIGHT.getDouble(2);
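Putting these together, a hedged sketch (this particular combination of conditions is invented for illustration):
if (RELATE(2) == Relate::Child && SPECIALWEIGHT.getDouble(2) > 0.0) {
    // person 2 is a child with a positive weight; grab person 2's first name
    string first_name = NAMEFRST.getString(2);
}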
Debugging Editing Rules
- Use interactive mode
- Use cout when you can’t think of anything else
Interactive Mode
Start DCP-Multi with -I :
dcp_multi -I ih2012 ihis
There is online help.
Audit Flags
You can call setAudit() on any variable or record. A variable or record may have as many flags as you need to set. These are used to mark special bits of editing code so you can determine how many cases a particular branch of editing logic gets used on. This feature is still experimental. Eventually audits will get written to a special log (for now they go into the general DCP-Multi log file under “../output_data/current/log”). Ultimately this log – when enabled – would contain every modification to every data item so you can step through edits and recodes without executing DCP-Multi, as a form of forensics.
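A minimal sketch, assuming setAudit() takes an audit number and, like setData(), applies to the current variable when called inside edit() (check the API documentation for the exact signature; the flag number here is made up):
void edit() {
    if (AGE() > 99) {
        setData(99);
        setAudit(3);  // hypothetical audit number marking this top-code branch
    }
}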
Many times the interactive mode would suffice for tracking experimental edits, and you could display values of audit flags there rather than picking through a log.