IPUMS Data Conversion as a Slow Motion Entity System

Applying the Entity System design to the IPUMS DCP (Data Conversion Program) could improve performance as well as provide new insights into the underlying problem space.

Why an Entity Component-System Design May Benefit the Data Conversion Process

In short:

Increased flexibility for users
Better performance
More maintainable code

The first point is quite important, as we’d like researchers to have the ability to add new survey variables and datasets without rebuilding the application, while keeping the application in a statically typed compiled language. While we nominally have achieved this already we cannot easily add new attributes to the main entities of variables and datasets without writing new code and rebuilding and re-releasing the software.

Higher performance on simple computations across data as well as better caching of intermediate conversion results would be enabled; it remains to be seen if they would be worth the added complexity. More discussion later.

The code could potentially end up in a more maintainable state by grouping data storage concerns together, and data processing concerns somewhere else. More modular design would follow, as the code reflected the stages of data conversion. More discussion of this in the Systems section.

Entities and Components in the Data Conversion Domain

With the understanding that entities in an E-S design may relate to one another hierarchically, or more generally, their relationships may be modeled by a relational schema, we will ist them and their relationships. See this key to terminology used throughout the rest of this post. There’s a rather direct corespondence between our micro-data terminology and relational database terms. Entities may belong or have other entities as with tables modeling entities in a relational database.

The Entities

Variables (belong to records, belong to datasets)
Categories (belong to variables)
Datasets (have many records, have many variables)
Data Blocks (have many records, belong to datasets)
Records (belong to data blocks, have many variables, belong to datasets)
Cases (belong to records, variables)

Some redundancy exists in the sets of relationships; we’re not going to model all the joinings or all the “has many”…“through” relationships.

Variables have many cases per dataset. Think of all the cases for a variable as corresponding to all a column’s data in a database table. Cases have “data item” components storing each stage of processing each case.

In an ES design each entity can have many components. Some examples:

The Components

names
data items
labels
flags
start, width (formatting info)
data type
counts
time

Though we’re not designing an object oriented system here it will help to connect entities with the components they will need:

Variables have: name, label, data type, start column + column width (for fixed width formatting), and many more. Most importantly they have data items through the case entities belonging to them.
Categories have labels, counts and cross tabulations
Records have: sequence number, data items, type, parent and child types (schema)
Data Blocks have: record order and numbers and sequence numbers
Datasets have: name, time (month,year,day) , geography
Cases have (raw, recoded, source, edited) data item type components not all case entities have all components.

Some components will change more frequently than others; the dataset components don’t change at all, some variable components likewise remain the same (take names or data types for instance.) Data item components of variables change during execution of DCP – that change being the entire purpose of the application, after all. The category count components change most frequently as every change to a data item may alter a particular category’s frequency count.

Most components belong to the Variables entity. Data items (may) belong to both records and variables through their cases. We could model them as accessible through variables or through records. This is where designing an ES framework for DCP gets tricky. Cases are composed of different types of data items, representing stages of data processing. You can model cases as belonging primarily to records or variables or both.

The Systems

What the Systems Need to Do

The DCP transforms micro-data on a dataset by dataset basis. In simple terms, it transforms every field on every record in a dataset into another, where those values conform to standardized coding conventions, have some necessary data cleaning applied and most importantly have complex user-defined computations applied where those computations depend on values on many fields and neighboring records.

What the Systems Cannot Do

A system to transform the data won’t fit the map-reduce paradigm at all Because of the large amount of change to the data; it’s more of a map-only problem. Rather than any sort of product of a “reduce” phase such as summary statistics, the program needs to produce another, usually larger, dataset. That said, a very useful set of artifacts should rebe created in addition to the IPUMS data: frequency counts of each variable or stats derived from continuous variable values, some set of predefined cross tabulations, and checks on any missing data values.

Key computations depend on multiple fields and nearby records, so fine-grained parallelism isn’t possible either. We can’t transform all data for a given variable for a dataset independantly of others. And we cannot even transform records in isolation; this requires extensive use of surrounding records in the same data block (in the cases of census data these are households.)

However, data blocks represent boundaries beyond which parallel processing may occur. So the processing of the data may split along data block boundaries; those blocks don’t have uniform sizes which throws an additional wrench into the works, but mostly the size variations average out across enough workers.

The Systems to Update DCP Component Data

One advantage of organizing data into components in the Entity (component) System design is for efficiently updating components. Well organized components allow for fast updating which is critical in games. Every frame in a game requires some component updates, and a frame may be updated as frequently as thirty or sixty times a second by its “Systems,” hence the need for efficiency.

As component-systems update the state of the game, uman players react to the new situation, take some actions effecting the next round of updates, and the situation in the game evolves iteratively.

What’s a frame in the context of data conversion? Once we conceptualize the data items as components of variables it becomes clear that an update cycle on those components requires a complete execution of the data conversion process. At the end of execution, human researchers check data for errors and inspect it for changes they’ve attempted to make in the data. They take the new state of the output data into account, apply more change instructions, and re-execute, evolving the data until it’s ready for release. (And this isn’t the end of the story, there will be many releases of any given dataset…)

In a single threaded context this execution is relatively slow, owing to the large size of most datasets, on the order of minutes to many hours even with reasonably optimized C++ doing the work on a modern machine. The fact that only a small slice of the full data item set for each variable is in main memory at any given moment is an implementation detail. A version of the I/O system done with memory mapped files clarified this for me. Think of the entire dataset as the playing field of the game which must get a update during the current frame.

To achieve a quick turn-around we distribute computation along data block boundaries to as many cores as a machine has. By writing to shared storage we can also distribute the work across nodes in a cluster without needing to concatenate results of the distributed computation. Of course this puts a load on the networked connection to the storage and so there’s an upper limit on how much throughput such a setup has and thus how many cores it can effectively harness.

An entity component-system architecture could theoretically use all available cores more efficiently because we could avoid running through all steps of the conversion process every time we make new data (more details later.) I/O would still present a problem since even if we distribute the results across nodes, at some point all results must get stitched together to make the final dataset. In future, as datasets get larger, whichever data conversion architecture approach we take, we may use a partitioning scheme to avoid this final step (think Apache Hive and Spark.)

Actually designing an efficient component-system update process for the data item components may be difficult since the update system will need to read data from multiple records and other data items and need to know if they’ve been updated already. Further complicating matters, we have to keep pre-updated versions of the data item components around for debugging, data quality checks and some computations.

We can consider a couple of approaches:

Organize data item components on a record-by-record / row basis:

Only records in active use are mapped into the data item components and they are ordered by variable on the records. This is the most obvious approach and the current implementation more or less. Most directly addresses the multiple variable and record dependance issues.

This approach also allows straight-forward parallelism on a data-block level. We don’t make efficient use of the CPU cache on the simple data recoding phase, but in the more expensive computation phase data for the computations is close-by and should be relatively efficient to read.

Under this scenario, systems would include:

read and update (raw data making up a data block of record data)
extract (raw data into input data items)
recode(record worth of data items)
compute (variable data items needing additional computation on the record)
count(category values from data items)
format (record worth of data items)
convert(repeat above sequence until raw input data exhausted)

Organize data items into collections on a per-variable basis:

This is the more complex, but possibly more powerful design. The hard parts first.

Challenges:

Memory and I/O will constrain what goes in main memory but we can mostly abstract away that issue. Since some classes of computation require comparing variable data items on the same record, some virtual records will need to be constructed which can be expensive. The human oriented API for computing on the micro-data presents the data grouped by data block and records within those blocks and this interface must be preserved – a “row oriented” presentation vs a “column oriented” organization.

Another downside: Parallelism will be more difficult to do well. Not an insurmountable level of difficulty but significantly more complex.

In this approach copies of partially converted data must be preserved to be used by the parts of the downstream process; this means more disk storage is required, by about 4.25 times – except that we could store this intermediate data in more compact formats than the raw input data, to the tune of about 2.25 times more storage instead of 4.25 times.

Upsides:

However, the simplest class of computation, the one which applies to all variables, doesn’t need any data other than the current data item. We apply simple recodings to the input data items, then treat those results as inputs to other variables and apply simple lookups to those variables in turn. There are only two phases: Mapping “source” variables and trivially recoding them, then recoding those results into “integrated” variable values.

There’s an initial “extraction” or “data inversion” step, but this would only run when new source variables are added, which will be less frequent. I would classify the inversion step as a form of input data preparation and (possibly unfairly) locate it outside the problem space of the conversion program.

If we could group the various data item components by variable, application of the recoding tables could be done very efficiently without blowing the CPU cache. Before pursuing this strategy further we’d need to consult a Flame Graph of the application performance to determine if the change would be worth-while. On some datasets I know the majority of time is spent in recoding, on others not so much.

An additional – and potentially much larger – benefit, as mentioned earlier, could be that we could cache the entire recoding results and only re-run them when recoding tables changed. Since these tables change a lot during parts of data development the approach might not yield a lot if the caching process weren’t really efficient on its own. A cache for variable oriented (column oriented) data will by its nature be inefficient to update, and the size of the data will be as large as the final dataset, and that is large enough to present an I/O bottleneck on the entire process whether it’s updated in-place or written out fresh each time.

The best case scenario for the variable oriented data organization would be that we can detect changes to a single variable’s recoding scheme, apply it and update the cached “recoded” version in an efficient way (say if it were a single file or contiguous part of a file.) The Parquet format sounds appealing but as far as I know due to compression and the way it stores data in row-group chunks, doesn’t allow for in-place updates. Probably simple binary files, one for each variable are the best implementation here. Even 200 million items read in, updated and written out on their own will not take too long, and that is the largest dataset we currently have. (Keeping in mind that this set of data items represents just one of hundreds of variables in that dataset.)

Given that we have such “variable” files, we can then open only those needed for the heavy duty computations. There will be rather a lot of these, but still only a small portion of the entire set of variables.

Systems for this approach include:

extract/invert(raw data into data items grouped by source variable)
recode(variable worth of data items)
integrate(using source variable recodings, do further recoding)
renew(selected variable recoding sets)
virtualize_records(prep subset of data needed in the next step for use with the API)
compute(using only necessary integrated and source recoded data item sets)
assemble(computed data item sets, recoded data item sets into rows and data blocks)
format_and_output(assembled data blocks, output final row oriented data)

The “assemble” and “format_and_output” systems will be the most consistently slow because of the I/O penalty they must incur. The “extract” system will be slow but it won’t need to run on every conversion execution, only when new source variables get added.

As you can probably guess, applying these systems is more complex than with the first approach; some are conditional on cache invalidation, and you know what they say about that…

Implementing this approach in the real world would require using column oriented file format notions of “row groups” even for single threaded execution; we have to conserve memory. Implementing parallel execution requires some coordination between processes and keeping the data in sync at each stage could be tricky. While the Parquet format won’t do as an input or intermediate store of IPUMS data under conversion, it will certainly be a good format for the final converted data. The Apache Arrow representation of columnar data would probably do nicely as the interface between the recoding and the record virtualization stages – that way we could bolt any back-end data storage onto the user facing front end where the complex data conversion takes place.