Today I’m going to focus on data-oriented design inspired optimizations. Previously, I also did a handful of API improvements made obvious from a flame graph profile, and then replaced all uses of the STL “unordered_map” type with a more efficient implementation of a flat hash map.

These topics were covered in earlier posts:

The profile-guided optimization in Optimization Part IV hasn’t been integrated into the DCP build process because it doesn’t fit the current build-release-deploy workflow well; it’s simply a nice demonstration of the power of PGO.

The first two rounds of optimizations (algorithm improvements, better hash tables) had already been integrated into the master branch before the work described in this post.

The optimizations were additive: I took the algorithm improvements and added the hash table substitution discussed previously. On top of that, I added some data-oriented-design-inspired changes.

The data-oriented design improvements yielded the best speed-up by far.

This is a long post (sorry; you know the quote: “The present letter is a very long one, simply because I had no leisure to make it shorter.” – Blaise Pascal.)

As a reminder, here’s how the optimizations stack up:

Higher is better.

Background

What’s Data Oriented Design

Proponents of data-oriented design make a strong case that a performance-sensitive program needs to consider efficient data organization first, ahead of source code organization. By “efficient” they mean efficient for a modern computer to access and transform. In other words, choose data structures for how the machine will actually access them when your code transforms the data. This means placing data that will be accessed together in time (at one point in your code) together in memory, minimizing the moving and copying of data in and out of the CPU cache. Such copying can kill performance.
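To make that concrete, here is a toy sketch (not DCP code; all of the names are invented) contrasting an “object first” layout with a “data first” layout for a loop that only needs positions:

    #include <vector>

    // "Object first": hot and cold data interleaved. Looping over x drags the
    // cold name/inventory bytes through the cache along with it.
    struct GameObject {
        float x, y;            // hot: read every pass
        char  name[64];        // cold: rarely touched
        int   inventory[32];   // cold
    };

    // "Data first": the hot values are packed contiguously.
    struct Positions {
        std::vector<float> x, y;
    };

    float sum_x_objects(const std::vector<GameObject>& objects) {
        float sum = 0;
        for (const auto& o : objects) sum += o.x;   // each 4 useful bytes arrives inside a ~200-byte object
        return sum;
    }

    float sum_x_packed(const Positions& p) {
        float sum = 0;
        for (float x : p.x) sum += x;               // contiguous floats, far fewer cache misses
        return sum;
    }

The second loop reads only the data it needs; that is the whole argument in miniature.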

Data-oriented design turns object-oriented design on its head: where OO emphasizes data hiding and encapsulation, DoD foregrounds the data, structuring code around it and its common access patterns.

For an existing mature application, realistically you won’t want to re-write all your code when most of it performs adequately already. Instead, you want to analyze your data flows, identify bottlenecks and optimize them aggressively. The most aggressive optimization will take the form of data oriented design.

An object-oriented approach to organizing code and data often does not yield an efficient organization of data in memory at runtime. Still, in most cases the object model’s strengths may seem to more than make up for slower speed outside the most performance-critical areas of an application. However, DoD proponents point out that, on top of better performance, data-oriented design side-steps some inflexibilities inherent in object-oriented designs and leads to a more flexible architecture.

Choosing to favor the machine over humans when organizing source code may actually result in a more maintainable application, and not necessarily in less human-friendly source code.

Entity Component Systems

Data-oriented design principles, along with an Entity Component System architecture (the two are not the same, but they are complementary), neatly resolve some common OOP design conundrums at a more abstract level. Any time you run into a class hierarchy inflexibility – one where multiple inheritance or several levels of hierarchy might look like the solution (don’t do it!) – the problem neatly resolves itself with an ECS approach.

An “Entity Component System” allows for a flexible, highly extensible style of application. Some changes to an OO-style application could necessitate painful refactoring if class hierarchies must change but require trivial to no changes if employing an ECS architecture.

What in an O-O design would need to be modeled in code – at best generated at compile time – can be defined and changed at start-up time or even during runtime with ECS. That allows designers and end users, rather than developers, to change important application behaviors. And, a data-first design tends to make implementing parallelism and serialization easier. In depth discussion here.

Think of data-oriented design, assisted by an ECS, as changing the modeling of your problem from an object model to a relational model. As in a relational database, data of the same type (the “components”) lives together in the same data structure. You may end up denormalizing data to accommodate common access patterns, just as you do with relational databases. Design purity isn’t the goal; you apply data-oriented design principles to the data components in an ECS architecture to maximize performance.

The Entity Component System approach builds on the relational nature of a data-oriented design, elevating what I’d consider merely a technique named data-oriented design into a nearly full-blown programming paradigm.

“Entities” tie “components” (data) together; “systems” act on the data. “Entities” serve the same purpose as primary and foreign keys in a database, and “systems” are analogous to queries. (Don’t take the analogy too far.) Ideally, “systems” are totally decoupled from each other.

In an ECS, entities are collections of components tied together by their identity, nothing more; they are pure composition, no inheritance or definition of their own.

In OO you might have an “Animal” class with dogs, cats, and birds as child classes. Now suppose you have a game where a magic spell gives wings to cats or dogs. Do you make “Flying Cat” and “Flying Dog” classes, or have all Animal descendants carry latent wings and flight methods, or what? Do you need a language that supports traits?

In ECS you simply add a wing component to the specific cat or dog subject to the spell. The “fly” system will update all wing components; it need not know they belong to cats or dogs or birds. When the spell wears off, delete the wing component and the dog will stop flying.
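Here is a toy sketch of that idea (not modeled on any particular ECS framework; every name below is invented for illustration):

    #include <cstddef>
    #include <vector>

    struct Position { float x = 0, y = 0; };
    struct Wings    { float lift = 1.0f; };

    // An entity is just an index; components are plain data in per-component arrays.
    struct World {
        std::vector<Position> positions;
        std::vector<Wings>    wings;
        std::vector<bool>     has_wings;   // which entities currently carry a wing component

        std::size_t create_entity() {
            positions.emplace_back();
            wings.emplace_back();
            has_wings.push_back(false);
            return positions.size() - 1;
        }
    };

    // The "fly" system updates every winged entity; it never asks whether that
    // entity is a cat, a dog, or a bird.
    void fly_system(World& w, float dt) {
        for (std::size_t e = 0; e < w.positions.size(); ++e)
            if (w.has_wings[e]) w.positions[e].y += w.wings[e].lift * dt;
    }

    int main() {
        World w;
        std::size_t cat = w.create_entity();
        w.has_wings[cat] = true;    // the spell: attach a wing component
        fly_system(w, 0.016f);
        w.has_wings[cat] = false;   // the spell wears off: the cat stops flying
    }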

There are many ECS frameworks written in C++ and JavaScript. A truly new paradigm probably deserves its own language, or at least language extensions.

Profile Data Oriented Design

As a baseline, here’s a flame graph profile of one CPS dataset before applying the optimizations just discussed, but after adding the faster hash tables. The IPUMS CPS data makes a nice test subject: it has some significant data edits and a large number of variables, but few enough records that complete test datasets run in a few minutes.

A profile with new hash tables and some basic algorithms in data editing improved, but without touching the core of the processing engine

At this point, editing rule processing and other work fielded by the core data processing engine in DCP dominates the flame graph. Nothing stands out at first glance as too slow.

Where to go from here? Considering that, like in many games, our job is to transform lots of data quickly, perhaps thinking in a data-oriented design manner as good game developers do will prove beneficial.

DoD says to put data you use at one time (in one place in your code) together in your data structures. You want to minimize cache misses by not scattering data around the heap and by avoiding indirection (pointers.)

Tracing the flow of data through the DCP application leads us to the Record class where data gets stored and transformed.

| input data | -> | Reader | -> | Records | -> | Writer | -> | IPUMS data |
                                   ^        ^
                                   |        |
                           | Recoder |   | Editor |

The “Record” class lies at the heart of the data conversion engine. Each “Record” instance holds one record’s worth of data at every stage, from source values to final output values, for each variable (column.) This way, data is transformed in place while each stage of the transformation is preserved.

All data processed by DCP passes through Record class instances, where hash tables (the “unordered_map” STL container) store the field values on each record.

Intuitively this isn’t great; maps and hash tables don’t necessarily store data contiguously, and there’s a fair amount of indirection. In addition, with hash tables you must apply a hash function on every access. Maps – a more compact way to store data that preserves key order – require traversing a red-black tree (in C++ STL maps, at any rate.)

Record data gets stored in one data structure for each data transformation stage and each type of data (floats, integers, strings.)

Knowing this, an obvious optimization would be to switch from hash tables to arrays in the core Record class for storing and accessing record data. Since we cycle through all fields to transform them, we can place the fields in the order they will be accessed, in keeping with data-oriented design principles. But this is just speculation so far.

Profiling With Valgrind

Before dismantling the core of the DCP application I needed to do some profiling; perhaps there were bigger bottlenecks to address first. We already saw a profile based on function execution times; what we would like to know is which of those functions may be unnecessarily slow. One clue is cache misses: a lot of code that incurs cache misses can be restructured to be cache efficient.

Running DCP through Valgrind with the “cachegrind” tool confirms that the place in our code with the highest absolute number of data reading cache misses is the “Record::getLongData(int)” method. That is the method that retrieves long integer data for use in the data editing API and in the output formatter.

For more information on reading the results presented here, see The Cachegrind manual.

You run “Valgrind” with the “cachegrind” tool, which writes its results to a “cachegrind.out.<pid>” file, and then annotate that output. Here’s the annotation step, showing and sorting by data reads and cache misses:

	$ cg_annotate --show=Dr,D1mr,DLmr --sort=Dr,D1mr,DLmr cachegrind.out.26042 >   usa_dcp.cg_data_slow

Here’s the relevant slice of the Valgrind output:

--------------------------------------------------------------------------------
            Dr          D1mr      DLmr 
--------------------------------------------------------------------------------
51,971,470,816 1,542,948,456 70,639,186  PROGRAM TOTALS

--------------------------------------------------------------------------------
           Dr        D1mr    DLmr  file:function
--------------------------------------------------------------------------------

617,263,824  48,814,189          0  /home/ccd/dcp-merge/code/data/Record.cpp:Record::getLongData(int)

The getLongData() method is the first function in the listing that isn’t part of the standard library; no other user source code incurs more cache misses. The total number of misses is rather low because I profiled a very small dataset, since “Valgrind” is extremely slow.

By far the predominant data type in IPUMS data is the long integer, so this makes sense and is the case we should optimize for. getLongData() is the method used to move data out of the core data processing engine.

The initial storage in the “Record” class was built with “unordered_map” – one per data type (float, string, and integer), repeated for each stage of transformation (source, recoded, edited.) We also track flags for each variable to show how its data was modified (or not, as the case may be), and store those flag values in “unordered_map” as well.

	unordered_map<int,string> sourceData;          // source values, keyed by variable ID
	unordered_map<int,int64_t> recodedIntData;     // recoded stage, one map per data type
	unordered_map<int,string> recodedStringData;
	unordered_map<int, double> recodedFloatData;
	unordered_map<int,int64_t> editedIntData;      // edited stage, one map per data type
	unordered_map<int,string> editedStringData;
	unordered_map<int, double> editedFloatData;
	unordered_map<int,int> flags;                  // how (or whether) each variable was modified
	unordered_map<int,bool> finalized;

getLongData() reads from the recodedIntData hash table. The keys in all these hash tables are variable IDs (think column numbers.) The total number of variables varies wildly, from a few dozen to tens of thousands, depending on the data product and dataset.

Clearly, processing a single “data item” – one variable for one record – takes at least five insertions and several lookups in these hash tables. We’re already doing the obvious things, like reserving the known required memory up front, but that doesn’t help with cache misses and (relatively) slow hash functions. We would expect a large speed increase from simply replacing the hash tables with arrays or vectors, making lookup time constant and very small.

Replacing the hash tables with vectors indexed by variable ID would be a huge improvement, but each stage of the processing would still require individual look-ups, jumping from one vector or array to another as data makes its way from “source” to “edited” stages, and accessing different vectors depending on the current variable’s data type.

A data-oriented design would have us place these stages and data types of a “data item” together in one structure. This certainly makes intuitive sense if you think about reducing CPU cache misses.

How Far to Dive Into Data-Oriented Refactoring?

So is it really worthwhile to combine all stages into one vector somehow, rather than using one vector per data transformation stage? I wanted to find out how much benefit could be gained by simply replacing hash tables with vectors, and how much more could be gained by combining all the data into one structure and a single vector.

To test my guess, I made a simplified model of the core data transformation engine (hopefully not too simple.) It shows the difference between grouping all data stages together in one structure (a vector of structs), keeping them in separate structures (vectors of integers), and keeping them in separate hash tables, as the DCP Record class originally did. (A sketch of the model’s shape appears after the timings below.)

Times in Milliseconds:

Data in one vector:       1867
Data in separate vectors: 3000
Data in hash tables:      9394

As I expected, switching to constant-time lookups helps a lot (“Data in separate vectors”.) And it seems that if I’m already switching out the core data structures, I might as well store the data contiguously in the most common access order for some additional performance (“Data in one vector”.)
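For the curious, the model was roughly shaped like the sketch below (this is an illustration, not the exact benchmark I ran; the record and variable counts are arbitrary):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Item { int64_t recoded; int64_t edited; int8_t flag; };

    // Time N passes over V "variables" with a trivial transform, for a given layout.
    template <typename Body>
    double time_ms(int n, int v, Body body) {
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < n; ++r)
            for (int i = 0; i < v; ++i) body(i);
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const int N = 20000, V = 500;

        std::vector<Item> combined(V);                // all stages in one struct
        std::vector<int64_t> recoded(V), edited(V);   // one vector per stage
        std::vector<int8_t> flags(V);
        std::unordered_map<int, int64_t> h_recoded, h_edited;
        std::unordered_map<int, int8_t> h_flags;

        double one_vec = time_ms(N, V, [&](int i) {
            combined[i].edited = combined[i].recoded + 1; combined[i].flag = 1; });
        double sep_vec = time_ms(N, V, [&](int i) {
            edited[i] = recoded[i] + 1; flags[i] = 1; });
        double hashed  = time_ms(N, V, [&](int i) {
            h_edited[i] = h_recoded[i] + 1; h_flags[i] = 1; });

        std::printf("one vector: %.0f ms, separate vectors: %.0f ms, hash tables: %.0f ms\n",
            one_vec, sep_vec, hashed);
    }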

To prove the case we’ll have to build an alternative code path in the DCP data handling code, making use of a vector of DataItem structs, and do a head-to-head comparison, both on raw performance and with a flame graph profile. Given the results of the model, however, it seems worth pursuing the full data-oriented refactoring of the “Record” class data storage.

The new storage in the Record class will look something like:

	union ipums_value_type {
		long double double_;
		int64_t     integer_;
		char *      string_;    // points at string data stored elsewhere
	};

	struct DataItem {
		ipums_value_type recoded_value;
		ipums_value_type edited_value;
		int8_t flag;
		bool finalized;
		bool not_recoded;
		bool not_edited;
		bool not_visited;
	};

With the above data structures we can store all data needed for transforming source data to its final edited version for a single case (“data item.”)

NOTE: It’s important to know that string-type data is comparatively rare in our products, and when we encounter it we often do nothing but pass it through without changes. The string type holds a pointer to a char array rather than the string data itself, so heavy access of string-type data would incur a lot of cache line misses; this is a compromise for the sake of code simplicity.

Also, you might expect that if we really wanted to cram the data into a smaller space we’d combine the boolean fields into a bit-field-based store of booleans. That won’t save us much space, however, because of the “stride alignment” of this struct: once the struct exceeded 32 bytes it required a minimum of 48 bytes, so there was plenty of room for all the booleans we needed without resorting to bit-packing. The “stride” is set by the largest member in the struct; in our case that’s the 16 bytes of ipums_value_type (the long double really eats space, but unfortunately it’s necessary.) For all you ever wanted to know about struct packing and more, see The Lost Art of Structure Packing.
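If you want to check those numbers, a standalone sketch like this (assuming a typical x86-64 Linux build, where long double occupies 16 bytes with 16-byte alignment) prints the sizes:

    #include <cstdint>
    #include <cstdio>

    union ipums_value_type {
        long double double_;    // 16 bytes, 16-byte alignment on x86-64
        int64_t     integer_;
        char *      string_;
    };

    struct DataItem {
        ipums_value_type recoded_value;   // 16 bytes
        ipums_value_type edited_value;    // 16 bytes
        int8_t flag;                      // 5 bytes of flags...
        bool finalized, not_recoded, not_edited, not_visited;
    };                                    // ...padded out to the 16-byte stride: 48 bytes total

    int main() {
        std::printf("union: %zu  DataItem: %zu  alignment: %zu\n",
                    sizeof(ipums_value_type), sizeof(DataItem), alignof(DataItem));
        // Expected on x86-64 Linux: union: 16  DataItem: 48  alignment: 16
    }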

The records hold data for IPUMS “variables” – “variables” in the sense of a scientific dataset. The transformation of a record’s source data into data ready for final edits is done (or can be done) in variable-ID order (think column numbers.) A vector of these DataItem structs can hold a complete record, providing fast, constant-time look-up indexed by variable ID, with the data stored contiguously in memory.

	vector<DataItem> data_items = vector<DataItem>(MAX_VARIABLES);
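To illustrate the access pattern this enables, here is a hypothetical transform loop (the function name and flag handling are made up, not the actual DCP engine code): it walks a single contiguous vector in variable-ID order, touching everything a data item needs in one place.

    // Illustrative only: not the real DCP transform loop.
    void finalize_record(std::vector<DataItem>& data_items) {
        for (std::size_t var_id = 0; var_id < data_items.size(); ++var_id) {
            DataItem& item = data_items[var_id];
            if (item.not_recoded || item.finalized)
                continue;                            // nothing to do for this variable
            item.edited_value = item.recoded_value;  // default: carry the recode forward
            item.not_edited = false;
            item.finalized  = true;
        }
    }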

Results

After implementing the changes discussed above, the flame graph profile of the same CPS dataset has definitely gotten thinner:

Same Dataset as in the earlier profile running on the same version of DCP except with the Data Oriented Design improvements added

Certainly the code in “Record::prepare_all()” takes up less horizontal space, so we’re at least on the right track. Since flame graphs only show relative time spent, it’s hard to notice when a change has small effects across many parts of the program. In this case the editing rules section also got faster, but rather evenly across all its parts, so no one change stands out. The final performance chart and the concluding test runs will show the absolute performance gains.

As described in the previous post, before making this change to the core data storage in the Record class, I had swapped out the standard “unordered_map” for a more efficient hash table implementation everywhere I could, excluding the data storage described above (I tried it there as a quick experiment; it didn’t help much.) The hash table substitution and the DoD optimization are separate changes and don’t affect each other much, if at all.

Running some representative datasets from different data products and timing data processing rates before and after confirms the huge time savings:

dataset        Better Hashtables    Cache Efficient
               (data items/sec)     (data items/sec)
us2015b             753,480            1,265,846
gh1988ir            704,218            1,708,659
bj2002a             319,014              611,895
cps2016_01s         680,882            1,115,662

In the chart below, the red portion indicates the speed in data items per second before either the “Better Hashtables” or “DoD” optimization was introduced. Each bar measures transformation rates for four data products, one dataset selected from each.

The orange level measures the performance gained by switching out the standard “unordered_map” everywhere possible for “ska::flat_hash_map” [I built the fastest hash table!]. This level basically shows how much performance can be squeezed out of DCP by making every part of the code that fundamentally requires a hash table of some sort as efficient as possible. As you can see, you can definitely claw back some time by using a better hash table implementation.

The blue level takes the gains from the orange level and adds the improvement from changing the data storage in Record to a vector of DataItem structs. This is where the optimization really pays off. All data lookups are now done with something like:

	data_items[var_id].edited_value.integer_

These lookups happen during the “editing rules” section, fairly evenly distributed.
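For comparison with the earlier profile, here is roughly what a getLongData()-style accessor reduces to under the new layout (a sketch only; the real method’s signature and the stage it reads from may differ):

    // Sketch: a constant-time indexed read, no hashing, no tree traversal.
    int64_t Record::getLongData(int var_id) {
        return data_items[var_id].edited_value.integer_;
    }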

So how do the cache misses look on that getLongData() call? Running cachegrind again:

    ccd@gp1:~$ cg_annotate --show=Dr,D1mr,DLmr --sort=Dr,D1mr,DLmr cachegrind.out.147086>   usa_dcp.cg_data
	

The getLongData() method is now much lower in the list of functions sorted by cache misses. Slicing out the same getLongData() function…


--------------------------------------------------------------------------------
            Dr          D1mr      DLmr 
--------------------------------------------------------------------------------
30,606,890,190 1,554,283,095 2,271,460  PROGRAM TOTALS

--------------------------------------------------------------------------------
           Dr        D1mr    DLmr  file:function
--------------------------------------------------------------------------------


208,961,200   8,195,478       0  /home/ccd/dcp-tmp/code/data/Record.cpp:Record::getLongData(int)

Conclusion: The Big Picture

Looking at the Valgrind cache miss profile led to scrutinizing the data storage in the Record class and doing the most obvious thing about it. The speed increase was impressive.

We’ve seen four hopefully representative datasets, one from each of four important IPUMS products.

It would be nice to know how well the gains are distributed across all datasets in the product, which is what ultimately matters the most. We often want to produce an entire set of datasets at once.

The simplest way to measure the improvements in the real world is to reproduce all the data with a version of DCP that has the experimental optimization and with one that doesn’t, and sum up the running times for each dataset.

CPS: 29 hours standard, 19 hours with DoD optimization
DHS: 138 hours standard, 41 hours with DoD optimization
NHIS: 15.3 hours standard, 10.7 hours with DoD optimization
ACS USA: 43.6 hours standard, 15.3 hours with DoD optimization

I couldn’t give good numbers on the International product because it’s being actively worked on.