Early last year IPUMS moved production of IPUMS-International micro-data to the latest version of the core DCP and a new data editing API. In doing so we discovered a number of places where the new API – while performing better than the old one on our USA and CPS test datasets – performed worse than expected on some of the IPUMSI datasets. Not a big deal except for a few datasets that took twenty or thirty times longer to process than we would expect.

What could be the cause?

For Easy Wins, Start With a Flame Graph

Rather than thinking hard, I made a flame graph from profiles of two super-slow datasets, then systematically attacked the widest areas of the graphs, similar to the process described in my flame-graph introduction.

It took a while to come up with a useful flame graph because the bad behavior was hard to capture.

When exploring the slowness with one of the slow datasets, I discovered that most samples of “perf” data didn’t generate flame graphs that varied much from the baseline. Yet running through the whole dataset took nearly a day when I’d expect it to take an hour at most. I monitored one such run, waited for the volume of data output by DCP to drop to a trickle, and noted the region of the dataset where this occurred. Then I started DCP at that point, gathered data with “perf”, and turned it into a flame graph.
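
For reference, the capture followed the usual perf-plus-FlameGraph recipe. A sketch of the commands – the sampling rate, duration, and process lookup here are illustrative, not our exact invocation:

    # Sample on-CPU call stacks at 99 Hz from the running DCP process
    perf record -F 99 -g -p $(pgrep dcp) -- sleep 120

    # Fold the stacks and render an interactive SVG
    # (stackcollapse-perf.pl and flamegraph.pl come with the FlameGraph tools)
    perf script | ./stackcollapse-perf.pl > out.folded
    ./flamegraph.pl out.folded > dcp-flamegraph.svg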

Here’s a graph of one such problem dataset from our International data:

Slow part of Uruguay 1975


Diagnosing the Slowness

The editing rules are fairly simple; they shouldn’t account for more than ten percent of total running time.

The corrections were mostly very obvious once illuminated by the graphs and didn’t take long to implement.

Our census micro-data is grouped into “households”, collections of person records that vary in size according to the number of people in the household. Different countries define “households” slightly differently, but in general they consist of all the people living together in a housing unit – usually a nuclear or extended family, but occasionally a large institution like a barracks, prison or boarding school.

You have to know two things about this kind of data: (1) households vary in size, and (2) we like to compute values based on the entire make-up of the household, which requires reasoning over all the person records within one household group. For instance, we compute oldest and youngest child, number of children, and number of couples.
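
To make the shape of such a rule concrete, here is a minimal sketch – the Person type and its fields are invented for illustration, not the real DCP structures:

    #include <vector>

    // Hypothetical person record; real IPUMS records carry many more fields.
    struct Person {
        int age;
        bool isChildOfHead;  // stands in for the real relationship coding
    };

    // A household-wide rule: the age of the oldest child, or -1 if none.
    // Any rule like this has to scan every person record in the group.
    int oldestChildAge(const std::vector<Person>& household) {
        int oldest = -1;
        for (const Person& p : household) {
            if (p.isChildOfHead && p.age > oldest)
                oldest = p.age;
        }
        return oldest;
    }

One such pass over a household is cheap; the trouble starts when passes like this nest inside each other.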

Unsurprisingly, the flame graph revealed that user-supplied logic at the center of some loops required optimization, in the places where collections of person records were retrieved for computations on them. We had places where the collections were copied instead of moved or referenced (the syntax looks nearly the same). Also, some values were computed once per person rather than computed once and stored.
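
This trap is easy to fall into in modern C++ because a single character of syntax separates a copy from a reference. A minimal illustration, assuming a getter that returns a reference to an internal vector (the names here are mine, not the actual API):

    #include <vector>

    struct Person { int age; };

    struct Household {
        std::vector<Person> people;
        const std::vector<Person>& getPeople() const { return people; }
    };

    long sumAgesSlow(const Household& hh) {
        auto people = hh.getPeople();   // oops: copies the whole vector
        long total = 0;
        for (auto p : people)           // oops: copies each Person as well
            total += p.age;
        return total;
    }

    long sumAgesFast(const Household& hh) {
        const auto& people = hh.getPeople();  // a reference; nothing is copied
        long total = 0;
        for (const auto& p : people)
            total += p.age;
        return total;
    }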

But none of this could account for the extreme slowness.

Unlike previous sets of international data, and unlike our test data, some of our newest data contained very large collections of person records. In most datasets the upper limit is 30 or 60 people, but data that isn’t a sample but a complete collection of census records has large groups of hundreds of people in a single enumerated “household.” The Uruguay 1975 and 1985 datasets were both full counts of a census with some full-sized institutional “households.” When creating a sample of a full census we also sample the very large households.

It turned out we had a couple of O(n^3) API functions that were only trivially slower at under 100 people per household but terrible at larger sizes – going from a 60-person household to a 500-person one multiplies an O(n^3) cost by a factor of nearly 600. These were easily rewritten to an O(n) growth rate. The user-supplied logic was also streamlined to memoize repeated computations and untangle extra looping constructs. In the end there are a few unavoidable O(n^2) functions in the user logic.

The slow user-defined logic is in the isPossibleParent() function and the posPop() function that calls it. posPop() determines which person, if any, is the father of each other person in a given household, so it is unavoidably an O(n^2) algorithm, where n is the number of people in the household: for every person, we have to call isPossibleParent() on every other person. The key, then, is to ensure isPossibleParent() is as fast as possible.
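
Schematically the structure looks like the sketch below – the function names come from our code, but the record type and the bodies are simplified stand-ins:

    #include <cstddef>
    #include <vector>

    struct Person { int age; int sex; };  // simplified stand-in record

    // Could hh[parent] plausibly be the father of hh[child]?
    // The real rule checks sex, age difference, marital status, and more.
    bool isPossibleParent(const std::vector<Person>& hh,
                          std::size_t parent, std::size_t child);

    // For every person, the position of a plausible father, or -1 for none.
    // The nested loops make this inherently O(n^2) in household size n.
    std::vector<int> posPop(const std::vector<Person>& hh) {
        std::vector<int> father(hh.size(), -1);
        for (std::size_t i = 0; i < hh.size(); ++i) {
            for (std::size_t j = 0; j < hh.size(); ++j) {
                if (i != j && isPossibleParent(hh, j, i)) {
                    father[i] = static_cast<int>(j);
                    break;  // take the first plausible candidate
                }
            }
        }
        return father;
    }

Since the real work happens inside isPossibleParent(), every cycle saved there is multiplied by roughly n^2 calls.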

The code is too lengthy to show here, but the analysis showed a bunch of intermediate values in isPossibleParent() that were computed on demand instead of computed once and stored.

The other issue that came up was that the new API for accessing fields on records (AGE(n), for instance) was very inefficient when accessing data off of adjacent records (reading off the “current” record was pretty well optimized already). The logic in isPossibleParent() made heavy use of ageDiff(i, j), a function that computed the age difference between the people at positions i and j in the current household. Clearly, for any pair of people this value should be computed only once, but it was getting computed many times in loops, and the calls to AGE(i) - AGE(j) were made far more often than needed. Moreover, the calls to read the age values off the records of persons i and j were much slower than they needed to be.
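
The fix amounts to reading each age out of the record buffer exactly once per household and reusing it. A sketch, using the AGE() accessor named above (the caching scheme itself is my illustration, not the code as shipped):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    int AGE(std::size_t i);  // the field accessor, declared by the data editing API

    // Read every age exactly once, up front, for the current household.
    std::vector<int> cacheAges(std::size_t householdSize) {
        std::vector<int> ages(householdSize);
        for (std::size_t i = 0; i < householdSize; ++i)
            ages[i] = AGE(i);  // the only record access for person i
        return ages;
    }

    // With the cache, an age difference is two array reads and a subtraction
    // instead of two trips through the record-access machinery.
    inline int ageDiff(const std::vector<int>& ages, std::size_t i, std::size_t j) {
        return std::abs(ages[i] - ages[j]);
    }

Caching the raw ages also makes memoizing the pairwise differences themselves unnecessary; the subtraction is cheaper than any lookup table.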

The change to the new API alone didn’t explain all of the dramatic slow-down. The slow-down coincided with the addition of the full-count data, which exposed the inefficient user code – which hadn’t changed in years – as well as the newly inefficient API functions. None of the faults described previously added much to processing time when households were capped at sixty people and averaged five or six.

So, simply (1) memoizing some values in isPossibleParent() – especially results of ageDiff() – and (2) speeding up the data-access API for data off of nearby records took care of all the slowness shown in the flame graph.

Those improvements reduced the running time on the problematic datasets nearly to the rate of the rest of the datasets in the International product.

Same Region of Uruguay 1975 data after optimizing API and user code


How could this have happened? The “classic” version of the application (before modern C++ and the API re-write) didn’t have this performance bug.

The new code has built-in unit tests, but those don’t test performance. The standard test datasets didn’t include any of the extremely large households, since in the past those had never posed a performance problem (they required a bit too much memory, but the new application design greatly improved that aspect). We hadn’t included very large households in the international test data, however, so some algorithms written by researchers specifically for the international data never got exercised at those sizes.

Measuring the Improved Code

Overall running time on the International IPUMS data after the corrections dropped by nearly sixty percent: total CPU hours to transform all IPUMSI data went from 330 hours to 138 hours. (To be clear, this is done in parallel on our small cluster; wall time is something like four to five hours.)

The USA product – because it includes some historical data with large households – went from nine-hundred-some hours to five hundred thirty or so, still requiring more than a day to process on our cluster. All the other data products benefited from the same optimizations, though not nearly as much.