Optimization Part II: Targeted Optimizations Assisted by Flame Graphing

Early last year, IPUMS moved production of IPUMS-International micro-data to the latest version of the core DCP and a new data editing API. In doing so, we discovered a number of places where the new API – while performing better than the old one on our USA and CPS test datasets – performed worse than expected on some of the IPUMSI datasets. That was not a big deal, except for a few datasets that took twenty or thirty times longer to process than we would expect. [Read More]

Optimizing a Data-Intensive C++ Application, Part I

At IPUMS we continuously enhance our data products with newly available datasets, adding new variables and improvements to existing variables. We do this with the “Data Conversion Program”, a C++ application built to transform census and survey data into “harmonized” micro-data. When you visit ipums.org and make data extracts, you’re downloading data developed with the DCP. [Read More]

Save the USPS

The U.S. Postal Service is required to fund itself by charging for services, like a private business. Since the beginning of the COVID-19 outbreak, mail volume has dropped by more than half, severely undercutting the agency's budget. [Read More]

SF Worth Reading

The book recommendations list has moved to sfworthreading.com. It’s a static site built with Jekyll and a Ruby “updater” script I wrote to do some busy work for me that Jekyll won’t, like building a custom authors index. [Read More]

Large Data on a Laptop: Tools and Strategies

Apache Spark, Python and Pandas, columnar data formats, migrating away from Excel: it’s essential you get familiar with these topics if you’re beginning to grapple with challenging amounts of data. Before jumping to the conclusion that “The Cloud” is the only next step – and getting lost in studying all the services out there – consider what you can do on your own laptop. Good data engineering will take you far. This post is an abbreviated version of three detailed articles I posted last year on the ISRDI Tech Blog. [Read More]