Early last year IPUMS moved production of IPUMS-International micro-data to the latest version of the core DCP and a new data editing API. In doing so we discovered a number of places where the new API, while performing better than the old one on our USA and CPS test datasets, performed worse than expected on some of the IPUMSI datasets. This was not a big deal except for a few datasets that took twenty or thirty times longer to process than we would expect.
[Read More]
Optimizing a Data-Intensive C++ Application, Part I
At IPUMS we continuously enhance our data products with newly available datasets, adding new variables and improvements to existing variables. We do this with the “Data Conversion Program”, a C++ application built to transform census and survey data into “harmonized” micro-data. When you visit ipums.org and make data extracts, you’re downloading data developed with the DCP.
[Read More]
Python 3 Language Notes
Notes on the Python 3 Language
[Read More]
Reparations
Introduction
[Read More]
Save the USPS
The U.S. Postal Service is required to fund itself by charging for services like a private business. Since the beginning of the COVID-19 outbreak mail volume has dropped by more than half, severely undercutting its budget.
[Read More]
The Parquet Data Format Landscape
As you begin to handle Parquet data with tools in more than one framework and language you’ll probably wonder how all these related pieces fit together. Here is a summary of data formats, libraries and frameworks you will encounter when working with Parquet data and Spark.
[Read More]
SF Worth Reading
The book recommendations list has moved to sfworthreading.com. It’s a static site built with Jekyll and a Ruby “updater” script I wrote to do some busy work for me that Jekyll won’t, like building a custom authors index.
[Read More]
Markdown Syntax Highlighting With Notepad++
Changing the default Notepad++ theme doesn’t change most of the colors in a Markdown document. This is especially apparent when using a dark-mode Notepad++ style and dark theme in Windows. You have to manually edit a special Markdown theme to change most of the colors and fonts.
[Read More]
Read the Report
Following the release of Special Counsel Robert Mueller’s report this spring, we’ve heard lots of interpretations of the report that can’t survive even a casual reading of it.
[Read More]
Large Data on a Laptop: Tools and Strategies
Apache Spark, Python and Pandas, columnar data formats, migrating away from Excel: it’s essential to get familiar with these topics if you’re beginning to grapple with challenging amounts of data. Before jumping to the conclusion that “The Cloud” is the only next step, and getting lost in studying all the services out there, consider what you can do on your own laptop. Good data engineering will take you far. This post is an abbreviated version of three detailed articles I posted last year on the ISRDI Tech Blog.
[Read More]