Apache Spark, Python and Pandas, columnar data formats, migrating away from Excel: it’s essential to get familiar with these topics if you’re beginning to grapple with challenging amounts of data. Before jumping to the conclusion that “The Cloud” is the only next step, and getting lost in studying all the services out there, consider what you can do on your own laptop. Good data engineering will take you far. This post is an abbreviated version of three detailed articles I posted last year on the ISRDI Tech Blog.
- Part 1: Moving away from Spreadsheets
- Part 2: Columnar format with Parquet: using Python or C++ saves huge time and space.
- Part 3: Parallel processing with Apache Spark
Introduction
If you’re running into slow access to your data, or even running out of space, you may need to rethink your choice of tools and data formats. The average size and speed of storage keeps improving, but so does the amount of data generated by business processes and scientific studies. What’s not growing so fast: computer processor speeds (in fact, recently performance has in some cases actually gone backwards).
Depending on your volume of data and the sophistication of the software you use, you may find your poor computer:
- Spending lots of time loading enormous spreadsheets and running macros and formulas, while you spend effort organizing and versioning those spreadsheets.
- Waiting while data loads into Postgres, MySQL or SQL Server, then waiting again for slow analytic queries and reports.
- Dealing with text data in CSV, JSON or other custom formats, and running out of space. Need a bigger hard drive?
Solutions include:
- Bring better tools to your data: Don’t just hack something together with Excel; find a way to query your data with SQL.
- Change your data format: Columnar formats allow faster analytic queries and save lots of disk space.
- Parallelize work: Utilize the multiple cores on your machine; a modern machine will have at least four.
Let’s look at three tools to make your data life better. There’s nothing to lose, except excuses for coffee breaks.
Easier CSV: Get out of Excel, Use SQL
If you’re dealing with CSV data by importing it into Excel, consider applying query tools like q (“Text as Data”) or csvkit directly to the CSV files. Here’s how we use csvkit to make IPUMS data. If, on the other hand, your data starts out as spreadsheets of generic tabular data, you can save it as CSV and use the “q” utility on it.
The “q” tool allows SQL queries directly against CSV data as if it were a database table. If you’re importing into Excel, maybe you can just skip that step and write a query. You could import the CSV data into any number of databases or read it with Python’s Pandas or even Apache Spark; but if you already know the query you’d like to run, and the data is of reasonable size, those tools may be overkill. The “q” utility lets you query CSV directly. You can even join two or more “tables” (files).
Installing “q” is extremely simple, and it’s free. If you know the UNIX command line, you can combine “q” with other command-line tools, because its default output is plain text that other text-based UNIX utilities can consume.
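To give a flavor, here are a couple of hypothetical invocations; the file and column names are made up, -H tells “q” that the first row is a header, and -d sets the delimiter:

```
# Count rows per city directly against a CSV file.
q -H -d , "SELECT city, COUNT(*) FROM ./sales.csv GROUP BY city"

# Join two CSV "tables" and aggregate, no database required.
q -H -d , "SELECT r.region, SUM(s.amount) FROM ./sales.csv s JOIN ./regions.csv r ON (s.city = r.city) GROUP BY r.region"
```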
Save Huge Space and Time with Parquet Files
The “q” utility is great, but once your data grows large enough you’ll quickly exceed what it can do in a reasonable time. What’s the next step up?
Parquet is a columnar data format used in Apache “Big Data” products like Spark, Drill and many others. It compresses to impressively high ratios while enabling super fast analytic queries. In many cases you win twice: huge storage savings and faster queries.
Parquet can represent a flat schema as well as nested record structures. My example here, the current Python libraries, and many other tools don’t support nesting yet (though Apache Spark and Drill do), but this is an area of rapid development. For now, think of Parquet files as extremely high-performance, strongly typed versions of CSV files.
Thanks to the parquet-cpp project, you don’t need to set up a “Big Data” execution framework to use Parquet. To get the most out of Parquet with limited resources, C++ is required; but parquet-cpp is also tightly integrated with the “PyArrow” project for Python, and there’s a pure Go implementation too. The Java Parquet libraries can be used if you have the Spark libraries and just import the Parquet-specific packages.
You can convert your data to Parquet format with your own C++, Java or Go code or use the PyArrow library (built on top of the “parquet-cpp” project) from Python or from within Apache Spark or Drill.
I have written a stand-alone tool in C++ for converting either CSV or fixed-width data (one of the IPUMS data formats) to Parquet format. It allows some conversions that would be difficult using Spark or PyArrow, as well as being super easy to deploy anywhere. It’s very fast and requires little memory; you can easily run it on your laptop. If you don’t want to dig into C++ I recommend the combination of PyArrow and Pandas for Python.
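If you go the Python route, the conversion takes only a few lines. Here’s a minimal sketch with Pandas and PyArrow; the file name, column names and dtypes are placeholders for whatever your data actually contains:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the CSV with explicit dtypes so the resulting Parquet schema is strongly typed.
df = pd.read_csv("survey.csv", dtype={"age": "int32", "income": "float64"})

# Convert to an Arrow table and write it out with Snappy compression.
table = pa.Table.from_pandas(df)
pq.write_table(table, "survey.parquet", compression="snappy")

# Reading back only the columns you need is where the columnar layout pays off.
subset = pq.read_table("survey.parquet", columns=["age", "income"]).to_pandas()
```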
I covered how to read Parquet with C++ in depth in a previous post. Even if you never need to write a reader yourself, seeing how one works will help you understand the strengths and limitations of the Parquet format, and what the absolute bare-metal limits are.
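I won’t reproduce the C++ here, but you can poke at the same internals (file metadata, row groups and column chunks) from Python with PyArrow’s lower-level ParquetFile interface. A rough sketch, again assuming the hypothetical survey.parquet file from above:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("survey.parquet")

# The footer metadata describes the whole file: schema, row count, row groups.
print(pf.metadata.num_rows, pf.metadata.num_row_groups)
print(pf.schema_arrow)

# Each row group stores every column as a separately compressed chunk,
# which is why readers can skip columns (and whole row groups) entirely.
for rg in range(pf.metadata.num_row_groups):
    col = pf.metadata.row_group(rg).column(0)
    print(rg, col.path_in_schema, col.compression, col.total_compressed_size)

# Read a single row group, projecting just one column.
batch = pf.read_row_group(0, columns=["income"])
```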
Spark: Use Parquet, Harness all the Cores on Your System and Beyond
Apache Spark can be intimidating, but don’t worry: getting started is pretty easy. You can set it up in “local” mode with next to no configuration, and you can run it interactively with Python, Scala or even Ruby. When your data outgrows your local machine, you can move your Spark application onto a larger machine or even a cluster of machines; as long as the data is reachable from the new environment, the application should function identically, except faster.
Spark supports “Spark SQL”, so you can stick with SQL if that makes the most sense; it also provides other ways to manipulate data, through RDDs (Resilient Distributed Datasets) and user-defined functions callable from Spark SQL. Spark imposes a bit of overhead because it needs to coordinate multiple nodes, but in “local” mode this overhead is minimal. By the time you need to reach for a Spark cluster, the overhead should not matter given the size of the data, except when sub-second response times are critical to your application.
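To make that concrete, here’s a minimal local-mode sketch in Python; the Parquet file and column names are the same hypothetical ones used above:

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark in-process, using all the cores on this machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("parquet-demo")
         .getOrCreate())

# Register the Parquet file as a table and query it with plain SQL.
df = spark.read.parquet("survey.parquet")
df.createOrReplaceTempView("survey")

spark.sql("""
    SELECT age, AVG(income) AS avg_income
    FROM survey
    GROUP BY age
    ORDER BY age
""").show()

spark.stop()
```

Point the same script at a cluster master and a shared filesystem and it runs unchanged, just faster.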
Spark was written in Scala, a JVM language, and provides a Java API in addition to the Scala API. There’s also an extensive Python API, though a few things are missing from it. I used JRuby to create a small interactive Spark environment; JRuby can use Java classes directly, so all I had to do was instantiate Spark’s Java classes in JRuby and drive the Java API from Ruby’s “irb” REPL. I demonstrated this in more detail in my post on the ISRDI IPUMS Tech Blog.
Beyond Parquet and Spark
There’s one more type of software out there which I’d consider the expert’s power tool for high-speed analytic data on a single machine: the “Q” and “K” languages and the KDB+ ecosystem, along with the open source “J” and JDB variant.
The “Q” language is a humane coating of syntactic sugar over the “K” language; “J” is similar to “K”. Both are descendants of APL. KDB+ and JDB are column-oriented databases; unlike the Parquet format, they allow very fast appends and inserts while keeping very fast query times and compact size. JDB supports a fairly friendly query language, though it’s not SQL. The J and K languages themselves are, like APL, extremely terse, with single ASCII symbols used as keywords.
If you need the absolute maximum query speed against a constantly updated database, you need “K” and KDB+. There’s a free non-commercial 32-bit version; the “J” and JDB combination is free, but it doesn’t scale as well.