Effectively using Rust to access data in the Parquet format isn’t too difficult, but more detailed examples than those in the official documentation would really help get people started. In this article I’ll present some sample code to fill that gap.

Introduction

The Rust Arrow library arrow-rs has recently become a first-class project outside the main Arrow project.

The Arrow and Parquet projects have undergone a lot of change over the last few years. In 2018 I made a post about working with Parquet data in C++ and within a few months parts of it were out of date. Much of the change directly addressed things I found painful in the build process and with the API.

So, it wasn’t surprising that when I first checked out Rust support for Parquet and Arrow last winter it looked promising but seemed incomplete. Fast-forward to this summer and the arrow-rs project is ready to learn and use! My first real development project required a few key features I’ll outline in this post. The project was to allow the Stata statistical software to read Parquet data. While that code has too many Stata-specific quirks to make it easy to present as example code, the main points shown here were what made it possible.

In this post I’ll give example code to accomplish each of the following:

  • Printing the schema, including logical and physical column types, column numbers and names
  • Getting Parquet file metadata statistics like the number of rows and row groups
  • Reading a subset of columns from the file (a “schema projection”), which can vastly speed up reads
  • Extracting data row by row

All these are possible and fairly easy with Rust once you’ve seen them done. The documentation has some extremely basic example code which may not be enough to get you started, especially if you’re not already familiar with the Parquet and Arrow APIs for Java or C++. To work out what to do you’ll need to read the source code of arrow-rs, but even then it helps to know where to look.

For reading Parquet data as records there’s a high-level Arrow-backed API, and there is also a low-level Parquet API. Working with schemas is similar to how it works with the Java and C++ Parquet and Arrow APIs.

The Rust and C++ APIs offer somewhat different things at the moment, but this could change soon: C++ has a high-level way to read in an entire column from a single Parquet file, spanning all row groups. The Rust API currently doesn’t appear to offer the same, though you can accomplish it with the low-level Parquet API in Rust. On the other hand, Rust has a nice serialized record batch reader you can point at a Parquet file, while C++ doesn’t make things quite so convenient. For the most part the two APIs are fairly similar. For my first Rust program reading Parquet data I didn’t need to worry about column-first access.
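
If you do need column-first access, here’s a minimal sketch of how the low-level Parquet API can read one column across all row groups. It assumes the column at index 0 is a required INT64; the function name and batch size are my own choices for illustration:

use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Collect every value of a required INT64 column (index 0) from all row groups.
fn read_i64_column(path: &str) -> Vec<i64> {
	let file = File::open(path).expect("Couldn't open parquet data");
	let reader = SerializedFileReader::new(file).unwrap();
	let mut values = Vec::new();
	for rg in 0..reader.metadata().num_row_groups() {
		let row_group = reader.get_row_group(rg).unwrap();
		if let ColumnReader::Int64ColumnReader(mut col_reader) =
			row_group.get_column_reader(0).unwrap()
		{
			let mut batch = vec![0i64; 1024];
			loop {
				// read_batch returns (values read, levels read)
				let (n, _) = col_reader
					.read_batch(1024, None, None, &mut batch)
					.unwrap();
				if n == 0 {
					break;
				}
				values.extend_from_slice(&batch[..n]);
			}
		}
	}
	values
}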

To add Parquet support to your Rust project, add this to your Cargo.toml:

[dependencies]
parquet = "5.0.0"

Version 5.0.0 of the parquet crate requires a recent Rust version; I used 1.53.0. You can update using the rustup utility.

Simple Parquet Reader Example

From the Rust documentation in the source code, you can see how to set up a reader and process records (not columns) from a Parquet file:

use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
	let file = File::open(&Path::new("/path/to/file")).unwrap();
	let reader = SerializedFileReader::new(file).unwrap();
	let mut iter = reader.get_row_iter(None).unwrap();
	while let Some(record) = iter.next() {
		println!("{}", record);
	}
}

This gets you started but leaves a lot of questions unanswered. You can dig into the arrow-rs source code and learn how to extract column values from the records; if you have an IDE like VS Code that task will be easier. However, it’s not easy to know how to use the API effectively for some essential tasks.
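
As a taste of what you’ll find, the parquet::record::RowAccessor trait provides typed getters on each record. Here’s a hypothetical sketch building on the reader from the example above; it assumes column 0 is an INT64 and column 1 holds string data:

use parquet::record::RowAccessor;

let mut iter = reader.get_row_iter(None).unwrap();
while let Some(record) = iter.next() {
	// The typed getters return a Result; they fail if the index
	// or type doesn't match the underlying column.
	let id = record.get_long(0).unwrap();
	let name = record.get_string(1).unwrap();
	println!("{}: {}", id, name);
}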

Reading the Schema

Parquet allows nested schema definitions, but I’m not going to go into that much here. To start with, you probably at least want to know how to read column information from a simple Parquet file made up of a group of columns.

To begin with, understand that Parquet files have file metadata and schema information. The file metadata holds statistics like the number of rows and row groups; the schema describes the columns in the file. A schema node can be a group or a primitive: a group node contains a list of other schema nodes, which in turn may themselves be group or primitive nodes. In the simplest (and most common) type of Parquet file, the schema is a group node holding a flat list of primitive columns. The file metadata holds the schema.
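
To make the group/primitive distinction concrete, here’s a minimal sketch that walks a schema tree recursively. The function name is my own, but is_group(), name() and get_fields() are real methods on the schema Type; you can point it at the root node returned by file_metadata().schema(), shown below:

use parquet::schema::types::Type;

// Recursively print a schema tree, marking group vs. primitive nodes.
fn describe_node(node: &Type, indent: usize) {
	let pad = "  ".repeat(indent);
	if node.is_group() {
		println!("{}group: {}", pad, node.name());
		for child in node.get_fields() {
			describe_node(child, indent + 1);
		}
	} else {
		println!("{}primitive: {}", pad, node.name());
	}
}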

File Metadata

let file = File::open(&Path::new(parquet_path)).expect("Couldn't open parquet data");
let reader = SerializedFileReader::new(file).unwrap();
let parquet_metadata = reader.metadata();
let _rows = parquet_metadata.file_metadata().num_rows();
let _row_groups = parquet_metadata.num_row_groups();

Schema

You get the schema from the file metadata:

let fields = parquet_metadata.file_metadata().schema().get_fields();

// Iterate over the fields.
// We use enumerate() so that we can get the column number along with the
// other information about the column; the column number can be used to
// access a column in a Parquet file.
for (pos, column) in fields.iter().enumerate() {
	let name = column.name();
	println!("Column {}: {}", pos, name);
}

Column Types

The arrow-rs API can label types in a more general way or as a specific Parquet physical type. Until recently the generic types were called “logical types.” These are now known as “converted types,” but the LogicalType type still “works” in the sense that it’s recognized. The actual type names have changed: check the documentation in the source code carefully and use the converted type to avoid confusion. Just Googling for “logical types Rust” will serve up old or contradictory information.
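
For example, here’s a minimal sketch of reading each column’s converted type, assuming a flat schema where fields comes from get_fields() as in the snippets above:

for column in fields.iter() {
	let converted = column.get_basic_info().converted_type();
	println!("{}: converted type {:?}", column.name(), converted);
}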

You may actually want the physical Parquet type names. These include “double”, “float”, “int64”, “int32”, “byte array”, “bool”, and others. The physical type names correspond to types a programming language might use. To make things even more confusing, in the Rust module where the physical types are defined the enum is simply named “Type”. You can alias it to something else to avoid a name clash:

use parquet::basic::Type as PhysicalType;

Here’s how you could print a flat Parquet schema. This is a complete program:

extern crate parquet;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::basic::Type as PhysicalType;
use std::fs::File;
use std::path::Path;
use std::env;
use std::process;

fn main() {
	let args: Vec<String> = env::args().collect();
	if args.len() < 2 {
		println!("Usage: {} file.parquet", &args[0]);
		process::exit(1);
	}

	let parquet_path = &args[1];
	let file = File::open(&Path::new(parquet_path)).expect("Couldn't open parquet data");
	let reader = SerializedFileReader::new(file).unwrap();
	let parquet_metadata = reader.metadata();
	let fields = parquet_metadata.file_metadata().schema().get_fields();

	for (pos, column) in fields.iter().enumerate() {
		let name = column.name();
		let p_type = column.get_physical_type();
		// Print the type names you'd need if a Rust program consumed the data.
		let output_rust_type = match p_type {
			PhysicalType::FIXED_LEN_BYTE_ARRAY => "String",
			PhysicalType::BYTE_ARRAY => "String",
			PhysicalType::INT64 => "i64",
			PhysicalType::INT32 => "i32",
			PhysicalType::FLOAT => "f32",
			PhysicalType::DOUBLE => "f64",
			_ => panic!(
				"Cannot convert this parquet file, unhandled data type for column {}",
				name),
		};
		println!("{} {} {}", pos, name, output_rust_type);
	} // for each column
}

Read a Schema Projection

Reading only part of the full schema saves time if you don’t need all the values in the rows. Not only will the rows be smaller, but the big advantage is that the Parquet reader only needs to visit the parts of the Parquet file that hold the data you want.

Imagine you need only an order number and serial number out of an “orders” table that also has eighty other columns, including a column full of very large data: a picture of the delivery location and package. The whole file of 50,000 orders might take three hundred gigabytes, but the order number and serial number columns would take up only about 781 KB (50,000 rows × two 8-byte integer columns ≈ 800,000 bytes). Since the order numbers and serial numbers are stored in contiguous blocks in the file, reading them takes no longer than reading less than a megabyte of data normally would.

To read serialized records containing only your columns of choice, you must construct a “schema projection”, which is a schema definition with only the columns you want. You pass this to the serialized reader’s get_row_iter() method. If you pass None, the reader reads the entire schema stored in the Parquet file; otherwise you would do something like:

	let mut row_iter = reader.get_row_iter(Some(schema_projection)).unwrap();

Building a Schema Projection

We’ll pass in the column names we want to extract as arguments on the command line. We want to retain the part of the Parquet schema that has columns with these same names.

// consider everything on the command line after the parquet file 
// name to be a column name
let requested_fields = &args[2..];
		
let mut selected_fields = fields.to_vec();
if !requested_fields.is_empty() {
	selected_fields.retain(|f|
		requested_fields.contains(&String::from(f.name())));
}

// Now build a schema from these selected fields:
let schema_projection = Type::group_type_builder("schema")
	.with_fields(&mut selected_fields)
	.build()
	.unwrap();

Reading Selected Data

Now that we have a schema projection we can make a reader and pass the projection to it:

use parquet::file::reader::{FileReader, SerializedFileReader};

let reader: SerializedFileReader<File> = SerializedFileReader::new(file).unwrap();
let mut row_iter = reader.get_row_iter(Some(schema_projection)).unwrap();

As we read rows, we’ll want to format the rows returned by the reader somehow. Here’s a small formatter to make a row with values separated by a delimiter. You can see how data from individual columns can be accessed:

// This is for demonstration purposes; if you have string data to format
// consider using the CSV Writer library or picking your delimiter carefully.
fn format_row(row: &parquet::record::Row, delimiter: &str) -> String {
	row.get_column_iter()
		.map(|c| c.1.to_string())
		.collect::<Vec<String>>()
		.join(delimiter)
}

The main loop to read Parquet data and turn it into records would look like this:

while let Some(record) = row_iter.next() {
	println!("{}", format_row(&record, delimiter));
}

A Working Program

Let’s finish up by putting all the pieces together. The final program can print a flat schema, provided the file consists of simple data types that could reasonably be formatted as CSV or another type of delimited text data. The same program will alternatively accept a list of column names to extract and format as CSV output, or print all the data if no column names are given.

While this isn’t much, it’s a genuinely useful program: it converts Parquet to CSV so that tools that can’t read Parquet can work with data originally stored in the Parquet format.

extern crate parquet;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::record::Row;
use parquet::schema::types::Type;
use parquet::basic::Type as PhysicalType;

use std::fs::File;
use std::path::Path;
use std::env;
use std::process;
use std::sync::Arc;


fn print_schema(fields: &[Arc<parquet::schema::types::Type>]) {
	
	for (pos, column) in fields.iter().enumerate() {
		let name = column.name();		       
		let p_type = column.get_physical_type();
		let output_rust_type = match p_type {
			PhysicalType::FIXED_LEN_BYTE_ARRAY => "String",
			PhysicalType::BYTE_ARRAY => "String",
			PhysicalType::INT64 => "i64",
			PhysicalType::INT32 => "i32",
			PhysicalType::FLOAT => "f32",
			PhysicalType::DOUBLE => "f64",
			_ => panic!(
				"Cannot convert this parquet file, unhandled data type for column {}",
				name),
		};
		println!("{} {} {}", pos, name, output_rust_type);
	} // for each column	
}


fn print_data(
	reader: &SerializedFileReader<File>,
	fields: &[Arc<parquet::schema::types::Type>],
	args: Vec<String>) {
	
	let delimiter = ",";
	let requested_fields = &args[2..];

	let mut selected_fields = fields.to_vec();
	if !requested_fields.is_empty() {
		selected_fields.retain(|f|
			requested_fields.contains(&String::from(f.name())));
	}

	let header = selected_fields.iter()
		.map(|v| v.name())
		.collect::<Vec<&str>>()
		.join(delimiter);

	let schema_projection = Type::group_type_builder("schema")
		.with_fields(&mut selected_fields)
		.build()
		.unwrap();

	let mut row_iter = reader
		.get_row_iter(Some(schema_projection)).unwrap();
	println!("{}", header);
	while let Some(record) = row_iter.next() {
		println!("{}", format_row(&record, delimiter));
	}
}

fn format_row(row: &parquet::record::Row, delimiter: &str) -> String {
	
	row.get_column_iter()
		.map(|c| c.1.to_string())
		.collect::<Vec<String>>()
		.join(delimiter)
}


fn main() {
	// Keeping argument handling extra-simple here.
	// For anything more complex consider using the
	// "clap" crate.
	let args: Vec<String> = env::args().collect();
	if args.len() < 2 {
		println!("Usage: {} file.parquet [--schema] [column-name1 column-name2 ...]",
			&args[0]);
		process::exit(1);
	}
	
	let parquet_path = &args[1];	
	let file = File::open(
		&Path::new(parquet_path))
		.expect("Couldn't open parquet data");
		
	let reader: SerializedFileReader<File> = SerializedFileReader::new(file).unwrap();
	let parquet_metadata = reader.metadata();               
	
	// Writing the type signature here, to be super 
	// clear about the return type of get_fields()
	let fields: &[Arc<parquet::schema::types::Type>] = parquet_metadata
		.file_metadata()
		.schema()
		.get_fields();              
	
	if args.len() > 2 && args[2] == "--schema" {
		print_schema(fields);
	} else {
		print_data(&reader, fields, args);
	}
}
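
If you compile this to a binary named, say, parquet2csv (a name I’m picking purely for illustration), then parquet2csv orders.parquet --schema prints the flat schema with column numbers, names and Rust-style types, while parquet2csv orders.parquet order_number serial_number writes just those two (hypothetical) columns as delimited text with a header row.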