Although the Parquet format allows extra metadata, and the C++ libraries provide a means to read and write it, the capability isn’t well documented. I’ll show some example code to clarify how to read and write key-value Parquet metadata. This advice is specific to using the C++ libraries in the Arrow project directly.
This post assumes you know what the Parquet format is and why you’d use it.
Parquet Metadata
The Parquet file format allows extra metadata to be stored along with the basic schema. This can be extremely useful for documenting datasets: data publication versions, column / variable labels, top / bottom codes and, more generally, labels for column values. Some metadata applies to particular columns, and there’s a way to add column-specific metadata, but let’s begin with how to add file-wide metadata.
Parquet can store metadata at the file level or at the individual field level. Multi-file Parquet datasets (usually a directory or directory tree of Parquet files) may have a metadata “sidecar” file which applies to them all, rather than repeating metadata within every data file. Here I’ll show how to set extra key-value metadata at the file level. This is the most basic type of extra metadata, and since keys and values are just strings it can carry whatever you need. In some cases, though, it might be better to use the sidecar file or add metadata at the field level.
Writing Extra Key-Value Metadata in C++
I made the mistake of only reading the code in arrow/cpp/examples at first and found no mention of key-value metadata on the file metadata. You need to read the source, which is a bit confusing as the parquet::KeyValueMetadata type is an alias for ::arrow::KeyValueMetadata. You can find some more examples (but not complete usage examples) in arrow/cpp/src/parquet/metadata_test.cc.
Use parquet::KeyValueMetadata::Make()
Assuming the basic set of fields is already in schema_for_parquet, the rest of the code for using the low-level Parquet API would look like this:
#include <memory>
#include <string>
#include <vector>
#include <arrow/io/file.h>
#include <parquet/api/writer.h>

// FileClass is assumed to be an alias for ::arrow::io::FileOutputStream.
std::shared_ptr<FileClass> out_file;
PARQUET_ASSIGN_OR_THROW(out_file, FileClass::Open(output_filename));
// parquet_schema() is a helper (not shown) that builds the Parquet schema root node.
auto schema = parquet_schema(schema_for_parquet);
parquet::WriterProperties::Builder builder;
builder.compression(parquet::Compression::SNAPPY);
// ... set other builder properties here...
std::shared_ptr<parquet::WriterProperties> props = builder.build();
// The test k-v metadata to attach to the file:
std::vector<std::string> k = {"test-key1", "test-key2"};
std::vector<std::string> v = {"test-value1", "test-value2"};
// Use the Make convenience function
auto kvmd = parquet::KeyValueMetadata::Make(k, v);
// The last argument is a shared_ptr<const parquet::KeyValueMetadata>, which has a nullptr
// default, so most example code using ParquetFileWriter::Open() doesn't show it getting passed.
std::shared_ptr<parquet::ParquetFileWriter> file_writer =
    parquet::ParquetFileWriter::Open(out_file, schema, props, kvmd);
After this, use the ParquetFileWriter as usual to write row group data and then close it; the key-value metadata will be saved with the schema.
Read the Key-Value Metadata with C++
You can find this functionality in the arrow/cpp/src/parquet/printer.cc source file, in the DebugPrint() method of the parquet::ParquetFilePrinter class. I’ve shortened it to show only the relevant code.
// In DebugPrint(), stream is the output std::ostream and print_key_value_metadata is a bool flag.
std::unique_ptr<parquet::ParquetFileReader> reader =
    parquet::ParquetFileReader::OpenFile(filename, memory_map);
const parquet::FileMetaData* file_metadata = reader->metadata().get();
// key_value_metadata() returns a null pointer when the file has no key-value metadata.
if (print_key_value_metadata && file_metadata->key_value_metadata()) {
  auto key_value_metadata = file_metadata->key_value_metadata();
  int64_t size_of_key_value_metadata = key_value_metadata->size();
  stream << "Key Value File Metadata: " << size_of_key_value_metadata << " entries\n";
  for (int64_t i = 0; i < size_of_key_value_metadata; i++) {
    stream << " Key nr " << i << " " << key_value_metadata->key(i) << ": "
           << key_value_metadata->value(i) << "\n";
  }
}
Read Key-Value Metadata with Python and Rust
You’ll probably want to access the key-value metadata from other applications. Here’s how that would look in Rust and Python.
In Rust
Assuming you have the parquet crate and use the SerializedFileReader, you can easily access the file-level key-value metadata. This snippet shows how to access each key and value. It’s not returned as a simple list of key-value tuples but as a reference to a Vec of KeyValue, from which you need to extract the pairs.
use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

let file =
    File::open(&Path::new(first_part_of_parquet)).expect("Couldn't open parquet data");
let pq_reader = SerializedFileReader::new(file).unwrap();
// KeyValue.value is an Option<String>, so each pair is (String, Option<String>).
let key_values: Vec<(String, Option<String>)> = if let Some(kv) = pq_reader.metadata()
    .file_metadata()
    .key_value_metadata() {
    kv.into_iter()
        .map(|kvm| (kvm.key.to_string(), kvm.value.clone()))
        .collect()
} else {
    Vec::new()
};
Python and PyArrow
The Parquet key-value file metadata is returned as bytes, not a string. Normally it will be UTF-8 data but it could be anything. So you need to decode it in Python as UTF-8 or whatever encoding you know it to be. Other than that, it’s very simple.
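import pyarrow.parquet as pq

# filename is the path to an existing Parquet file.
md = pq.read_metadata(filename)
# md.metadata is a dict of bytes keys and values, or None if there is no key-value metadata.
kvm = md.metadata or {}
for k, v in kvm.items():
    key = k.decode('utf-8')
    value = v.decode('utf-8')
    print(f"{key}: {value}")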
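Since the values could be anything, a slightly more defensive version of that decoding step may be worth having. This is only a sketch using the standard library: decode_kv_metadata is a hypothetical helper name, and the sample dict stands in for what md.metadata would return, with one deliberately invalid UTF-8 value.

```python
def decode_kv_metadata(raw, encoding="utf-8"):
    """Decode a dict of bytes keys/values into a dict of str.

    `raw` mimics the file-level metadata dict: bytes keys and values, or None.
    Undecodable bytes are replaced rather than raising, so one bad value
    doesn't hide the rest of the metadata.
    """
    if raw is None:
        return {}
    return {
        k.decode(encoding, errors="replace"): v.decode(encoding, errors="replace")
        for k, v in raw.items()
    }


# Stand-in for md.metadata (an assumption for illustration only).
sample = {
    b"test-key1": b"test-value1",
    b"test-key2": b"test-value2",
    b"binary-key": b"\xff\xfe",  # not valid UTF-8
}
decoded = decode_kv_metadata(sample)
for key, value in decoded.items():
    print(f"{key}: {value}")
```

Replacing bad bytes is a design choice; if you would rather fail loudly on unexpected encodings, drop the errors="replace" argument.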
Field-Level Metadata
You can put column-specific metadata with the field in the schema rather than at the file level. Using it requires that the application reading your data knows it needs to look for this metadata. It is a bit more complex to extract, but probably worthwhile. Within the field definition you add a “metadata” value with key-value data.
Here’s a nice article showing good uses of field-level Parquet metadata: “Document Your Data”.