Save Arrow Record Batches Fast to Parquet With Custom Metadata During Incremental Writes

Adding custom metadata is easy and documented when saving an entire table, but adding it to batched output works differently.

Saving custom metadata (“schema metadata” or “file metadata”) to Parquet can be really useful. You can put an application’s data format version, release notes or many other things right into the data files. The documentation is somewhat lacking on how to accomplish it with PyArrow, but you totally can. Last time I reviewed the docs for Polars and DuckDB, they didn’t allow for adding your own metadata to Parquet output at all. [Read More]

Notes on simplifying complex Parquet data

Not all tools can read nested logical Map or List type data (often produced by Spark). Here are some tips to make the data accessible to more tools.

The Parquet columnar data format typically has columns of simple types: int32, int64, string and a few others. However, columns can also have logical types of “List” or “Map”, and their members may be further “List” or “Map” structures or primitive types. [Read More]

Better Code Organization by Nesting Functions

The other day I found myself writing a really long Python script full of small groups of “helper” functions. Each group only “helped” a single caller. Something felt off. What a mess. Hidden under all the clutter, the script had a fairly simple structure. There’s only one path through the code. Breaking it into separate files would only obscure the logic. So how could I make that more clear? [Read More]
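To make the idea concrete, here is a small sketch, assuming a script with a single top-level step whose helpers are used nowhere else. The function names and the input format are hypothetical, not taken from the script described above.

```python
def build_report(rows):
    """One top-level step; the helpers it needs live inside it."""

    def parse(row):
        # Only build_report uses this helper, so it is defined here
        # rather than at module level.
        name, amount = row.split(",")
        return name.strip(), float(amount)

    def total(pairs):
        # Sum the amounts from the parsed (name, amount) pairs.
        return sum(amount for _, amount in pairs)

    pairs = [parse(row) for row in rows]
    return {"count": len(pairs), "total": total(pairs)}


report = build_report(["apples, 1.5", "pears, 2.5"])
```

Because `parse` and `total` are nested, a reader scanning the module sees only `build_report`, and the helpers cannot be called by accident from anywhere else.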

Add Key-Value Metadata to Parquet Files in C++

Arbitrary file-level metadata on a Parquet file can be extremely useful, but adding it in C++ isn't well documented. Here's how to do it.

Although the Parquet format allows extra metadata and the C++ libraries provide a means to read and write it, the capability isn’t well documented. I’ll show some example code to clarify how to read and write key-value Parquet metadata. This advice is specific to directly using the C++ libraries in the Arrow project. [Read More]

My not so deep thoughts on AI

Speculation on our AI future races straight to where we fear it ends up, but we should think about what comes first. (Still, I can't help myself in the concluding thoughts.)

Will A.I. Become the New McKinsey? Yes. And it will begin when McKinsey and the other big consultants start to apply A.I. routinely. If firms like McKinsey are “capital’s willing executioners”, A.I. will at first merely sharpen their axes. [Read More]

So Many New Systems Programming Languages II

Twelve new systems languages, and one that dates to the Carter administration

Here’s a non-exhaustive rundown of newish systems languages. I’ll list some notable things about them related to safety and syntax, as I discussed in the previous post. Here they are, in rough order of production readiness and popularity. Sorry if I put something lower than it deserved. [Read More]