
The DuckDB and Arrow rabbit hole: data modeling adventures
I didn’t understand the hype around DuckDB until I was approached to query a set of proprietary-yet-close-to-JSON files I have, totalling 300 GB uncompressed (25 GB compressed).
For prior requests, I wrote a bespoke CLI program to process the files and spit out statistics. It was finally time for a different approach: package the data in a format that allows one to make arbitrary queries.
With a little bit of upfront investment, I could answer future questions quicker and I could even distribute the dataset so others don’t have to ask me.
DuckDB fits this use case perfectly:
- Portable database: all the tables are packaged into a single file
- Queries execute via SQL, so it is very approachable
- Compression: I’m routinely impressed with how little space the columnar datastore consumes
- Fast analytical database: DuckDB chewed through a hundred million rows in a second
Parquet is nice, but it doesn’t have as seamless a distribution story as a self-contained DuckDB file thrown onto S3 and shared with the wider world.
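To give a flavor of that distribution story, here’s a minimal sketch of querying a shared database file directly from object storage. The URL is made up, and the file has to be publicly reachable (or have credentials configured) for the read-only attach to work:
INSTALL httpfs;
LOAD httpfs;
-- attach a remote database file over HTTPS (hypothetical URL), read-only
ATTACH 'https://example.com/datasets/province_history.duckdb' AS remote (READ_ONLY);
SELECT count(*) FROM remote.province_history;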
Tagged unions
The cherry on top was that I could model the data perfectly with a tagged union.
CREATE TABLE province_history(
    external_id VARCHAR NOT NULL,
    province_id USMALLINT NOT NULL,
    date DATE NOT NULL,
    event UNION (
        owner VARCHAR,
        controller VARCHAR,
        religion VARCHAR
    ) NOT NULL
);
The tagged union allows me to express that one of three events (the owner changed, the controller changed, or the religion changed) happened to a given province on a given date. And since the data model allows multiple rows per province/date, it can capture multiple distinct events, even of the same type.
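As a quick sketch of what working with the union looks like (the IDs and values below are made up), DuckDB’s union_value, union_tag, and union_extract functions handle construction and inspection:
INSERT INTO province_history VALUES
    ('abc123', 42, DATE '1444-11-11', union_value(owner := 'FRA')),
    ('abc123', 42, DATE '1444-11-11', union_value(religion := 'catholic'));

-- what kind of event happened, plus a specific member (NULL when the tag doesn't match)
SELECT date, union_tag(event) AS event_type, union_extract(event, 'owner') AS new_owner
FROM province_history
WHERE province_id = 42;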
Appender issues
However, tagged unions have a downside. Despite being a DuckDB builtin supported since the database’s earliest days, tagged unions are esoteric and difficult to insert into the database.
DuckDB’s Appender API is recommended[1] when inserting millions of rows. And while DuckDB officially supports Rust, and the Rust bindings expose the Appender API, that support does not extend to complex types, including unions[2].
The C Appender API only received support for complex types yesterday[3]:
// Either fetch the logical type for the union from the database
// or reconstruct it exactly how it is in the database
// union_type =
duckdb_value owner_value = duckdb_create_varchar("France");
duckdb_value union_value = duckdb_create_union_value(union_type, 0, owner_value);
duckdb_appender_append_value(appender, union_value);
It should be possible to write Rust code that invokes the C API, but that is not an appetizing thought.
I had other workaround ideas, though they all ended in failure. One was to leverage Apache Arrow, another data format that supports union types, but the Appender’s support for inserting data via an Arrow RecordBatch yielded the same lack of support for complex types. Even persisting the Arrow data to disk and loading it via a DuckDB extension failed[4].
Wide tables
The rough edges around tagged union columns (and the fact that they lack vectorizable operations in Apache Arrow[5]) convinced me to roll my own version of tagged unions with a wide table.
CREATE TABLE province_history(
    external_id VARCHAR NOT NULL,
    province_id USMALLINT NOT NULL,
    date DATE NOT NULL,
    event_owner VARCHAR,
    event_controller VARCHAR,
    event_religion VARCHAR
);
The biggest downside is readability. This schema allows multiple events per row, yet still needs to spill repeated events of the same type onto separate rows. I don’t even want multiple events per row, as that insinuates a potentially false relationship amongst the columns of a given row.
I’m not too worried about the table being sparse, as the only cost for an entirely null column is the specialized bit vectors that both DuckDB and Arrow leverage to designate value presence.
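And if a tall, event-per-row shape is ever needed again, the wide table can be folded back at query time. A minimal sketch with DuckDB’s UNPIVOT, which (as far as I can tell) drops the NULL cells by default:
UNPIVOT province_history
ON event_owner, event_controller, event_religion
INTO
    NAME event_type
    VALUE event_value;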
An alternative design is to use an enum column with the value stored in a separate column.
CREATE TYPE province_event AS ENUM ('owner', 'controller', 'religion');

CREATE TABLE province_history(
    external_id VARCHAR NOT NULL,
    province_id USMALLINT NOT NULL,
    date DATE NOT NULL,
    event_type province_event,
    event_text VARCHAR
);
Since all events have string values, we can get away with a single value column to hold all of them. If we wanted to add a new event type whose values were numeric, we’d need to introduce a new value column too (and then worry about the best way to prevent drift between the enum type and which column is populated).
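For completeness, here’s roughly how this shape gets populated and queried (values made up); the plain strings cast implicitly to the province_event enum:
INSERT INTO province_history VALUES
    ('abc123', 42, DATE '1444-11-11', 'owner', 'FRA'),
    ('abc123', 42, DATE '1444-11-11', 'religion', 'catholic');

SELECT date, event_text
FROM province_history
WHERE province_id = 42 AND event_type = 'owner';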
One potential impact comes from the cardinality of the value column: in this example, the domain of religions has no overlap with owners and controllers. That can affect the effectiveness of compression and queries, so mileage may vary. In this case, I suspect a few thousand unique values, which is very manageable.
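That suspicion is easy to check once the table is loaded, with something along the lines of:
SELECT event_type, count(DISTINCT event_text) AS unique_values
FROM province_history
GROUP BY event_type;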
Apache Arrow can emulate DuckDB’s enum type with a dictionary array that maps integers to strings.
Wasm compute
It may seem odd that I keep bringing up DuckDB and Apache Arrow, trying to find common ground with well-supported data types, but that’s because I’m cautiously optimistic that if the data is encoded in a format conducive to queries, then I can push the same queries into the web app that works with the same data.
I’d be able to standardize all the bespoke queries I’ve written over the past years that traversed hierarchies, filtered, grouped, and everything else that a data frame library can do.
There are data frame libraries built on top of Apache Arrow, like DataFusion and Polars, but if I also want local compute to be a first-class citizen in the browser, I can’t subject my users to downloading large bundles. Polars is 23 MB, duckdb-wasm is 33 MB, and I hear DataFusion is north of 50 MB.
So while it would be cool to share the same Duckdb queries on the web and locally, it is too tough of a sell.
The good news is that the Apache Arrow library comes with compute kernels baked in. In a simple compute example, Arrow compiled to Wasm ended up weighing 1.6 MB (300 kB compressed). By most measures this is still kinda lofty, but the hope is that this high upfront cost can be amortized as more features are written.
Going pure Arrow means giving up builtin support for joins and group by, so perhaps I’m getting ahead of myself, and the first step should be to transform internal data structures to resemble a columnar data store.
Conclusion
I’m going to take a pause and reflect on this adventure. I did not anticipate going down a rabbit hole of data analysis at the start of this week, but I’m glad I did. DuckDB has some serious strengths, and it was fun to issue one-off SQL queries to dive into the data. DuckDB isn’t perfect: the ecosystem’s lack of support for tagged unions can hamper data modeling efforts, but nothing that can’t be worked around.
Work remains, and I’m curious how much I can reuse in a system that blends a lightweight data analysis framework in Wasm and offers a DuckDB data export too.