
A Quick Tour of Trade-offs Embedding Data in Rust

Sometimes it makes sense to ship an application with the necessary data already inside. No need to fetch it from an external source and deal with failure when everything is bundled together.

How the data is bundled inside the application is what we’ll be looking at today, specifically in the context of Wasm apps, though the lessons learned here should be applicable to other use cases. The methods for embedding data vary wildly in compile time and size, which, frankly, took me by surprise, so I wanted to share my findings.

The contrived example we’ll be working with is an app with pseudo-localization where a numeric id maps to a word. The words are randomly picked with:

shuf -n 2040 /usr/share/dict/words > data.txt

So data.txt looks like:

buffered
transacting
coachman's
...
And the id we’ll assign to each word is its line number (starting at 0). We’ll use our library in JavaScript like:

console.time("load wasm");
const { Mapper } = require("./pkg");
console.timeLog("load wasm");

console.time("load data");
const myMap = new Mapper();
console.timeLog("load data");

// look up a word by its id (the method name and id here are illustrative)
console.log(myMap.get(1234));
// for me, this outputs: nonplusses

Note: don’t actually use what you’re seeing here for localization; it’s just an example.

Now that we have our data and we know what the end result should look like, I’m going to go over 4 ways one can embed the data.


Baseline

The baseline method is what I have been using up until this point. The build.rs takes the data and outputs:

use std::collections::HashMap;

pub fn translation() -> HashMap<i32, String> {
    let mut data = HashMap::with_capacity(2040);
    data.insert(0, String::from("buffered"));
    data.insert(1, String::from("transacting"));
    data.insert(2, String::from("coachman's"));
    // ... 2037 more inserts
    data
}
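For reference, the build script that emits this unrolled code could be sketched like so. `generate_baseline` is a hypothetical helper; a real build.rs would write the resulting string to a file under OUT_DIR rather than return it:

```rust
// Sketch of a build.rs helper that expands data.txt into the unrolled
// inserts shown above. `generate_baseline` is hypothetical.
fn generate_baseline(data: &str) -> String {
    let mut out = String::from("use std::collections::HashMap;\n");
    out.push_str("pub fn translation() -> HashMap<i32, String> {\n");
    out.push_str(&format!(
        "    let mut data = HashMap::with_capacity({});\n",
        data.lines().count()
    ));
    for (i, line) in data.lines().enumerate() {
        // `{:?}` quotes and escapes each word for us
        out.push_str(&format!("    data.insert({}, String::from({:?}));\n", i, line));
    }
    out.push_str("    data\n}\n");
    out
}

fn main() {
    print!("{}", generate_baseline("buffered\ntransacting\ncoachman's"));
}
```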


Separate

Instead of having all the hashmap inserts unrolled, we separate the data from the insertion and have build.rs generate a loop over our data:

use std::collections::HashMap;

pub fn translation() -> HashMap<i32, String> {
    let mut data = HashMap::with_capacity(2040);
    let raw = &[
        (0, "buffered"),
        (1, "transacting"),
        (2, "coachman's"),
        // ... 2037 more entries
    ];

    for (i, line) in raw {
        data.insert(*i, String::from(*line));
    }
    data
}
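The corresponding build.rs helper would emit the data as a slice literal plus a loop instead of unrolled inserts. A sketch, with a hypothetical `generate_separate` helper (again, a real build script would write this string to a file under OUT_DIR):

```rust
// Sketch of a build.rs helper for the "separate" method: emit the data
// as a slice literal plus a loop. `generate_separate` is hypothetical.
fn generate_separate(data: &str) -> String {
    let mut out = String::from("use std::collections::HashMap;\n");
    out.push_str("pub fn translation() -> HashMap<i32, String> {\n");
    out.push_str(&format!(
        "    let mut data = HashMap::with_capacity({});\n",
        data.lines().count()
    ));
    out.push_str("    let raw = &[\n");
    for (i, line) in data.lines().enumerate() {
        out.push_str(&format!("        ({}, {:?}),\n", i, line));
    }
    out.push_str("    ];\n");
    out.push_str("    for (i, line) in raw {\n");
    out.push_str("        data.insert(*i, String::from(*line));\n");
    out.push_str("    }\n    data\n}\n");
    out
}

fn main() {
    print!("{}", generate_separate("buffered\ntransacting\ncoachman's"));
}
```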



Parse

We can include unparsed data in the code and then parse it at runtime.

use std::collections::HashMap;

pub fn translation() -> HashMap<i32, String> {
    let mut result = HashMap::with_capacity(2040);
    let data = include_str!("../data.txt");
    for (i, line) in data.lines().enumerate() {
        result.insert(i as i32, String::from(line));
    }
    result
}



Gzip

We can include compressed, unparsed data and then decompress and parse it at runtime. The build script will need to output the compressed data so that our code can reference it.

use flate2::read::GzDecoder;
use std::collections::HashMap;
use std::io::Read;

pub fn translation() -> HashMap<i32, String> {
    let raw = include_bytes!(concat!(env!("OUT_DIR"), "/data.gz"));
    let mut gz = GzDecoder::new(&raw[..]);
    let mut s = String::new();
    gz.read_to_string(&mut s).unwrap();

    let mut result = HashMap::with_capacity(2040);
    for (i, line) in s.lines().enumerate() {
        result.insert(i as i32, String::from(line));
    }
    result
}



Phf

This section was added after initial publishing due to popular demand.

This won’t be an exact apples-to-apples comparison, but when the previous solutions are all about constructing hashmaps from compile time data, phf has to be included. I say it isn’t apples to apples because phf requires our string data to be &'static str instead of String. This section wasn’t initially included, as the method is not as general (not all compile time data will be used in hashmaps), but since there was enough popular demand, I’ve updated this article to include it.

Since the map is generated by phf, it is more instructive to show how the build.rs code generates it:

use std::env;
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

fn build_phf(data: &[u8]) -> Result<(), Box<dyn Error>> {
    let reader = BufReader::new(std::io::Cursor::new(&data));
    let path = Path::new(&env::var("OUT_DIR")?).join("generated-phf.rs");
    let mut file = BufWriter::new(File::create(&path)?);

    write!(&mut file, "static TRANSLATION: phf::Map<i32, &'static str> = ")?;
    let mut map = phf_codegen::Map::new();
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        map.entry(i as i32, &format!("\"{}\"", &line));
    }
    write!(&mut file, "{};", map.build())?;
    Ok(())
}

Underneath the covers, if you look at the generated code, it is very similar to the “separate” method detailed above, which lends credence to that method.


Comparison

I took each of the 5 methods and ran them through an incremental suite of tests:

Method               Baseline   Separate   Parse   Gzip    Phf
Debug build (s)           0.3        0.3     0.3    0.3    0.6
Release build (s)        35.7        0.9     0.8    2.0    3.5
wasm-pack build (s)     113.0        1.1     1.0    3.3    1.3
Wasm size (KB)          323.0       63.0    43.0   89.0   60.0
Gzipped Wasm (KB)        42.0       31.0    21.0   54.0   30.0
Load Wasm (ms)           22.5        6.0     6.0    7.0    6.0
Load data (ms)            0.5        1.0     1.4    1.9    0.1

Let’s go over the findings:

  • Debug build times are negligible for every method (phf is only marginally slower)
  • The baseline is by far the worst when it comes to release compiles. Taking nearly 2 minutes to create an optimized Wasm package from 2040 lines of data, without any dependencies, is significant. In fact, when I was preparing this post, I initially started with a data file of 10000 lines, but it took 11 minutes to compile, so I jettisoned most of it. Here’s the playground link if you want to play with it.
  • The baseline is also by far the worst when it comes to Wasm size (at least it compresses well)
  • While the baseline has the quickest time to create the hashmap, if one is only creating the hashmap a handful of times (as one would probably expect), the cost for loading the Wasm outstrips any performance benefit.
  • The “separate” and “parse” methods both contained a good mix of quick compiles, small size, and good performance and are my recommendations.
  • Phf is also an excellent choice with quick compile times, small code size, and unbeatable runtime performance. However, not all compile time data is transformed into maps, so this method is not universal.

Additional thoughts:

  • There should be no need to compress the data as in the gzip method, since Wasm should already be compressed when transferred over the internet. Unless the data can leverage tailor-made compression, it’s best to leave it uncompressed.
  • The parse method is entirely format dependent. In this post I used an extremely simple newline delimited format. This can be expanded to more complex use cases like storing multiple fields per line, but at some point you may reach the limits of hand rolling your own format and want to reach for csv, json, or another format. Just be sure that the benefits of the format outweigh the drawbacks (an additional dependency that brings in code and performance costs).
  • If the parsing logic becomes too intense (i.e., nested data of varying types: numbers, dates, arrays, etc.), use the separate method to save your sanity.
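As a sketch of what a slightly richer hand-rolled format could look like, here is a parser for a made-up tab-separated id-and-word layout. The format and the `parse_translation` name are illustrative, not from the project above:

```rust
use std::collections::HashMap;

// Parse a hand-rolled "id<TAB>word" format at runtime.
// The format and function name are illustrative.
pub fn parse_translation(data: &str) -> HashMap<i32, String> {
    let mut result = HashMap::new();
    for line in data.lines() {
        // Split into at most two fields so a word may contain tabs
        let mut fields = line.splitn(2, '\t');
        let id: i32 = fields.next().unwrap().parse().expect("numeric id");
        let word = fields.next().expect("word field");
        result.insert(id, String::from(word));
    }
    result
}

fn main() {
    let map = parse_translation("0\tbuffered\n1\ttransacting");
    println!("{:?}", map.get(&0));
}
```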

The results when applying the “parse” technique to my real world project of Rakaly, an EU4 achievement leaderboard:

  • Incremental compile times reduced from 155s to 68s
  • Wasm size dropped from 2.1 MB to 1.3 MB
  • Gzip Wasm size dropped from 424.3 KB to 300.0 KB

This is a huge improvement, all at little to no cost in runtime performance.


Conclusion

When it’s appropriate to include data inside a Rust application, one should prefer embedding the data in a dense, easily decoded format to realize compile time and size improvements. The alternative of having the cargo build script preemptively expand the data into insert or push calls can be used when runtime performance is more important than compile times and size.



Comments

If you'd like to leave a comment, please email hi@nickb.dev

2020-11-13 - Dylan Anthony

Did you try using either lazy_static ( https://docs.rs/lazy_static/1.4.0/lazy_static/ ) or phf ( https://docs.rs/phf/0.8.0/phf/ ), and if so, how did they compare? I use lazy_static to embed some data in one of my programs and it works pretty well, though I’m not generating that from anything; it’s hard coded.

2020-11-14 - nick

Added section on phf – it works really well when the data fits! Quick compile time, small code size, and unbeatable runtime performance.