A Quick Tour of Trade-offs Embedding Data in Rust

Sometimes it makes sense to ship an application with the necessary data already inside. No need to fetch it from an external source and deal with failure when everything is bundled together.

How the data is bundled inside the application is what we’ll be looking at today, specifically in the context of Wasm apps, though the lessons learned here should be applicable to other use cases. The methods for embedding data vary wildly in compile time and size, which, frankly, took me by surprise, so I wanted to share my findings.

The contrived example we’ll be working with is an app with pseudo-localization where a numeric id maps to a word. The words are randomly picked with:

shuf -n 2040 /usr/share/dict/words > data.txt

So data.txt looks like:

buffered
transacting
coachman's

And the id we’ll assign to each word is its line number (starting at 0). We’ll use our library from JavaScript like so:

console.time("load wasm");
const { Mapper } = require("./pkg");
console.timeLog("load wasm");

console.time("load data");
const myMap = new Mapper();
console.timeLog("load data");

console.log(myMap.get(100));
// for me, this outputs: nonplusses

Note: Don’t actually use what you’re seeing for localization; this is just an example.
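
The Rust side of that Mapper class isn’t shown in the post; as a rough sketch, assuming wasm-bindgen and the translation() function that each method below provides, it could look like:

use std::collections::HashMap;
use wasm_bindgen::prelude::*;

// Sketch of the wrapper the JavaScript snippet assumes (hypothetical; not
// part of the original post). Each embedding method below supplies `translation()`.
#[wasm_bindgen]
pub struct Mapper {
    data: HashMap<i32, String>,
}

#[wasm_bindgen]
impl Mapper {
    #[wasm_bindgen(constructor)]
    pub fn new() -> Mapper {
        Mapper { data: translation() }
    }

    pub fn get(&self, id: i32) -> Option<String> {
        self.data.get(&id).cloned()
    }
}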

Now that we have our data and we know what the end result should look like, I’m going to go over 4 ways one can embed the data.

Baseline

The baseline method is what I have been using up until this point. The build.rs takes the data and outputs:

use std::collections::HashMap;
pub fn translation() -> HashMap<i32, String> {
    let mut data = HashMap::with_capacity(2040);
    data.insert(0, String::from("buffered"));
    data.insert(1, String::from("transacting"));
    data.insert(2, String::from("coachman's"));

    // ...
    data
}
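
The build script itself isn’t included in the post; a minimal sketch, with a hypothetical output file name of baseline.rs that the library would pull in via include!(concat!(env!("OUT_DIR"), "/baseline.rs")), could be:

// build.rs (sketch): unroll data.txt into one insert call per word.
use std::env;
use std::fs::{self, File};
use std::io::{BufWriter, Write};
use std::path::Path;

fn main() {
    let data = fs::read_to_string("data.txt").unwrap();
    let dest = Path::new(&env::var("OUT_DIR").unwrap()).join("baseline.rs");
    let mut out = BufWriter::new(File::create(&dest).unwrap());

    writeln!(out, "use std::collections::HashMap;").unwrap();
    writeln!(out, "pub fn translation() -> HashMap<i32, String> {{").unwrap();
    writeln!(out, "    let mut data = HashMap::with_capacity({});", data.lines().count()).unwrap();
    for (i, line) in data.lines().enumerate() {
        // `{:?}` emits each word as a quoted, escaped Rust string literal.
        writeln!(out, "    data.insert({}, String::from({:?}));", i, line).unwrap();
    }
    writeln!(out, "    data").unwrap();
    writeln!(out, "}}").unwrap();

    println!("cargo:rerun-if-changed=data.txt");
}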

Separate

Instead of having all the hashmap inserts unrolled, we separate the data from the insertion and have build.rs generate a loop over our data:

use std::collections::HashMap;

pub fn translation() -> HashMap<i32, String> {
    let mut data = HashMap::with_capacity(2040);
    let raw = &[
        (0, "buffered"),
        (1, "transacting"),
        (2, "coachman's"),
        // ...
    ][..];

    for (i, line) in raw {
        data.insert(*i, String::from(*line));
    }

    data
}

Parse

We can include unparsed data in the code and then parse it at runtime.

use std::collections::HashMap;

pub fn translation() -> HashMap<i32, String> {
    let mut result = HashMap::with_capacity(2040);
    let data = include_str!("../data.txt");
    for (i, line) in data.lines().enumerate() {
        result.insert(i as i32, String::from(line));
    }

    result
}

Gzip

We can include compressed unparsed data and then decompress and parse at runtime. The build script will need to output the compressed data so that our code can reference it.

use flate2::read::GzDecoder;
use std::collections::HashMap;
use std::io::Read;

pub fn translation() -> HashMap<i32, String> {
    let raw = include_bytes!(concat!(env!("OUT_DIR"), "/data.gz"));
    let mut gz = GzDecoder::new(&raw[..]);
    let mut s = String::new();
    gz.read_to_string(&mut s).unwrap();

    let mut result = HashMap::with_capacity(2040);
    for (i, line) in s.lines().enumerate() {
        result.insert(i as i32, String::from(line));
    }

    result
}
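
The build script half isn’t shown either; a sketch, assuming flate2 is also declared as a build dependency, might be:

// build.rs (sketch): compress data.txt into OUT_DIR/data.gz so the
// include_bytes! call above can embed the compressed bytes.
use flate2::write::GzEncoder;
use flate2::Compression;
use std::env;
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn main() {
    let words = fs::read("data.txt").unwrap();
    let dest = Path::new(&env::var("OUT_DIR").unwrap()).join("data.gz");
    let mut encoder = GzEncoder::new(File::create(&dest).unwrap(), Compression::best());
    encoder.write_all(&words).unwrap();
    encoder.finish().unwrap();

    println!("cargo:rerun-if-changed=data.txt");
}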

PHF

This section was added after initial publishing due to popular demand

This won’t be an exact apples-to-apples comparison, but when the previous solutions are all about constructing hashmaps around compile time data, phf has to be included. I say it won’t be apples to apples because phf requires our string data to be &'static str instead of String. This section wasn’t initially included, as it is not as general (not all compile time data will end up in a hashmap), but since there was enough popular demand, I’ve updated the article to include it.

Since the map is generated by phf, it is more instructive to show how the build.rs code generates it:

// in build.rs
use std::env;
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

fn build_phf(data: &[u8]) -> Result<(), Box<dyn Error>> {
    let reader = BufReader::new(std::io::Cursor::new(&data));
    let path = Path::new(&env::var("OUT_DIR")?).join("generated-phf.rs");
    let mut file = BufWriter::new(File::create(&path)?);

    write!(&mut file, "static TRANSLATION: phf::Map<i32, &'static str> = ")?;
    let mut map = phf_codegen::Map::new();
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        map.entry(i as i32, &format!("\"{}\"", &line));
    }

    write!(&mut file, "{};", map.build())?;
    Ok(())
}
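
On the library side, consuming the generated file could look roughly like the following sketch (the translate accessor is made up; phf would be a regular dependency and phf_codegen a build dependency):

// Pull in the build output, then look words up in the phf map directly.
include!(concat!(env!("OUT_DIR"), "/generated-phf.rs"));

pub fn translate(id: i32) -> Option<&'static str> {
    TRANSLATION.get(&id).copied()
}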

Under the covers, if you look at the generated code, it is very similar to the “separate” method detailed above, which lends credence to that method.

Results

I took each of the 5 methods and ran them through an incremental suite of tests:

Method               Baseline   Separate   Parse   Gzip    Phf
Debug Build (s)           0.3        0.3     0.3    0.3    0.6
Release Build (s)        35.7        0.9     0.8    2.0    3.5
Wasm-pack Build (s)     113.0        1.1     1.0    3.3    1.3
Wasm (kb)               323.0       63.0    43.0   89.0   60.0
Gzip Wasm (kb)           42.0       31.0    21.0   54.0   30.0
Load Wasm (ms)           22.5        6.0     6.0    7.0    6.0
Load data (ms)            0.5        1.0     1.4    1.9    0.1

Let’s go over the findings:

Additional thoughts:

The results when applying the “parse” technique to my real-world project, Rakaly, an EU4 achievement leaderboard:

This is a huge improvement, and it comes at little to no cost in runtime performance.

Conclusion

When it’s appropriate to include data inside a Rust application, one should prefer embedding the data in a dense, easily decoded format to realize compile time and size improvements. The alternative of having the cargo build script preemptively expand the data into insert or push calls can be used when runtime performance is more important than compile time and size.


Comments

If you'd like to leave a comment, please email [email protected]

2020-11-13 - Dylan Anthony

Did you try using either lazy_static ( https://docs.rs/lazy_static/1.4.0/lazy_static/ ) or phf ( https://docs.rs/phf/0.8.0/phf/ ), and if so, how did they compare? I use lazy_static to embed some data in one of my programs and it works pretty well, though I’m not generating that from anything; it’s hard coded.

2020-11-14 - nick

Added section on phf – it works really well when the data fits! Quick compile time, small code size, and unbeatable runtime performance.