
The serde optimization gauntlet: Wasm and arenas
Arena allocators are having a moment in the Rust ecosystem. They’re powering the next generation of JavaScript tooling1, showing up in Reddit success stories2, and promising three key benefits:
- Amortized allocations
- Cache locality
- Efficient drops
I’m working on a project that deserializes 200MB of binary data in the browser via Wasm, where performance is critical. Could arena allocators help?
I want to break down this problem a few different ways:
- How important is an allocation strategy in Wasm?
- How to think about marrying serde with arena lifetimes
- How can I write my own arena allocator that avoids the viral lifetimes that complicate deserialize implementations
- How the choice of global allocator can influence benchmarks
- How serde’s hidden deserialize in place API can impact performance
Wasm allocator benchmarks
For those unfamiliar, every Wasm module has an allocator compiled in to support memory allocations. The default is dlmalloc-rs, and while it advertises that “most of the time you’ll be manually switching to a different allocator”, I’d wager most don’t switch for reasons we’ll see shortly.
Whatever allocator is used within Wasm, it has to request memory via Memory.grow when more 64 KiB pages are required. The browser allocator then fulfills that request. At each layer there’s a possibility of performance issues, so it can be a good idea to test across a multitude of allocators and browsers.
A few years ago, it was one of these layers, Memory.grow, that caused Chromium to perform 100x worse than Firefox in an application on 32-bit machines.3
I had the honor of reporting this bug and coming up with a contrived benchmark for reproduction.
#[wasm_bindgen]
pub fn allocation(text: &str, iterations: usize) -> usize {
    let mut data = Vec::new();
    for _ in 0..iterations {
        let mut new_data = Vec::new();
        for _ in 0..5 {
            new_data.push(String::from(text).into_boxed_str());
        }
        data.push(new_data);
    }
    data.len()
}
I host the benchmark online, so play around with it!
The arena allocator version of this benchmark uses bumpalo:
#[wasm_bindgen]
pub fn allocation(text: &str, iterations: usize) -> usize {
    use bumpalo::{collections::Vec, Bump};
    let total_capacity = (text.len() * iterations * 5)
        + (std::mem::size_of::<&str>() * iterations * 5)
        + (std::mem::size_of::<Vec<&str>>() * iterations)
        + std::mem::size_of::<Vec<Vec<&str>>>();
    let bump = Bump::with_capacity(total_capacity);
    let mut data = Vec::new_in(&bump);
    for _ in 0..iterations {
        let mut new_data = Vec::new_in(&bump);
        for _ in 0..5 {
            new_data.push(bump.alloc_str(text));
        }
        data.push(new_data);
    }
    data.len()
}
The benchmark demonstrates another benefit of bumpalo: the ability to pre-allocate based on expected usage. This means our code should never need to reach down through the layers other than during initialization. In this benchmark the pre-allocation optimization is negligible, so perhaps it is more of a novelty.
Anyways, here are results when our text is “helloworld” with 100,000 iterations:
- Chrome: Bumpalo is 40% faster (14.6ms vs 24.5ms)
- Firefox: Bumpalo is 25% faster (17ms vs 23ms)
At a million iterations:
- Chrome: Bumpalo is 500% faster (136ms vs 707ms)
- Firefox: Bumpalo is 30% faster (156ms vs 218ms)
Across all benchmarks, bumpalo comes out on top, running 5x faster at heavier loads on Chrome. While still slower than bumpalo, running a general purpose allocator on Firefox scaled better with the workload.
The hosted benchmark also includes swapping out the dlmalloc implementation with a promising alternative, talc. I was not able to reproduce any performance benefits across Chrome and Firefox. This may be why swapping the global allocator on Wasm isn’t commonplace.
Bumpalo isn’t the fastest arena allocator. There’s a project called bump-scope, which benchmarks close to 2x “faster” than bumpalo (at least within Wasm). The key difference is their vec implementations. Because the standard library’s Vec is not generic over the allocator on stable Rust, both bumpalo and bump-scope ship their own Vec implementation. When the vec implementation variable is held constant (on nightly Rust), bumpalo and bump-scope are neck and neck.4
A tip: when benchmarking, avoid having the devtools open in Chrome, as it causes an unoptimized version of the code to execute5 and degrades performance. In this particular benchmark, and on my machine, it is a 20% penalty. To avoid temptation, you may want to think twice about having your Wasm report performance data to the console.
A nice property of bumpalo worth highlighting is that it’s possible to query the amount of data allocated by the arena, which is invaluable when working with large or hierarchical data structures to get a sense of just how much memory is needed per instance. Otherwise you’re stuck calculating by hand or relying on counters hooked into the global allocator, which are susceptible to stray allocations:
bump.allocated_bytes();
A homebrewed arena for deserialization
With bumpalo’s proven performance and so many other arena allocators in Rust6, naturally I decided to write my own. Spoiler: this doesn’t work out, but sheds light on the design of allocators.
I thought I could be clever. I thought I could make the ergonomics better than bumpalo when it came to deserializing a deeply nested structure, through which bumpalo would require one to weave the allocator lifetime.
For instance, notice how viral the bump lifetime is:
#[derive(Debug)]
pub struct ZipPrelude<'bump> {
    pub metadata: Metadata<'bump>,
}

#[derive(Debug)]
pub struct Metadata<'bump> {
    pub compatibility: Compatibility<'bump>,
    pub playthrough_id: BStr<'bump>,
    pub playthrough_name: BStr<'bump>,
    // ...
}

#[derive(Debug)]
pub struct Compatibility<'bump> {
    pub locations: bumpalo::collections::Vec<'bump, BStr<'bump>>,
    // ...
}
Deserialization would require each level to know about the allocator and that kind of structuring is not possible with serde.
So my clever idea was to eliminate the lifetime by making all data owned.
Segment array
I knew the data format didn’t support strings longer than 64 KiB, so I made my own string type that was 6 bytes (4x smaller than String on 64-bit platforms): 2 bytes for the length, 2 bytes for the start index, and 2 bytes for the bucket index. Each bucket is a fixed size of 64 KiB, and whenever the capacity is exhausted, a new 64 KiB bucket is allocated.
A growable list with stable references is nothing new, and goes by many names7. I’ll call it a segment array.
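Roughly, the idea looked like this (a simplified sketch with made-up names like ArenaStr and SegmentArray, not my exact implementation):
const BUCKET_SIZE: usize = 64 * 1024;

// 6-byte handle: the format guarantees strings shorter than 64 KiB, so u16s suffice.
#[derive(Clone, Copy, Debug)]
pub struct ArenaStr {
    bucket: u16, // which fixed-size bucket the bytes live in
    start: u16,  // byte offset within that bucket
    len: u16,    // length in bytes
}

pub struct SegmentArray {
    buckets: Vec<Vec<u8>>, // each bucket is allocated once and never reallocated
}

impl SegmentArray {
    pub fn store(&mut self, data: &[u8]) -> ArenaStr {
        assert!(data.len() < BUCKET_SIZE);
        // Open a new bucket if the current one can't hold the incoming bytes.
        let needs_new = self
            .buckets
            .last()
            .map_or(true, |b| b.len() + data.len() > BUCKET_SIZE);
        if needs_new {
            self.buckets.push(Vec::with_capacity(BUCKET_SIZE));
        }
        let bucket = self.buckets.last_mut().unwrap();
        let start = bucket.len() as u16;
        bucket.extend_from_slice(data);
        ArenaStr {
            bucket: (self.buckets.len() - 1) as u16,
            start,
            len: data.len() as u16,
        }
    }
}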
That implementation didn’t last long. I swapped to a flat Vec<u8> for simplification. The problem: while segment arrays provide stable references in theory, I couldn’t leverage them in practice. Adding data to my allocator required a mutable reference, and Rust’s borrow checker would forbid any outstanding references even though they’d remain valid due to the stable growth property.
Imagine my surprise when I saw the oxc allocator documentation8:
“The data from the 1st chunk is not copied into the 2nd one. It stays where it is, which means & or &mut references to data in the first chunk remain valid. This is unlike e.g. Vec which copies all existing data when it grows.”
Huh? Won’t stable growth mean allocations are forced to take a shared reference to self? Sure enough, looking at the function signature, only a shared reference to self is used:
pub fn alloc_slice_copy<T: Copy>(&self, src: &[T]) -> &mut [T]
Tracing the code down into bumpalo, it achieves interior mutability by stuffing the pointers returned by the global allocator inside Cell<T>.
It makes sense in hindsight, but it can be difficult to know how hard one should try to make an interface require only a shared reference. I’m saying this like I could mimic Bumpalo if I knew this ahead of time (yeah right). I take solace in knowing the dlmalloc Rust port also struggles with this.9
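To make the trick concrete, here is a toy sketch of a Cell-based bump pointer (not bumpalo’s actual code): the cursor lives in a Cell, so allocation only needs &self, and previously returned references stay untouched. It leaks its single chunk for brevity; a real arena chains chunks and frees them on drop.
use std::cell::Cell;
use std::mem::ManuallyDrop;

struct ToyBump {
    start: *mut u8,      // one chunk from the global allocator (leaked for brevity)
    capacity: usize,
    cursor: Cell<usize>, // bump offset, advanced through &self
}

impl ToyBump {
    fn with_capacity(capacity: usize) -> Self {
        let mut chunk = ManuallyDrop::new(Vec::<u8>::with_capacity(capacity));
        ToyBump { start: chunk.as_mut_ptr(), capacity, cursor: Cell::new(0) }
    }

    // Note: &self, not &mut self, so previously returned &str stay usable.
    fn alloc_str(&self, s: &str) -> &str {
        let offset = self.cursor.get();
        assert!(offset + s.len() <= self.capacity, "toy arena is full");
        self.cursor.set(offset + s.len());
        // SAFETY: the cursor only moves forward, so this byte range does not
        // overlap any range handed out by a previous call.
        unsafe {
            let dst = self.start.add(offset);
            std::ptr::copy_nonoverlapping(s.as_ptr(), dst, s.len());
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(dst, s.len()))
        }
    }
}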
Store optimizations
During profiling, UTF-8 validation started to top the charts at 30-40% of self time. This validation was used to maintain the invariant that the string arena only contained valid UTF-8 data.
Before (pseudo-code):
fn store(&mut self, d: &[u8]) -> MyString {
    let ind = first_non_utf8_ind(d);
    self.buffer.extend(&d[..ind]);
    // ...
}
Writing a custom “is ascii” fast path helped shave some time off.
Additionally, we can speculate that our data is UTF-8 and store it unconditionally. Only if it is not UTF-8 do we backtrack and truncate the buffer. By removing the store’s dependency on the non-UTF-8 index, our function should be more amenable to the CPU’s speculative execution.
After:
fn store(&mut self, d: &[u8]) -> MyString {
    self.buffer.extend(d);
    let ind = first_non_utf8_ind(d);
    if ind != d.len() {
        self.buffer.truncate(/* ... */);
    }
    // ...
}
These optimizations provided solid incremental improvements, but by far the biggest performance boon was relaxing the invariant to allow non-UTF-8 data. Instead, string data is conventionally UTF-8, and any UTF-8 validation is deferred until required by a caller. Why pay the UTF-8 validation cost when deserializing millions of strings when only a small fraction may actually be required by the end user?
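To illustrate the deferred validation with a simplified, hypothetical type (not my real one): callers get raw bytes cheaply and only pay for the UTF-8 check when they actually want a str.
struct RawStr<'a>(&'a [u8]);

impl<'a> RawStr<'a> {
    // Cheap: no validation; enough for hashing, equality, or passing along.
    fn as_bytes(&self) -> &'a [u8] {
        self.0
    }

    // The validation cost is paid here, per call site, instead of once per
    // deserialized string.
    fn to_str(&self) -> Result<&'a str, std::str::Utf8Error> {
        std::str::from_utf8(self.0)
    }
}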
Thread local deserialization
With my little arena allocator, the string data would be indices into some data somewhere. Is this “just laundering the unsafe pointer arithmetic behind array indexing” that is brought up in Hacker News comments? Maybe. Is it possible to mix up indices between two arena instances as Rust is not a dependently typed language? Absolutely, perhaps I could borrow the concept of “branded indices” from compact_arena10.
My allocator can be used in deserialization:
- Set a thread local to a new instance of the allocator
- Run the deserialization implementation for the top level data model
- Deserialize strings with a custom deserialization method that stores the data in the allocator and returns our index
- After deserialization, take the allocator out of the thread local and return it in a tuple alongside the data the caller wanted.
Using thread local storage during deserialization feels foreign, but I’m not the first to resort to this technique11. Though, in an effort to squeeze out performance, I made a pinky promise to the compiler that all the mutable allocator references were short-lived so I could drop the runtime checks imposed by RefCell in favor of an UnsafeCell. And this is after applying the special thread local const syntax to elide lazy initialization checks12.
thread_local! {
    static ARENA: UnsafeCell<MyStringArena> = const {
        UnsafeCell::new(MyStringArena::new())
    };
}
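As a rough sketch (my simplification, not the exact production code), a string field can then route through the arena with a deserialize_with function that stores the raw bytes and hands back an index:
fn deserialize_arena_str<'de, D>(deserializer: D) -> Result<MyString, D::Error>
where
    D: serde::Deserializer<'de>,
{
    struct ArenaStrVisitor;

    impl<'de> serde::de::Visitor<'de> for ArenaStrVisitor {
        type Value = MyString;

        fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
            f.write_str("a string")
        }

        fn visit_bytes<E: serde::de::Error>(self, v: &[u8]) -> Result<MyString, E> {
            // Store the bytes in the thread-local arena and return the small index.
            // SAFETY: the mutable borrow of the arena lives only for this call.
            Ok(ARENA.with(|arena| unsafe { (*arena.get()).store(v) }))
        }
    }

    deserializer.deserialize_bytes(ArenaStrVisitor)
}
A field would then opt in with #[serde(deserialize_with = "deserialize_arena_str")].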
The solution worked well enough. In benchmarks, this arena would be 2.5x faster than the global allocator version.
Though, it nagged at me that every string deserialization required thread-local synchronization, as it wasn’t provable that the allocator would stay the same for the entire deserialization. Even if this is a cheap operation, it bugged me that a solution that could properly curry an allocator through deserialization would emit more optimized code.
But what sank this experiment is that the ergonomics of this “owned” data get worse the closer it’s inspected:
- String indices are useless on their own (can’t compare or hash them).
- Rigmarole is required to “hydrate” these indices into useful data objects (i.e., create a mirrored struct with all strings resolved to string references; see the sketch below)
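Here’s roughly what that hydration dance looks like, with hypothetical types and a flat byte buffer standing in for the real arena:
struct StrIndex {
    start: usize,
    len: usize,
}

struct ArenaBuffer {
    bytes: Vec<u8>,
}

impl ArenaBuffer {
    fn resolve(&self, ix: &StrIndex) -> &str {
        std::str::from_utf8(&self.bytes[ix.start..ix.start + ix.len]).unwrap()
    }
}

// Deserialized form: indices only, useless without the arena they came from.
struct Citizen {
    name: StrIndex,
}

// Mirrored "hydrated" form that callers actually want to work with.
struct HydratedCitizen<'a> {
    name: &'a str,
}

fn hydrate<'a>(citizen: &Citizen, arena: &'a ArenaBuffer) -> HydratedCitizen<'a> {
    HydratedCitizen {
        name: arena.resolve(&citizen.name),
    }
}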
Perhaps this is why allocators like this aren’t popular. I carried forward the notion of conventionally UTF-8 strings, but scrapped my allocator and turned to bumpalo instead.
Bumpalo serde deserialization
Time to give deserialization with an arena another try. An arena allocator seems well suited for deserializing large structures as all the allocation (and the drop) can logically be seen as a single phase.
If the reason for experimenting with a homebrewed arena was to facilitate serde deserialization, how else can we solve the problem? We’d somehow need to make all the Deserialize implementations aware of the allocator, which seems like it’ll be a lot of code-gen, so proc-macros might be our best bet13.
I’ve written a proc-macro before to automatically derive a Deserialize implementation that, in part, can aggregate duplicate fields instead of returning an error. It is not what I would consider “fun” code to write.
So this task seems perfect for LLM assistance. It’s a lot of code, and there’s an undeniable slop to it, so I’ll post the prompt instead of the code.
Create a syn v2.0 proc-macro that is similar to serde_derive’s Deserialize, except that it weaves a bumpalo bump allocator through each level of deserialization.
Below is the code I want to achieve via this new ArenaDeserialize proc-macro:

#[derive(ArenaDeserialize)]
struct City<'bump> {
    #[arena(alias = "n")]
    name: bumpalo::collections::String<'bump>,
    citizens: bumpalo::collections::Vec<'bump, Citizen>,
}

#[derive(ArenaDeserialize)]
struct Citizen<'bump> {
    first_name: &'bump str,
    #[arena(default)]
    age: i16,
    data: CitizenData<'bump>,
    #[arena(deserialize_with = "deserialize_map_as_pair_seq")]
    tags: bumpalo::collections::Vec<'bump, (u16, &'bump str)>,
}

#[derive(ArenaDeserialize)]
struct CitizenData<'bump>(&'bump [u8]);

Based on the above, here are the requirements:
- Support an “alias” field attribute
- Support a “default” field attribute
- Support customization with a “deserialize_with” field attribute
- Support string and byte slices
- Support deserializing Option and primitives found in the Rust core and std library
- Support transparent newtype deserialization
- Support deserializing bumpalo collections
- Support deserializing nested structs with allocator lifetime
- Do not support deserializing data types that rely on the global allocator (Box, String, Vec, HashMap, etc.) as these will not be dropped if part of a bumpalo allocation
- Use a single ArenaSeed that implements DeserializeSeed for each type:

struct ArenaSeed<'bump, T> {
    allocator: &'bump bumpalo::Bump,
    marker: PhantomData<fn() -> T>,
}

- Everything should be format agnostic, but serde_json can be used for unit tests, and ensure all tests pass with cargo test.
- Use inspiration from another proc-macro that automatically derives Deserialize, JominiDeserialize: https://raw.githubusercontent.com/rakaly/jomini/refs/heads/master/jomini_derive/src/lib.rs
I had a working implementation after 1-2 hours of gentle guidance. It’s far from perfect but it’s something I can iterate on as I discover bugs or edge cases.
I think the ROI here is worth it. Using bumpalo over a home-grown thread-local arena netted a cool 15% improvement, on top of an approach that was already 2.5x faster than the global allocator.
Some may find bumpalo’s lack of builtin collections restrictive. I don’t, but I support the effort to stabilize Rust collections to be generic over an allocator, as I know others desperately want this.
The derived code will be tied to bumpalo, but as we’ve seen bumpalo isn’t the only game in town and isn’t the fastest. Thus, it’s a good idea to define a trait to unify bump allocators for the purposes of deserialization, abstracting over different allocator implementations while presenting a consistent API for collections like Vec and String.
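I won’t claim this is the right shape, but as a strawman sketch (my own, not an existing crate’s API), the derive could target a small trait like this, with bumpalo as one implementation:
// Assumes bumpalo is built with its `collections` feature.
trait ArenaAlloc<'bump> {
    type Vec<T: 'bump>: Extend<T>;
    type String;

    fn new_vec<T: 'bump>(&'bump self) -> Self::Vec<T>;
    fn alloc_string(&'bump self, s: &str) -> Self::String;
    fn alloc_str(&'bump self, s: &str) -> &'bump str;
}

impl<'bump> ArenaAlloc<'bump> for bumpalo::Bump {
    type Vec<T: 'bump> = bumpalo::collections::Vec<'bump, T>;
    type String = bumpalo::collections::String<'bump>;

    fn new_vec<T: 'bump>(&'bump self) -> Self::Vec<T> {
        bumpalo::collections::Vec::new_in(self)
    }

    fn alloc_string(&'bump self, s: &str) -> Self::String {
        bumpalo::collections::String::from_str_in(s, self)
    }

    fn alloc_str(&'bump self, s: &str) -> &'bump str {
        bumpalo::Bump::alloc_str(self, s)
    }
}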
Once deserialization with a custom allocator is done, the work is far from over, as the allocator and its lifetime become viral across many structs and functions. It seems analogous to introducing an async function in terms of the amount of upstream impact.
Mimalloc
I’ve kinda drilled straight into deserializing with an arena, but there are a whole host of topics that come up when optimizing data models and deserialization.
At the start of the post we swapped out the global allocator in Wasm for an allocation benchmark, but we should also swap out the global allocator for the deserialization benchmark on a native system.
My go-to native global allocator is Mimalloc, and when I swapped it in, I observed a 2x throughput improvement.
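Swapping it in is a two-line change with the mimalloc crate (this is the crate’s standard usage, nothing project specific):
use mimalloc::MiMalloc;

// Route all heap allocations in the binary through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;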
For those keeping track at home, this brings the standard String/Vec version of the benchmark (using the global allocator) close to matching our arena allocator performance.
How incredible! No need to contort your types to weave an allocator lifetime through them or suffer the ergonomics of my home-brewed string type based on indices. Just keep using String.
Is swapping to Mimalloc a free lunch? Probably not. A single benchmark is nothing to stake a technical decision on. Still, I came across the Microsoft Rust Guidelines, which recommend Mimalloc for all apps14.
Between this article and the article where I recommend using Mimalloc over the default musl allocator15, I might be considered a Mimalloc-shill, but I just think it’s neat.
Deserialize in place
There are situations where Mimalloc isn’t feasible (like with our favorite wasm32-unknown-unknown target16). Can we still reduce the memory pressure of types that require the global allocator?
Well, if the code is structured to repeatedly deserialize into a pre-existing instance, you can use serde’s public but hidden deserialize_in_place, which takes in a deserializer and a previous instance of our object to mutate.
#[derive(Deserialize, Default)]
struct MyData {
    my_field: Vec<String>,
}

// ...

let mut data = MyData::default();
let mut deserializer = MyDeserializer::from(&input);
Deserialize::deserialize_in_place(&mut deserializer, &mut data)
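The payoff comes from reuse: keep a single instance alive and deserialize into it repeatedly so the Vec and String capacities from earlier iterations are recycled. A sketch of that shape, using serde_json purely as a stand-in format and a hypothetical parse_all helper:
use serde::Deserialize;

fn parse_all(inputs: &[&[u8]]) -> Result<usize, serde_json::Error> {
    let mut data = MyData::default();
    let mut total = 0;
    for input in inputs {
        let mut deserializer = serde_json::Deserializer::from_slice(input);
        // Mutates `data` in place, reusing its existing allocations where possible.
        Deserialize::deserialize_in_place(&mut deserializer, &mut data)?;
        total += data.my_field.len();
    }
    Ok(total)
}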
Autogenerating deserialize_in_place requires one to depend on serde_derive directly as the feature is not exposed through serde:
serde = { version = "1.0.195", features = [] }
serde_derive = { version = "*", features = ["deserialize_in_place"] }
The good news is that, when it works, it works well. Being able to reuse Vec and String allocations benchmarked twice as fast for the system allocator. Mimalloc only showed a ~10% improvement, which is wild to think about. This suggests that when using Mimalloc, the cost of allocations becomes negligible enough that deserialize_in_place provides diminishing returns.
Unfortunately for deserialize_in_place, it’s finicky. For instance, we may want to customize deserialization with a serde field attribute:
#[derive(Deserialize, Default)]
struct MyData {
    #[serde(deserialize_with = "...")]
    my_field: Vec<String>,
}
Using the deserialize_with attribute silently reverts the implementation back to allocating a whole new instance. While this behavior is correct, it is opaque: unless the project has automated performance testing, developers won’t notice when a change stops the optimization from being emitted. And since this API is hidden, there’s very little motivation to document the optimization criteria; one has to examine the source code to know why17.
Another downside with deserialize_in_place is that it stamps out a bunch more code. I’m keenly aware of the impact inlining and monomorphization from Deserialize implementations have on compile times and artifact sizes18, so seeing twice as much code emitted for Deserialize implementations makes me wince.
But this API certainly seems like a good tool to have in one’s toolbelt. Perhaps deserialize_in_place won’t be hidden for long?19
So many possible tangents
There are still so many more tangents to cover when it comes to potential deserialization performance improvements.
I want to skip discussions on string interning, which is important for mitigating the cost of duplicate data. There are already articles covering this topic20. In a benchmark where every string deserialized is unique, the additional computations required from interning caused roughly a 2x slowdown. Your mileage will vary, but hashing and bookkeeping required for interning does not come free.
I want to skip discussions on small string optimizations that allow for data to be stored on the stack (another topic on which much ink has been spilled). As I’ve noted before21, Rust small string optimization libraries seem to fixate on being the same size as a String, which means on 64-bit platforms one can inline 20-24 bytes, but on 32-bit (e.g., Wasm), only 10-12 can be inlined. I’ve been burned before by small string optimizations that benchmarked positively elsewhere but were harmful in the browser. Small string optimizations can be useful, but are too data dependent.
Zero-copy deserialization is cool, but not applicable when deserializing from an IO stream or if only a small subset of data is required (eg: it might not make sense to hold onto a 100MB buffer just to avoid 1MB of copies).
Perhaps the most important caveat: these allocation optimizations matter most when the deserializer itself is fast. My benchmarks use a real-world binary format with length-prefixed strings that trivializes parsing. When deserializing text formats like JSON, the parser itself often becomes the bottleneck, reducing the relative impact of allocation strategy.
Conclusion
In the memory allocation benchmark in Wasm, we saw that performance of memory allocators can vary wildly. Using a bump allocator can help mitigate the volatility across browsers and can improve performance by an order of magnitude in certain situations.
Before committing to weaving an allocator lifetime through your entire codebase, try simpler approaches first: benchmark with a different global allocator like Mimalloc, or if the data model is repeatedly deserialized, see if it can be adapted to be deserialize_in_place compatible. In my benchmarks, both strategies were 2x faster than the default global allocator.
If those approaches fall short (particularly in Wasm where Mimalloc isn’t available), then consider writing a proc-macro to deserialize with an arena allocator. This is the fastest approach, but comes with the complexity cost of viral lifetimes and custom code generation. It’s what I ultimately chose for my use case, but doesn’t make it the right decision for you.
- https://www.reddit.com/r/rust/comments/1jlopns/turns_out_using_custom_allocators_makes_using/
- https://github.com/bluurryy/bump-scope/discussions/108#discussioncomment-14594015
- https://docs.rs/oxc_allocator/latest/oxc_allocator/struct.Allocator.html
- https://github.com/alexcrichton/dlmalloc-rs/blob/f7118f74af4942a638683c276381d6ca4edcc34e/src/dlmalloc.rs#L267
- https://docs.rs/compact_arena/latest/compact_arena/index.html
- https://microsoft.github.io/rust-guidelines/guidelines/apps/index.html#M-MIMALLOC-APPS
- Default musl allocator considered harmful (to performance)
- https://github.com/serde-rs/serde/blob/179954784683f35942ac2e1f076e0361b47f8178/serde_derive/src/de.rs#L318
- The power of interning: making a time series database 2000x smaller in Rust
Comments
If you'd like to leave a comment, please email [email protected]