The dark side of inlining and monomorphization

I felt clever. After spending a week developing an incremental lexer that would allow pdx.tools to deserialize decompressed saves at constant memory usage, I was excited to integrate it. I had worked very hard to craft a serde deserializer optimized to pull from a lending, incremental lexer without the performance caveats seen in serde_json.

Writing a serde deserializer over a lending iterator probably deserves a separate post of its own, as it required circumventing the borrow checker with raw pointers and disabling miri’s stacked borrows lint (this has since been fixed). I’m not happy about these sins, but I am pleased with the result. Benchmarks showed that deserializing over some Read instead of a byte slice suffered only a 10% hit to latency while reducing memory usage by at least 50%. The only price was trading away future sanity with all the unsafe, so I could finally close my profiling tools, valgrind and kcachegrind, and get to application integration.

When I did integrate it into the application, I noticed two things: the Wasm payload grew substantially, and compile times increased.

I don’t know about you, but these tradeoffs are hard to justify. An increase in Wasm means more bandwidth and client side compilation. What happened? Let’s take a look.

Twiggy allows us to see what is contributing to the Wasm size. Here are the top culprits from twiggy top:

 Bytes ┊ Shallow % ┊ Item
303840 ┊     2.54% ┊ "function names" subsection
134632 ┊     1.13% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h352de6f43b8ae00b
130638 ┊     1.09% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h5328ebeb4632e6d3
129647 ┊     1.08% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h7af3f430a6b9d3ee
125627 ┊     1.05% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h32a2f3c55af990a1
125627 ┊     1.05% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h8f04040b52d4986e
111698 ┊     0.93% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h50629ecefdd086c2
110730 ┊     0.93% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::hef484f6791f2ae21
104705 ┊     0.88% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::hcf2753629e8b3681
101597 ┊     0.85% ┊ <<eu4save::models::GameState as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h13ab440bf967de04
101125 ┊     0.85% ┊ <<<eu4save::models::Eu4Save as serde::de::Deserialize>::deserialize::Eu4SaveFlatten as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h75aedf2b28993ee6
 99633 ┊     0.83% ┊ <<<eu4save::models::Eu4Save as serde::de::Deserialize>::deserialize::Eu4SaveFlatten as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h6d6cfac90051fff8
 96179 ┊     0.80% ┊ <<<eu4save::models::Eu4Save as serde::de::Deserialize>::deserialize::Eu4SaveFlatten as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h0347d4be1486aefc
 85472 ┊     0.72% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h37e58ca049b14e29
 85436 ┊     0.71% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h0efac83a0e6fa7f4

There are a lot of serde deserialize mentions. I suppose that shouldn’t be surprising when the data model you’re working with has around a thousand fields (and even that covers only about a third of the fields in the format).

What’s interesting is that Country is listed 10 times in the culprit list. Summing up these instances shows that Country deserialize implementations are responsible for about 10% of the entire Wasm payload. That’s over 1 MB dedicated to just deserializing a single struct. Sacrebleu!

Why are there so many implementations?

Monomorphization

A struct’s Deserialize implementation is generic over the deserializer, so each unique deserializer type that deserializes a Country will cause the compiler to emit a new copy of Country::deserialize tailored to that deserializer. This mouthful is known as monomorphization.
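
To make this concrete, here is a minimal sketch (unrelated to the pdx.tools code) of how one generic function becomes multiple compiled copies, one per concrete type it is used with:

use std::fmt::Display;

// The compiler emits a separate copy of `render` for every concrete `T` it
// is called with, just as serde emits one Country::deserialize per concrete
// Deserializer type.
fn render<T: Display>(value: T) -> String {
    format!("token: {value}")
}

fn main() {
    let _a = render(42u32); // instantiates render::<u32>
    let _b = render("FRA"); // instantiates render::<&str>
}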

I enjoy this Wikipedia quote on monomorphization, as you can immediately recognize the caveat:

The resulting code is generally faster than dynamic dispatch, but may require more compilation time and storage space due to duplicating the function body

Ten Deserialize implementations seems like a lot, so I think it is a good exercise to trace through the code, identify where monomorphization is occurring, and make sure it makes sense. Tools like twiggy dominators helped, and I’ve heard ddbug can point one in the right direction, but I think there’s a large enough gap in understanding monomorphization that there’s room for more tools.

I was able to narrow in on a seemingly innocent-looking piece of code. It’s a function that wraps the deserializer in serde_path_to_error when we are operating in debug mode, so that we can trace deserialization errors (invaluable when deserializing an undocumented, proprietary format that regularly exceeds 100 MB).

    fn deserialize<'de, T, D>(&self, deser: D) -> Result<T, Eu4GameError>
    where
        T: DeserializeOwned,
        D: Deserializer<'de, Error = Eu4Error>,
    {
        if self.debug {
            serde_path_to_error::deserialize(deser)
                .map_err(|e| Eu4GameError::DeserializeDebug(e.to_string()))
        } else {
            Ok(T::deserialize(deser)?)
        }
    }

Since wrapping a Deserializer to expose a new Deserializer yields a new instantiation of T::deserialize, this function doubles the number of deserialize instantiations for any T. Commenting out the debug branch reduced the Wasm payload by 45% and cut compile times in half.

Given how big a win this is, it seems prudent to consider what downsides there would be to always running in debug mode. This win, though, is unrelated to the newly introduced lexing change that allows deserialization over a generic Read.

And that right there is the crux of the problem. Previously, everything was deserialized through a byte slice, but now every distinct Read type we deserialize from conjures up a new deserialization implementation. Instead of one implementation, there are three, one for each Read type the application feeds in.

Additionally, there are separate deserializers depending on whether the data is plaintext or binary. So for those keeping track of deserializers at home: 3 Read types, times 2 data formats (plaintext and binary), times 2 modes (debug and non-debug).

If my math is correct, that is 12 distinct instances of Country::deserialize that the compiler will implement. Again, incredible! The 10 vs 12 implementations discrepancy is due to the code not stressing all possibilities (i.e., zstd compressed files don’t go through the debug logic).

What if we wanted to optionally create a checksum of the data while we are deserializing? We can wrap the Read with HighwayHash, but now there’ll be up to 24 instantiations and our compile time and payload size just doubled!

Restricting ourselves to only byte slices results in 4 possible instantiations (binary/text + debug/no debug).

Introducing generics is making things complicated. Sometimes keeping things simple and only dealing with byte slices is the easiest path to joy.

Deserializer Trait Object

We can solve this code bloat by introducing trait objects and dynamic dispatch, so no implementation is duplicated via monomorphization.

let trait_object1 = reader as &mut dyn std::io::Read;

Going back to our earlier deserialization sample, where there’s logic to choose the deserializer based on debug mode, introducing dynamic dispatch here is tricky as serde::Deserializer is not object safe due to requiring Self: Sized.

That’s where the erased-serde crate comes into play; it supports our exact use case.

We can rewrite the logic as follows:

    fn deserialize<'de, T, D>(&self, deser: D) -> Result<T, Eu4GameError>
    where
        T: DeserializeOwned,
        D: serde::Deserializer<'de, Error = Eu4Error>,
    {
        use erased_serde::Deserializer;
        let mut track = serde_path_to_error::Track::new();
        // Both branches erase to the same trait object type, so only one
        // T::deserialize is instantiated downstream.
        let d: Box<dyn Deserializer> = if self.debug {
            let deserializer = serde_path_to_error::Deserializer::new(deser, &mut track);
            Box::new(<dyn Deserializer>::erase(deserializer))
        } else {
            Box::new(<dyn Deserializer>::erase(deser))
        };

        // The erased error no longer converts via `?`, so map it explicitly.
        T::deserialize(d).map_err(|e| Eu4GameError::DeserializeDebug(e.to_string()))
    }

The result: compile times and Wasm size cut in half, the same as if we had simply removed the debug branch.

This lunch is not free, however, as I measured the performance penalty of erased deserializers to be a 35% increase in latency. Your mileage will vary. So close to an easy win. For use cases where performance is not a headline feature, erased-serde is an excellent addition.

Read Trait Object

There’s an additional solution. Instead of dynamic dispatch over the deserializer methods, dynamic dispatch over the Read implementation is more powerful.

use std::io::Read;

fn zstd_deserialize<T>(data: &[u8]) -> Result<T> {
    let mut decoder = zstd::Decoder::new(data)?;
    // Cast the concrete decoder to a Read trait object so the deserializer
    // is instantiated once for `&mut dyn Read`, regardless of the source.
    let input = &mut decoder as &mut dyn Read;
    let mut modeller = Eu4Deserializer::from_reader(input);
    modeller.deserialize()
}

As long as the client programmer (i.e., me) is consistent in always passing in a Read trait object, it is possible to realize only one instance of Country::deserialize, assuming erased-serde is used to unify not only debug mode, but also the binary and text formats.
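
For illustration, here is a sketch of a slice-backed call site that mirrors zstd_deserialize above: wrapping the bytes in a Cursor and casting to the same &mut dyn Read means the compiler still sees exactly one reader type.

use std::io::{Cursor, Read};

fn slice_deserialize<T>(data: &[u8]) -> Result<T> {
    // Same trait object type as the zstd path, so no extra instantiation.
    let mut cursor = Cursor::new(data);
    let input = &mut cursor as &mut dyn Read;
    let mut modeller = Eu4Deserializer::from_reader(input);
    modeller.deserialize()
}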

Setting aside erased-serde and just leveraging Read trait objects, the Wasm size and compilation times halved, all without any performance loss. To build intuition for why there isn’t a performance loss, it’s important to look at buffer management:

Using approximations, a 100 MB file will fill our 32 KB buffer about 3-4 thousand times. 3-4 thousand virtual function calls into the underlying Read implementation is negligible, whereas 1 million virtual function calls for a deserializer trait object to visit all fields is in the realm of 3 orders of magnitude more expensive (I’m not sure this is exactly how erased-serde works, but we’re just trying to build intuition).
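
As a rough sketch of that intuition (hypothetical code, not the actual deserializer), a dyn Read only pays a virtual call per buffer refill:

use std::io::Read;

// A 100 MB stream refills a 32 KB buffer roughly 3,200 times, so the dyn
// Read costs thousands of virtual calls, not millions.
fn count_refills(reader: &mut dyn Read) -> std::io::Result<usize> {
    let mut buf = [0u8; 32 * 1024];
    let mut refills = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(refills);
        }
        refills += 1;
    }
}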

An underrated feature of trait objects is that using them requires no change to the underlying API. Client programmers are free to use whatever technique works for them.

Inline Bloat

Through Read trait objects we’ve been able to significantly reduce the amount of code we’re compiling at zero cost to performance, but we’re still 50% above baseline.

Rerunning twiggy shows a couple of Country::deserialize implementations consuming over 130 KB each.

 Bytes ┊ Shallow % ┊ Item
205757 ┊     3.47% ┊ "function names" subsection
135861 ┊     2.29% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::hf57e9114966f864b
132112 ┊     2.23% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::ha51ca389e709f64d
106116 ┊     1.79% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h5400dfc10519f373
 81753 ┊     1.38% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h4bd707820db92d32

This feels wrong; there’s far too much code here for any single deserialization.

Time to reach for another tool: cargo-show-asm. We’ll have it list the functions in our assembly along with how many instructions each contains.

 # snip
 103 "<<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map" [53729, 37351, 38167, 54993]
 # snip

And wow, Country::deserialize has a lot of assembly, around 30 thousand lines of it.

cargo-show-asm allows one to inspect these lines, and they tipped me off to a function inside the lexer for the text format:

impl<R> TokenReader<R>
where
    R: Read,
{
    #[inline(always)]
    pub fn read(&mut self) -> Result<Token, ReaderError> {
        // snip 300 lines of code
    }
}

The read function advances the lexer and returns the consumed token. The text deserializer invokes this function to know what to deserialize next. Weighing in at around 300 lines, this function is not simple, but I annotated it with #[inline(always)] as it is called millions of times, and inlining it showed a 10% improvement in latency.

But as it says on the tin, the consequence of always inlining is that each Deserialize implementation duplicates this heavy function.

Removing the “always” annotation knocked 100 KB off Country::deserialize and many other data models. You can see the effect in the latest twiggy top output: the 130 KB Country::deserialize implementations have fallen to around 30 KB.

 Bytes ┊ Shallow % ┊ Item
193482 ┊     5.17% ┊ "function names" subsection
106113 ┊     2.83% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h5400dfc10519f373
 81751 ┊     2.18% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h4bd707820db92d32
 77384 ┊     2.07% ┊ data[0]
 47149 ┊     1.26% ┊ <<eu4save::models::Province as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h4725daf7a629682b
 39162 ┊     1.05% ┊ <<eu4save::models::Diplomacy as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::hb32fa5a0babeab4a
 38876 ┊     1.04% ┊ <<eu4save::models::Province as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::h4c4cf039b5115540
 36886 ┊     0.99% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::hf57e9114966f864b
 33761 ┊     0.90% ┊ <<eu4save::models::Country as serde::de::Deserialize>::deserialize::__Visitor as serde::de::Visitor>::visit_map::ha51ca389e709f64d

The other large Country::deserialize implementations belong to the simpler binary format, where always inlining has an outsized impact on performance.

After “un-inlining” the text format, we’re back to our baseline in terms of size and compile time.

Inlining is a spectrum; it’s not all or nothing. It may prove beneficial to identify a hot path that can be always inlined without significant bloat, and, when straying from the hot path, invoke a function that is never inlined. On cold paths, the always-inlined function can itself be wrapped in another function that is never inlined.
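
For example, a minimal sketch of that hot/cold split (not the actual pdx.tools lexer; next_byte and refill are hypothetical helpers) might look like this:

use std::io::Read;

// The tiny common case stays #[inline(always)], while the bulky refill path
// is #[inline(never)] so it is compiled once instead of being duplicated
// into every Deserialize implementation that calls it.
struct TokenReader<R> {
    reader: R,
    buf: Vec<u8>,
    pos: usize,
}

impl<R: Read> TokenReader<R> {
    #[inline(always)]
    fn next_byte(&mut self) -> std::io::Result<Option<u8>> {
        // Hot path: a byte is already buffered.
        if let Some(&b) = self.buf.get(self.pos) {
            self.pos += 1;
            return Ok(Some(b));
        }
        // Cold path: refill from the underlying reader.
        self.refill()
    }

    #[inline(never)]
    fn refill(&mut self) -> std::io::Result<Option<u8>> {
        self.buf.resize(32 * 1024, 0);
        let read = self.reader.read(&mut self.buf)?;
        self.buf.truncate(read);
        if read == 0 {
            self.pos = 0;
            return Ok(None);
        }
        self.pos = 1;
        Ok(Some(self.buf[0]))
    }
}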

Conclusion

Phew, I think I’ve spent enough time on this. What an adventure just to maintain the status quo of payload size and compile times!

The classic space-time tradeoff always seems to appear. This time, space is the amount of code in the artifact. I don’t think any one approach is necessarily correct, but I do think it is important to be aware of the situation and know how to adjust the solution. I was ready to merge this code without investigating, but I’m glad I did investigate; it has reinforced the need for an automated process in CI that flags unusual file size increases, using something like preactjs/compressed-size-action. It would be nice if there were an equivalent for compile times, though I imagine that is tougher due to the volatility of CI machines.
