Parsing Performance Improvement with Tapes and Spatial Locality

2020-08-12: This article described a performant method to parse data formats and how to aggregate serde fields with the input buffered. While the serde demonstration is still valid, I opted to create a derive macro that aggregates fields and isn't susceptible to edge cases.

There’s a format that I need to parse in Rust. It’s analogous to JSON but with a few twists:

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}

Note the duplicated "core" key: both values should be aggregated when deserializing into:

struct MyDocument {
  core: Vec<String>,
  nums: Vec<u8>,
}
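
In other words, deserializing the document above should produce a value like this (an illustrative sketch, assuming a working deserializer):

let expected = MyDocument {
  core: vec!["core1".to_string(), "core2".to_string()],
  nums: vec![1, 2, 3, 4, 5],
};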

Unfortunately we have to choose one or the other: stream the document iteratively, or support serde deserialization of these non-consecutive fields.

Since I consider serde deserialization a must-have feature, I resigned myself to buffering the entire file in memory. To the end user, nothing is as ergonomic and time-saving as serde deserialization. My initial designs had a nice iterative approach, combining the best things about quick-xml, rust-csv, and pulldown-cmark. I was even going to write a post about designing a nice pull-based parser in Rust, showing it can be a great foundation for higher-level parsers, but I stopped when the performance fell short.

Some serde deserializers, like quick-xml and serde-xml-rs, hack around the serde limitation by only supporting deserialization of multiple fields that appear consecutively. This allows them to still read the file iteratively, though supporting non-consecutive fields remains an open issue.

Representing a JSON Document

Since we have the whole document in memory, we need an in-memory representation that we can write a deserializer for. Here's what a simplified JSON document can look like:

enum JsonValue<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array(Vec<JsonValue<'a>>),
  Object(Vec<(&'a [u8], JsonValue<'a>)>),
}

This structure is hierarchical. It’s intuitive. You can see how it easily maps to JSON.

The issue is that this structure is a horrible performance trap. When deserializing documents with many objects and arrays, the memory access and reallocation of the Array and Object vectors is significant. Profiling with valgrind showed that the code spent 35% of its time inside jemalloc's realloc. The CPU has to jump all over RAM when constructing a JsonValue, as these vectors could live anywhere on the heap. From Latency Numbers Every Programmer Should Know, accessing main memory takes 100ns, which is 200x slower than an L1 cache reference.
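
To make the trap concrete, here is the earlier document built by hand with this representation. Every Vec below is a separate heap allocation, and during parsing the inner vectors may be reallocated repeatedly as elements are pushed:

// Each Vec lives in its own heap allocation, so even this small
// document scatters its nodes across memory.
let doc = JsonValue::Object(vec![
  (b"core".as_slice(), JsonValue::Text(b"core1")),
  (b"nums".as_slice(), JsonValue::Array(vec![
    JsonValue::Number(1),
    JsonValue::Number(2),
    JsonValue::Number(3),
    JsonValue::Number(4),
    JsonValue::Number(5),
  ])),
  (b"core".as_slice(), JsonValue::Text(b"core2")),
]);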

So that’s the performance issue, but before we solve that we’ll need a algorithm to solve deserializing out of order fields.

Serde Deserializing Out of Order Fields

Here is the approach I use for data formats that allow duplicate keys in an object where the keys are not consecutive / siblings.

This method requires two additional vectors per object deserialized: one to keep track of the indices of the values that belong to a given field key, and another to keep track of which keys have already been deserialized (see the sketch below). The alternative would be to create an intermediate "group by" structure, but this was not needed, as the method above never showed up in profiling results, despite how inefficient it may seem (it's O(n^2), where n is the number of fields in an object, as deserializing the k'th field requires (n - k) comparisons).
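
Here is a minimal sketch of that bookkeeping. The flat (key, value index) representation and the helper's name are hypothetical, not the actual deserializer's API:

// Collect the value indices for `key` the first time serde asks for it;
// later duplicates of the key yield nothing because it is recorded in `seen`.
fn indices_for_field(
  fields: &[(&[u8], usize)], // an object's (field key, value index) pairs
  key: &[u8],
  seen: &mut Vec<Vec<u8>>,   // keys that have already been deserialized
) -> Vec<usize> {
  if seen.iter().any(|k| k.as_slice() == key) {
    return Vec::new(); // this key was already handled
  }
  seen.push(key.to_vec());
  fields
    .iter()
    .filter(|(k, _)| *k == key)       // every duplicate of the key...
    .map(|(_, value_idx)| *value_idx) // ...contributes its value's index
    .collect()
}

Scanning all of an object's fields for each requested key is where the O(n^2) bound above comes from.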

Parsing to a Tape

A tape is a flattened document hierarchy: a single vector of tokens that aids future traversals. This is not a novel technique, but I've seen so little literature on the subject that I figured I'd add my voice. There is a nice description of how the tape works in Daniel Lemire's simdjson. To be concrete, this is what a tape looks like in Rust:

enum JsonValue {
  Text { data_idx: usize },       // index into the data vector below
  Number(u64),
  Array { tape_end_idx: usize },  // index of this array's End token
  Object { tape_end_idx: usize }, // index of this object's End token
  End { tape_start_idx: usize },  // index of the matching Array / Object token
}

struct Tapes<'a> {
  tokens: Vec<JsonValue>,
  data: Vec<&'a [u8]>, // string data borrowed from the input document
}

So the earlier example of

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}

Would turn into a tape like so (the numbers are the tape / data indices):

Tokens:

0 : Text { data_idx: 0 }
1 : Text { data_idx: 1 }
2 : Text { data_idx: 2 }
3 : Array { tape_end_idx: 9 }
4 : Number(1)
5 : Number(2)
6 : Number(3)
7 : Number(4)
8 : Number(5)
9 : End { tape_start_idx: 3 }
10: Text { data_idx: 3 }
11: Text { data_idx: 4 }

Data:

0: core
1: core1
2: nums
3: core
4: core2
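
This layout is what makes traversal cheap: tape_end_idx lets a deserializer skip over an unwanted value with a constant-time jump instead of a recursive walk. A minimal sketch, with a hypothetical helper:

// Return the index of the token just past the value that starts at `idx`.
fn skip_value(tokens: &[JsonValue], idx: usize) -> usize {
  match tokens[idx] {
    // Containers: jump straight past their End token.
    JsonValue::Array { tape_end_idx }
    | JsonValue::Object { tape_end_idx } => tape_end_idx + 1,
    // Scalars occupy a single token.
    _ => idx + 1,
  }
}

In the tape above, skipping the array that starts at token 3 jumps straight to token 10, the next object key.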

Notes:

Why create the data tape instead of embedding the slice directly into JsonValue::Text?

An alternative design could be:

enum JsonValue2<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

and then there would be no need for a separate data tape. The main reason is space: a &[u8] is a two-word fat pointer (pointer plus length), so storing it inline alongside the enum's discriminant pushes the size from 16 to 24 bytes.

println!("{}", std::mem::size_of::<JsonValue>()); // 16
println!("{}", std::mem::size_of::<JsonValue2>()); // 24

JsonValue2 is 50% larger. This becomes apparent when deserializing documents that don't contain many strings (i.e., documents mainly composed of numbers). The performance hit from the extra indirection has been negligible (probably because the data tape is often accessed sequentially and so tends to stay in cache), but I'm always on the lookout for better methods.

Performance Improvements

To me this is somewhat mind blowing: a 6x decrease in latency when deserializing in the browser. It took an idea that seemed untenable and made it possible. I was so blown away that I felt compelled to write about it so that I wouldn't forget it, and to add a tip to the internet in case someone comes looking.

Comments

If you'd like to leave a comment, please email [email protected]