Wasm's curious word size and the SWAR advantage

Oftentimes the size of the addressable space matches the system architecture’s preferred unit of operation, also known as its word size. On the vast majority of consumer hardware and servers (both x86 and Arm), both are 64 bits.

However, this is not the case with WebAssembly (Wasm), which has 32 bits of addressable space[1] but a word size of 64 bits.

But wait, Wasm is a portable execution format. Isn’t word size dictated by the underlying system that executes the Wasm, meaning it can’t possibly be known ahead of time?

Correct, but a compiler has to make a choice, and if the compilation target natively supports 64-bit values and operations, it seems intuitive to emit them and leave it to the runtime to stitch together the appropriate machine code on 32-bit machines.[2]

Let’s confirm this 64-bit affinity with an example. Imagine we are writing a JSON parser and after visiting the comma field delimiter we want to count the number of newlines and tabs, so we can skip over them and get to the next field. With SIMD within a register (SWAR), we can write a function that processes 8 bytes at a time.

#[inline(always)]
const fn repeat_byte(b: u8) -> u64 {
    // u64::MAX / 255 == 0x0101010101010101, so this broadcasts `b` into all 8 byte lanes
    (b as u64) * (u64::MAX / 255)
}

#[inline]
pub const fn leading_whitespace_swar(value: u64) -> u32 {
    let mask1 = repeat_byte(b'\t');
    let mask2 = repeat_byte(b'\n');
    // XOR zeroes out any byte lane that equals the mask byte
    let res1 = value ^ mask1;
    let res2 = value ^ mask2;
    // trailing_zeros / 8 = how many low (leading, little-endian) bytes were zeroed
    (res1 & res2).trailing_zeros() >> 3
}
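
As a quick sanity check (an illustrative snippet of my own, not part of the original parser), feeding the function eight bytes that begin with two tabs and a newline should report three leading whitespace bytes:

fn main() {
    // '\t', '\t', '\n' followed by five non-whitespace bytes
    let value = u64::from_le_bytes(*b"\t\t\nvalue");
    assert_eq!(leading_whitespace_swar(value), 3);
}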

Ignoring for a moment whether this optimization is worth it, we can see in the Wasm output below that it is filled with i64 instructions and values.

leading_whitespace_swar:
        local.get       0
        i64.const       651061555542690057
        i64.xor 
        local.get       0
        i64.const       723401728380766730
        i64.xor 
        i64.and 
        i64.ctz 
        i32.wrap_i64
        i32.const       3
        i32.shr_u
        end_function

This looks close enough to the output when compiling on a 64-bit machine.

Assembly on 64-bit x86 machine
leading_whitespace_swar:
        movabs  rcx, 651061555542690057
        xor     rcx, rdi
        movabs  rax, 723401728380766730
        xor     rax, rdi
        and     rax, rcx
        je      .LBB0_1
        rep       bsf rax, rax
        shr     eax, 3
        ret
.LBB0_1:
        mov     eax, 64
        shr     eax, 3
        ret

In contrast, when compiling to a 32-bit system like i686-unknown-linux-gnu, the compiler prefers working with 32-bit values.

Assembly on 32-bit x86 machine
leading_whitespace_swar:
        push    esi
        mov     ecx, dword ptr [esp + 12]
        mov     eax, dword ptr [esp + 8]
        mov     edx, ecx
        mov     esi, eax
        xor     ecx, 168430090
        xor     eax, 168430090
        xor     edx, 151587081
        xor     esi, 151587081
        and     ecx, edx
        and     eax, esi
        mov     edx, 32
        bsf     ecx, ecx
        cmovne  edx, ecx
        add     edx, 32
        bsf     eax, eax
        cmove   eax, edx
        shr     eax, 3
        pop     esi
        ret

64-bit atomics on 32-bit platforms?

I had the question, “does the compiler emit anything we can use to know the word size?”

My hypothesis became: a platform’s largest natively supported atomic size should equal its word size.

This is not true. Take the following example:

fn main() {
    #[cfg(all(target_has_atomic = "32", not(target_has_atomic = "64")))]
    let word_size = 4;
    
    #[cfg(target_has_atomic = "64")]
    let word_size = 8;

    println!("word size: {}", word_size);
    println!("address size: {}", std::mem::size_of::<usize>());
}

Running it on a 32-bit machine will (most likely) output a word size of 8 bytes – which is incorrect!

We can see how 64-bit atomic values are supported on 32-bit platforms with the following:

use std::sync::atomic::{AtomicU64, Ordering};

static CALLS: AtomicU64 = AtomicU64::new(0);

fn add_calls() {
    CALLS.fetch_add(1, Ordering::SeqCst);
}

This produces a healthy amount of assembly, which amounts to a compare-and-exchange loop.

Assembly of 64-bit atomics on 32-bit platform
add_calls:
        push    ebx
        push    esi
        call    .L3$pb
.L3$pb:
        pop     esi
.Ltmp25:
        add     esi, offset _GLOBAL_OFFSET_TABLE_+(.Ltmp25-.L3$pb)
        mov     edx, dword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF+4]
        mov     eax, dword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF]
.LBB3_1:
        mov     ebx, eax
        mov     ecx, edx
        add     ebx, 1
        adc     ecx, 0
        lock            cmpxchg8b       qword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF]
        jne     .LBB3_1
        pop     esi
        pop     ebx
        ret
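
The loop at .LBB3_1 is that retry loop. In Rust terms, the lowering is roughly equivalent to the following sketch (my own approximation of what the compiler does, not its actual output):

use std::sync::atomic::{AtomicU64, Ordering};

fn fetch_add_via_cas(calls: &AtomicU64) -> u64 {
    let mut current = calls.load(Ordering::SeqCst);
    loop {
        // Try to publish current + 1; if another thread raced us, retry with
        // the freshly observed value (the `lock cmpxchg8b` loop above).
        match calls.compare_exchange_weak(
            current,
            current.wrapping_add(1),
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            Ok(previous) => return previous,
            Err(actual) => current = actual,
        }
    }
}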

What can we take away?

Since the size of the addressable space can differ from the word size, using the pointer-sized usize type (or its conditional compilation sibling, target_pointer_width) as the basis for optimizations is not a good idea. In our whitespace-skipping example, we would be kneecapping Wasm if we only looked at the next 4 bytes instead of 8.

If needed, exclude the Wasm family from these conditional compilation checks so it can use 8 bytes (a minimal sketch follows the next paragraph). The root of the problem is that the compiler is free to emit whatever instructions it wants, so it’s hard to make guarantees about which widths it will choose during optimization.

Said another way, the compiler is free to emit 32-bit operations when optimal regardless of the platform. So there’s a chance that both the 32-bit and 64-bit x86 output could be identical!
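
A minimal sketch of that exclusion, using a hypothetical CHUNK_BYTES constant (not something from the original code) to size the reads:

// 64-bit targets and Wasm both handle 8-byte SWAR well; remaining 32-bit
// targets fall back to 4-byte chunks.
#[cfg(any(target_pointer_width = "64", target_family = "wasm"))]
const CHUNK_BYTES: usize = 8;

#[cfg(not(any(target_pointer_width = "64", target_family = "wasm")))]
const CHUNK_BYTES: usize = 4;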

At this point, it feels like talking about a niche within a niche. Is all this mental accounting worth it?

Benchmarking SWAR Across Platforms

Instead of SWAR, we can write a straightforward implementation over an array to skip leading whitespace.

#[no_mangle]
pub fn leading_whitespace(data: [u8; 8]) -> u8 {
    data.iter().take_while(|x| matches!(x, b'\t' | b'\n')).count() as u8
}
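
To keep ourselves honest, here is a quick (illustrative) check that the two implementations agree on a sample chunk:

fn main() {
    let chunk = *b"\t\n\t\ndata";
    assert_eq!(
        u32::from(leading_whitespace(chunk)),
        leading_whitespace_swar(u64::from_le_bytes(chunk))
    );
}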

Looking at the emitted instructions for our three platforms below, the compiler clearly loves loop unrolling (and hooray for zero-cost abstractions like iterators).

Wasm compilation
leading_whitespace:
        i32.const       1
        local.set       1
        block           
        local.get       0
        i32.load8_u     0
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       0
        return
        end_block
        block           
        local.get       0
        i32.load8_u     1
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.gt_u
        br_if           0
        block           
        local.get       0
        i32.load8_u     2
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       2
        return
        end_block
        block           
        local.get       0
        i32.load8_u     3
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       3
        return
        end_block
        block           
        local.get       0
        i32.load8_u     4
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       4
        return
        end_block
        block           
        local.get       0
        i32.load8_u     5
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       5
        return
        end_block
        block           
        local.get       0
        i32.load8_u     6
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       1
        i32.le_u
        br_if           0
        i32.const       6
        return
        end_block
        i32.const       8
        i32.const       7
        local.get       0
        i32.load8_u     7
        i32.const       -9
        i32.add 
        i32.const       255
        i32.and 
        i32.const       2
        i32.lt_u
        i32.select
        local.set       1
        end_block
        local.get       1
        end_function
64-bit x86 compilation
leading_whitespace:
        lea     eax, [rdi - 9]
        cmp     al, 1
        ja      .LBB1_1
        mov     ecx, edi
        shr     ecx, 8
        add     cl, -9
        mov     al, 1
        cmp     cl, 1
        ja      .LBB1_9
        mov     ecx, edi
        shr     ecx, 16
        add     cl, -9
        mov     al, 2
        cmp     cl, 1
        ja      .LBB1_9
        mov     ecx, edi
        shr     ecx, 24
        add     cl, -9
        mov     al, 3
        cmp     cl, 1
        ja      .LBB1_9
        mov     rcx, rdi
        shr     rcx, 32
        add     cl, -9
        mov     al, 4
        cmp     cl, 1
        ja      .LBB1_9
        mov     rcx, rdi
        shr     rcx, 40
        add     cl, -9
        mov     al, 5
        cmp     cl, 1
        ja      .LBB1_9
        mov     rcx, rdi
        shr     rcx, 48
        add     cl, -9
        mov     al, 6
        cmp     cl, 1
        ja      .LBB1_9
        shr     rdi, 56
        add     dil, -9
        xor     eax, eax
        cmp     dil, 2
        adc     al, 7
.LBB1_9:
        ret
.LBB1_1:
        xor     eax, eax
        ret
32-bit x86 compilation
leading_whitespace:
        mov     ecx, dword ptr [esp + 4]
        movzx   eax, byte ptr [ecx]
        add     al, -9
        cmp     al, 1
        ja      .LBB1_1
        movzx   edx, byte ptr [ecx + 1]
        mov     al, 1
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   edx, byte ptr [ecx + 2]
        mov     al, 2
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   edx, byte ptr [ecx + 3]
        mov     al, 3
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   edx, byte ptr [ecx + 4]
        mov     al, 4
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   edx, byte ptr [ecx + 5]
        mov     al, 5
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   edx, byte ptr [ecx + 6]
        mov     al, 6
        add     dl, -9
        cmp     dl, 1
        ja      .LBB1_9
        movzx   ecx, byte ptr [ecx + 7]
        xor     eax, eax
        add     cl, -9
        cmp     cl, 2
        adc     al, 7
.LBB1_9:
        ret
.LBB1_1:
        xor     eax, eax
        ret

For the curious, <[u8]>::trim_ascii_start had analogous loop-unrolled output.
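
A quick illustration of that standard-library method (my own snippet; note it trims every ASCII whitespace character, not just tabs and newlines, and if memory serves it stabilized around Rust 1.80):

fn main() {
    let data = b" \t\n{\"key\": 1}";
    // The leading space, tab, and newline are all trimmed.
    let skipped = data.len() - data.trim_ascii_start().len();
    assert_eq!(skipped, 3);
}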

Normally, when doing performance analysis on Wasm, there will be a step to compare it against Wasm compiled with SIMD instructions via:

RUSTFLAGS="-C target-feature=+simd128" # ...

This additional Rust flag is needed because Wasm was initially released without SIMD support, but nowadays there is close to ubiquitous support in browser and server runtimes.[3]

However, since the compiler eschewed any form of autovectorization in favor of loop unrolling, we don’t need to consider the Wasm SIMD-enabled build as it has no effect.

Time to put our two implementations in a shootout with criterion. The benchmark code is fairly involved, but it compares the implementations across data sizes ranging from 64 to 4096 bytes and across a variety of patterns:

  • No whitespace
  • All whitespace
  • Half whitespace
  • Variable whitespace (to measure if that pesky branch predictor has an effect)
Rust benchmark
use criterion::{BenchmarkId, Criterion, Throughput, criterion_group, criterion_main};

#[inline(always)]
pub(crate) const fn repeat_byte(b: u8) -> u64 {
    (b as u64) * (u64::MAX / 255)
}

pub fn leading_whitespace_swar(value: u64) -> u32 {
    let mask1 = repeat_byte(b'\t');
    let mask2 = repeat_byte(b'\n');
    let res1 = value ^ mask1;
    let res2 = value ^ mask2;
    (res1 & res2).trailing_zeros() >> 3
}

pub fn leading_whitespace(data: [u8; 8]) -> u8 {
    data.iter().take_while(|x| matches!(x, b'\t' | b'\n')).count() as u8
}

fn criterion_benchmark(c: &mut Criterion) {
    // Create test data with varying amounts of whitespace
    let test_sizes = [64, 256, 1024, 4096];

    let mut group = c.benchmark_group("whitespace_scanning");

    for size in test_sizes {
        group.throughput(Throughput::Bytes(size as u64));

        // Test case 1: Mixed whitespace and non-whitespace
        let pattern = [b'\n', b'\t', b'\t', b'\t', b'c', b'd', b'e', b'f'];
        let mut mixed_ws = Vec::with_capacity(size);
        for _ in 0..(size / pattern.len()) {
            mixed_ws.extend_from_slice(&pattern);
        }

        // Test case 2: All whitespace
        let all_ws_pattern = [b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'\t', b'\n'];
        let mut all_ws = Vec::with_capacity(size);
        for _ in 0..(size / all_ws_pattern.len()) {
            all_ws.extend_from_slice(&all_ws_pattern);
        }

        // Test case 3: No whitespace
        let no_ws_pattern = [b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h'];
        let mut no_ws = Vec::with_capacity(size);
        for _ in 0..(size / no_ws_pattern.len()) {
            no_ws.extend_from_slice(&no_ws_pattern);
        }

        // Test case 4: Variable whitespace
        let variable_patterns = [
            [b'\t', b'\n', b'\t', b'\n', b'a', b'b', b'c', b'd'],  // 4 whitespace
            [b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'a', b'b'], // 6 whitespace
            [b'\t', b'\n', b'a', b'b', b'c', b'd', b'e', b'f'],     // 2 whitespace
            [b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h'],       // 0 whitespace
            [b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'\t', b'\n'], // 8 whitespace
            [b'\t', b'\t', b'a', b'b', b'c', b'd', b'e', b'f'],     // 2 whitespace
            [b'\t', b'\n', b'\t', b'a', b'b', b'c', b'd', b'e'],    // 3 whitespace
            [b'\n', b'a', b'b', b'c', b'd', b'e', b'f', b'g'],      // 1 whitespace
            [b'\t', b'\n', b'\t', b'\n', b'\t', b'b', b'c', b'd'],  // 5 whitespace
        ];

        let mut variable_ws = Vec::with_capacity(size);
        let mut pattern_idx = 0;
        while variable_ws.len() < size {
            variable_ws.extend_from_slice(&variable_patterns[pattern_idx]);
            pattern_idx = (pattern_idx + 1) % variable_patterns.len();
        }

        // Benchmark for mixed whitespace (original)
        group.bench_with_input(BenchmarkId::new("half_swar", size), &mixed_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let value = u64::from_le_bytes([
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ]);
                    count += leading_whitespace_swar(value);
                }
                count
            })
        });

        group.bench_with_input(BenchmarkId::new("half_standard", size), &mixed_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let chunk_array = [
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ];
                    count += leading_whitespace(chunk_array);
                }
                count
            })
        });

        // Benchmark for all whitespace
        group.bench_with_input(BenchmarkId::new("all_ws_swar", size), &all_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let value = u64::from_le_bytes([
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ]);
                    count += leading_whitespace_swar(value);
                }
                count
            })
        });

        group.bench_with_input(BenchmarkId::new("all_ws_standard", size), &all_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let chunk_array = [
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ];
                    count += leading_whitespace(chunk_array);
                }
                count
            })
        });

        // Benchmark for no whitespace
        group.bench_with_input(BenchmarkId::new("no_ws_swar", size), &no_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let value = u64::from_le_bytes([
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ]);
                    count += leading_whitespace_swar(value);
                }
                count
            })
        });

        group.bench_with_input(BenchmarkId::new("no_ws_standard", size), &no_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let chunk_array = [
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ];
                    count += leading_whitespace(chunk_array);
                }
                count
            })
        });

        // Benchmark for variable whitespace
        group.bench_with_input(BenchmarkId::new("variable_ws_swar", size), &variable_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let value = u64::from_le_bytes([
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ]);
                    count += leading_whitespace_swar(value);
                }
                count
            })
        });

        group.bench_with_input(BenchmarkId::new("variable_ws_standard", size), &variable_ws, |b, data| {
            b.iter(|| {
                let mut count = 0;
                for chunk in data.chunks_exact(8) {
                    let chunk_array = [
                        chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
                        chunk[7],
                    ];
                    count += leading_whitespace(chunk_array);
                }
                count
            })
        });
    }

    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The only other thing of note is that running the benchmark under Wasm does require some finesse:

# Cargo.toml
[target.'cfg(target_family = "wasm")'.dev-dependencies]
criterion = { version = "0.5.1", default-features = false }

cargo build --release --bench=swar --target=wasm32-wasip1

ls -t target/wasm32-wasip1/release/deps/*.wasm

Then one has to choose the Wasm runtime, knowing that there can be different performance characteristics between them. I default to Wasmer, but Wasmtime is also excellent.

The results are interesting. I’m reporting the max throughput in GiB/s observed for each test run as the benchmark didn’t vary too much with the size parameter.

Benchmark   All     Half    None    Variable
x86         3.63    6.52    16.01   6.26
x86 SWAR    16.29   13.07   13.10   13.47
Wasm        2.79    4.72    10.93   4.86
Wasm SWAR   13.04   13.04   13.13   13.02

Takeaways:

  • The SWAR implementations have the least volatility between inputs, as they are constant time. All inputs execute the same instructions, which is nice from a predictability standpoint.
  • On average, the SWAR implementations are 2-3x faster, with the one exception being x86 when the input has no whitespace, which I don’t want to discount. Know your input: if the application can bet that the input will always be minified, then using the standard implementation is a reasonable approach.
  • Wasm SWAR is neck and neck with x86 SWAR, which is incredible because I believe that the main value proposition of Wasm is write once[4], so this performance parity is just icing on top.

Key Takeaways and Recommendations

While it’s unfortunate that it’s easier to target optimizations at platforms with 64-bit addressing rather than 64-bit operations, I’m feeling pretty good at the end of this. The foray into performance optimization yielded a nice speedup and predictability, and discovering a SWAR solution to a problem scratches a creative itch in my brain.

I’m not recommending everyone go out and start using this implementation to skip whitespace (especially since this “whitespace” assumes only tabs and newlines). For actual inspiration, there are great blog posts and projects by Daniel Lemire that should yield even better speedups.[5]

Optimization is hard. There are so many nuances and platform differences. Always profile. I have come to view projects that use usize as a heuristic for optimizations with a bit of skepticism. For instance, most (all?) Rust small string optimization libraries[6] seem to fixate on being the same size as String, which means that on 64-bit platforms one can inline 20-24 bytes, but on 32-bit (e.g. Wasm) only 10-12 bytes can be inlined. My preference would be to declare upfront how many bytes can be inlined, allowing for predictable performance when all my strings tend to be 10-20 bytes! We’re out of time today, so I’ll save that adventure for another time.
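
To give just a flavor of that preference, a purely illustrative shape (not any existing crate’s API): with the capacity as a const generic, the inline budget is declared by the caller rather than derived from size_of::<usize>().

struct InlineString<const N: usize> {
    // Number of initialized bytes in buf; the capacity N is fixed across platforms.
    len: u8,
    buf: [u8; N],
}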


  1. As of January 2025, one can opt into 64 bits of addressable space with Memory64; however, most shouldn’t use it, as there are some performance drawbacks. See: Is Memory64 actually worth using?

  2. As of March 2025, Wasmer does not support 32-bit machines, and support via Wasmtime is still nascent with Pulley, which comes with a “~10x order-of-magnitude slowdown”.

  3. For browsers, caniuse reports 96.1% support, while every implementation in the WebAssembly.org feature extension table supports fixed-width SIMD.

  4. I’ve written a lot of Wasm, and written about Wasm a lot. I don’t try to lean into the performance aspects so expectations prior to shipping aren’t inflated. Read more at: The WebAssembly value proposition is write once, not performance

  5. despacer, Scan HTML even faster with SIMD instructions (C++ and C#), and of course, simdjson

  6. The Rust String Rosetta Comparison
