
Wasm's curious word size and the SWAR advantage
Oftentimes the size of the addressable space matches the system architecture’s preferred unit of operation, also known as its word size. On the vast majority of consumer hardware and servers (both x86 and Arm), these are both 64 bits.
However, this is not the case with WebAssembly (Wasm), which has 32 bits of addressable space[^1] but a word size of 64 bits.
But wait, Wasm is a portable execution format. Isn’t word size dictated by the underlying system that executes the Wasm? So it can’t possibly know the word size ahead of time.
Correct, but when you are a compiler, a choice has to be made, and if the compilation target natively supports 64-bit values and operations, it seems intuitive to emit them. Leave it to the runtime to stitch together the appropriate machine code on 32-bit machines.[^2]
Let’s confirm this 64-bit affinity with an example. Imagine we are writing a JSON parser and after visiting the comma field delimiter we want to count the number of newlines and tabs, so we can skip over them and get to the next field. With SIMD within a register (SWAR), we can write a function that processes 8 bytes at a time.
#[inline(always)]
const fn repeat_byte(b: u8) -> u64 {
    // Broadcast the byte into all 8 lanes: u64::MAX / 255 == 0x0101010101010101
    (b as u64) * (u64::MAX / 255)
}

#[inline]
pub const fn leading_whitespace_swar(value: u64) -> u32 {
    let mask1 = repeat_byte(b'\t');
    let mask2 = repeat_byte(b'\n');
    // XOR zeroes out any byte equal to the mask byte, so a lane holding a tab
    // or a newline becomes 0x00 in res1 & res2.
    let res1 = value ^ mask1;
    let res2 = value ^ mask2;
    // Count the zeroed low-order bytes and convert from bits to bytes.
    (res1 & res2).trailing_zeros() >> 3
}
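To make the calling convention concrete, here's a quick (hypothetical) usage example; it assumes leading_whitespace_swar from above is in scope and that the 8 bytes are loaded with u64::from_le_bytes, so the first byte in memory lands in the least significant lane:

```rust
fn main() {
    // Three whitespace bytes ('\t', '\n', '\t') before the ',' field delimiter.
    let chunk = *b"\t\n\t,42ab";
    let value = u64::from_le_bytes(chunk);
    assert_eq!(leading_whitespace_swar(value), 3);
}
```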
Ignoring for a moment whether this optimization is worth it, we can see in the Wasm output below that it is filled with i64 instructions and values.
leading_whitespace_swar:
local.get 0
i64.const 651061555542690057
i64.xor
local.get 0
i64.const 723401728380766730
i64.xor
i64.and
i64.ctz
i32.wrap_i64
i32.const 3
i32.shr_u
end_function
This looks close enough to the output when compiling on a 64-bit machine.
Assembly on 64-bit x86 machine
leading_whitespace_swar:
movabs rcx, 651061555542690057
xor rcx, rdi
movabs rax, 723401728380766730
xor rax, rdi
and rax, rcx
je .LBB0_1
rep bsf rax, rax
shr eax, 3
ret
.LBB0_1:
mov eax, 64
shr eax, 3
ret
In contrast, when compiling to a 32-bit system like i686-unknown-linux-gnu, the compiler prefers working with 32-bit values.
Assembly on 32-bit x86 machine
leading_whitespace_swar:
push esi
mov ecx, dword ptr [esp + 12]
mov eax, dword ptr [esp + 8]
mov edx, ecx
mov esi, eax
xor ecx, 168430090
xor eax, 168430090
xor edx, 151587081
xor esi, 151587081
and ecx, edx
and eax, esi
mov edx, 32
bsf ecx, ecx
cmovne edx, ecx
add edx, 32
bsf eax, eax
cmove eax, edx
shr eax, 3
pop esi
ret
64-bit atomics on 32-bit platforms?
I had the question, “does the compiler emit anything we can use to know the word size?”
My hypothesis became: the largest atomic size a platform supports should equal its word size.
This is not true. Take the following example:
fn main() {
    #[cfg(all(target_has_atomic = "32", not(target_has_atomic = "64")))]
    let word_size = 4;

    #[cfg(target_has_atomic = "64")]
    let word_size = 8;

    println!("word size: {}", word_size);
    println!("address size: {}", std::mem::size_of::<usize>());
}
Running it on a 32-bit machine will (most likely) output a word size of 8 bytes – which is incorrect!
We can see how 64-bit atomic values are supported on 32-bit platforms with the following:
use std::sync::atomic::{AtomicU64, Ordering};

static CALLS: AtomicU64 = AtomicU64::new(0);

fn add_calls() {
    CALLS.fetch_add(1, Ordering::SeqCst);
}
This compiles to a healthy amount of assembly that amounts to a compare-and-exchange loop.
Assembly of 64-bit atomics on 32-bit platform
add_calls:
push ebx
push esi
call .L3$pb
.L3$pb:
pop esi
.Ltmp25:
add esi, offset _GLOBAL_OFFSET_TABLE_+(.Ltmp25-.L3$pb)
mov edx, dword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF+4]
mov eax, dword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF]
.LBB3_1:
mov ebx, eax
mov ecx, edx
add ebx, 1
adc ecx, 0
lock cmpxchg8b qword ptr [esi + example::CALLS::hc0e5ba0be355b7aa@GOTOFF]
jne .LBB3_1
pop esi
pop ebx
ret
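In Rust terms, that loop is roughly what one would write by hand with compare_exchange_weak. This is just a sketch of the lowering, not what the compiler literally generates:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Rough equivalent of the lock cmpxchg8b retry loop: load the current value,
// then keep attempting a compare-and-exchange until no other thread races us.
fn add_calls_by_hand(calls: &AtomicU64) {
    let mut current = calls.load(Ordering::Relaxed);
    loop {
        match calls.compare_exchange_weak(
            current,
            current.wrapping_add(1),
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            Ok(_) => break,
            Err(actual) => current = actual, // lost the race; retry with the new value
        }
    }
}
```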
What can we take away?
Since the size of the addressable space can differ from the word size, using the pointer-sized usize type (or its conditional-compilation sibling, target_pointer_width) as the basis for optimizations is not a good idea. One would be kneecapping Wasm if, in our whitespace-skipping example, we only looked at the next 4 bytes instead of 8.
If needed, exclude the Wasm family from these conditional compilations so it can use 8 bytes (see the sketch below). The root of the problem is that the compiler is free to emit whatever instructions it wants, so it’s hard to make guarantees about what it will decide during optimization.
Said another way, the compiler is free to emit 32-bit operations when optimal regardless of the platform. So there’s a chance that both the 32-bit and 64-bit x86 output could be identical!
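For illustration, here's a minimal sketch of that kind of carve-out. The Chunk alias is a hypothetical name; the cfg simply treats any Wasm target as 64-bit capable:

```rust
// Pick the SWAR lane width at compile time. Wasm pointers are 32 bits, but the
// instruction set handles i64 natively, so it gets the 8-byte path too.
#[cfg(any(target_pointer_width = "64", target_family = "wasm"))]
type Chunk = u64;

#[cfg(not(any(target_pointer_width = "64", target_family = "wasm")))]
type Chunk = u32;

const CHUNK_BYTES: usize = core::mem::size_of::<Chunk>();
```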
At this point, it feels like talking about a niche within a niche. Is all this mental accounting worth it?
Benchmarking SWAR Across Platforms
Instead of SWAR, we can write a straightforward implementation over an array to skip leading whitespace.
#[no_mangle]
pub fn leading_whitespace(data: [u8; 8]) -> u8 {
    data.iter().take_while(|x| matches!(x, b'\t' | b'\n')).count() as u8
}
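As a quick check that the two implementations agree on a sample chunk (a hypothetical test, assuming both functions above are in scope):

```rust
#[test]
fn swar_matches_straightforward() {
    // Four whitespace bytes, then "cdef".
    let chunk = *b"\t\n\t\tcdef";
    let swar = leading_whitespace_swar(u64::from_le_bytes(chunk));
    let standard = leading_whitespace(chunk) as u32;
    assert_eq!(swar, 4);
    assert_eq!(swar, standard);
}
```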
Looking at the emitted instructions for our three platforms below, the compiler clearly loves loop unrolling (and hooray for zero-cost abstractions like iterators).
Wasm compilation
leading_whitespace:
i32.const 1
local.set 1
block
local.get 0
i32.load8_u 0
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 0
return
end_block
block
local.get 0
i32.load8_u 1
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.gt_u
br_if 0
block
local.get 0
i32.load8_u 2
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 2
return
end_block
block
local.get 0
i32.load8_u 3
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 3
return
end_block
block
local.get 0
i32.load8_u 4
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 4
return
end_block
block
local.get 0
i32.load8_u 5
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 5
return
end_block
block
local.get 0
i32.load8_u 6
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 1
i32.le_u
br_if 0
i32.const 6
return
end_block
i32.const 8
i32.const 7
local.get 0
i32.load8_u 7
i32.const -9
i32.add
i32.const 255
i32.and
i32.const 2
i32.lt_u
i32.select
local.set 1
end_block
local.get 1
end_function
64-bit x86 compilation
leading_whitespace:
lea eax, [rdi - 9]
cmp al, 1
ja .LBB1_1
mov ecx, edi
shr ecx, 8
add cl, -9
mov al, 1
cmp cl, 1
ja .LBB1_9
mov ecx, edi
shr ecx, 16
add cl, -9
mov al, 2
cmp cl, 1
ja .LBB1_9
mov ecx, edi
shr ecx, 24
add cl, -9
mov al, 3
cmp cl, 1
ja .LBB1_9
mov rcx, rdi
shr rcx, 32
add cl, -9
mov al, 4
cmp cl, 1
ja .LBB1_9
mov rcx, rdi
shr rcx, 40
add cl, -9
mov al, 5
cmp cl, 1
ja .LBB1_9
mov rcx, rdi
shr rcx, 48
add cl, -9
mov al, 6
cmp cl, 1
ja .LBB1_9
shr rdi, 56
add dil, -9
xor eax, eax
cmp dil, 2
adc al, 7
.LBB1_9:
ret
.LBB1_1:
xor eax, eax
ret
32-bit x86 compilation
leading_whitespace:
mov ecx, dword ptr [esp + 4]
movzx eax, byte ptr [ecx]
add al, -9
cmp al, 1
ja .LBB1_1
movzx edx, byte ptr [ecx + 1]
mov al, 1
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx edx, byte ptr [ecx + 2]
mov al, 2
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx edx, byte ptr [ecx + 3]
mov al, 3
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx edx, byte ptr [ecx + 4]
mov al, 4
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx edx, byte ptr [ecx + 5]
mov al, 5
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx edx, byte ptr [ecx + 6]
mov al, 6
add dl, -9
cmp dl, 1
ja .LBB1_9
movzx ecx, byte ptr [ecx + 7]
xor eax, eax
add cl, -9
cmp cl, 2
adc al, 7
.LBB1_9:
ret
.LBB1_1:
xor eax, eax
ret
For the curious, &[u8]::trim_ascii_start had an analogous loop-unrolled output.
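If the standard library behavior is good enough, it can be used directly. A small sketch (note that it trims every ASCII whitespace byte, not just the tab and newline our SWAR routine targets, and it needs a reasonably recent Rust, 1.80+ if I recall correctly):

```rust
fn main() {
    // trim_ascii_start strips any leading ASCII whitespace
    // (space, tab, newline, carriage return, form feed).
    let rest = b"\t\n  {\"key\": 1}".trim_ascii_start();
    assert_eq!(rest, b"{\"key\": 1}".as_slice());
}
```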
Normally, when doing performance analysis on Wasm, there will be a step to compare it against Wasm compiled with SIMD instructions via:
RUSTFLAGS="-C target-feature=+simd128" # ...
This additional Rust flag is needed because Wasm was initially released without SIMD support; nowadays, support is close to ubiquitous in browser and server runtimes.[^3]
However, since the compiler eschewed any form of autovectorization in favor of loop unrolling, we don’t need to consider the Wasm SIMD-enabled build as it has no effect.
Time to put our two implementations in a shootout with criterion. The benchmark code is fairly involved, but it compares the implementations across input sizes ranging from 64 to 4096 bytes and across several patterns:
- No whitespace
- All whitespace
- Half whitespace
- Variable whitespace (to measure whether that pesky branch predictor has an effect)
Rust benchmark
use criterion::{BenchmarkId, Criterion, Throughput, criterion_group, criterion_main};
#[inline(always)]
pub(crate) const fn repeat_byte(b: u8) -> u64 {
(b as u64) * (u64::MAX / 255)
}
pub fn leading_whitespace_swar(value: u64) -> u32 {
let mask1 = repeat_byte(b'\t');
let mask2 = repeat_byte(b'\n');
let res1 = value ^ mask1;
let res2 = value ^ mask2;
(res1 & res2).trailing_zeros() >> 3
}
pub fn leading_whitespace(data: [u8; 8]) -> u8 {
data.iter().take_while(|x| matches!(x, b'\t' | b'\n')).count() as u8
}
fn criterion_benchmark(c: &mut Criterion) {
// Create test data with varying amounts of whitespace
let test_sizes = [64, 256, 1024, 4096];
let mut group = c.benchmark_group("whitespace_scanning");
for size in test_sizes {
group.throughput(Throughput::Bytes(size as u64));
// Test case 1: Mixed whitespace and non-whitespace
let pattern = [b'\n', b'\t', b'\t', b'\t', b'c', b'd', b'e', b'f'];
let mut mixed_ws = Vec::with_capacity(size);
for _ in 0..(size / pattern.len()) {
mixed_ws.extend_from_slice(&pattern);
}
// Test case 2: All whitespace
let all_ws_pattern = [b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'\t', b'\n'];
let mut all_ws = Vec::with_capacity(size);
for _ in 0..(size / all_ws_pattern.len()) {
all_ws.extend_from_slice(&all_ws_pattern);
}
// Test case 3: No whitespace
let no_ws_pattern = [b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h'];
let mut no_ws = Vec::with_capacity(size);
for _ in 0..(size / no_ws_pattern.len()) {
no_ws.extend_from_slice(&no_ws_pattern);
}
// Test case 4: Variable whitespace
let variable_patterns = [
[b'\t', b'\n', b'\t', b'\n', b'a', b'b', b'c', b'd'], // 4 whitespace
[b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'a', b'b'], // 6 whitespace
[b'\t', b'\n', b'a', b'b', b'c', b'd', b'e', b'f'], // 2 whitespace
[b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h'], // 0 whitespace
[b'\t', b'\n', b'\t', b'\n', b'\t', b'\n', b'\t', b'\n'], // 8 whitespace
[b'\t', b'\t', b'a', b'b', b'c', b'd', b'e', b'f'], // 2 whitespace
[b'\t', b'\n', b'\t', b'a', b'b', b'c', b'd', b'e'], // 3 whitespace
[b'\n', b'a', b'b', b'c', b'd', b'e', b'f', b'g'], // 1 whitespace
[b'\t', b'\n', b'\t', b'\n', b'\t', b'b', b'c', b'd'], // 5 whitespace
];
let mut variable_ws = Vec::with_capacity(size);
let mut pattern_idx = 0;
while variable_ws.len() < size {
variable_ws.extend_from_slice(&variable_patterns[pattern_idx]);
pattern_idx = (pattern_idx + 1) % variable_patterns.len();
}
// Benchmark for mixed whitespace (original)
group.bench_with_input(BenchmarkId::new("half_swar", size), &mixed_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let value = u64::from_le_bytes([
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
]);
count += leading_whitespace_swar(value);
}
count
})
});
group.bench_with_input(BenchmarkId::new("half_standard", size), &mixed_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let chunk_array = [
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
];
count += leading_whitespace(chunk_array);
}
count
})
});
// Benchmark for all whitespace
group.bench_with_input(BenchmarkId::new("all_ws_swar", size), &all_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let value = u64::from_le_bytes([
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
]);
count += leading_whitespace_swar(value);
}
count
})
});
group.bench_with_input(BenchmarkId::new("all_ws_standard", size), &all_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let chunk_array = [
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
];
count += leading_whitespace(chunk_array);
}
count
})
});
// Benchmark for no whitespace
group.bench_with_input(BenchmarkId::new("no_ws_swar", size), &no_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let value = u64::from_le_bytes([
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
]);
count += leading_whitespace_swar(value);
}
count
})
});
group.bench_with_input(BenchmarkId::new("no_ws_standard", size), &no_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let chunk_array = [
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
];
count += leading_whitespace(chunk_array);
}
count
})
});
// Benchmark for variable whitespace
group.bench_with_input(BenchmarkId::new("variable_ws_swar", size), &variable_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let value = u64::from_le_bytes([
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
]);
count += leading_whitespace_swar(value);
}
count
})
});
group.bench_with_input(BenchmarkId::new("variable_ws_standard", size), &variable_ws, |b, data| {
b.iter(|| {
let mut count = 0;
for chunk in data.chunks_exact(8) {
let chunk_array = [
chunk[0], chunk[1], chunk[2], chunk[3], chunk[4], chunk[5], chunk[6],
chunk[7],
];
count += leading_whitespace(chunk_array);
}
count
})
});
}
group.finish();
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
The only other thing of note is that running the benchmark under Wasm does require some finesse:
[target.'cfg(target_family = "wasm")'.dev-dependencies]
criterion = { version = "0.5.1", default-features = false }
cargo build --release --bench=swar --target=wasm32-wasip1
ls -t target/wasm32-wasip1/release/deps/*.wasm
Then one has to choose the Wasm runtime, knowing that there can be different performance characteristics between them. I default to Wasmer, but Wasmtime is also excellent.
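The invocation then looks something like this (a sketch: the actual file name carries a hash, and criterion binaries expect a --bench argument when run outside of cargo; double-check the argument-passing syntax against wasmer run --help for your version):

```sh
wasmer run target/wasm32-wasip1/release/deps/swar-<hash>.wasm -- --bench
```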
The results are interesting. I’m reporting the max throughput in GiB/s observed for each test run, as the results didn’t vary much with the size parameter.
| Benchmark | All | Half | None | Variable |
|---|---|---|---|---|
| x86 | 3.63 | 6.52 | 16.01 | 6.26 |
| x86 SWAR | 16.29 | 13.07 | 13.10 | 13.47 |
| Wasm | 2.79 | 4.72 | 10.93 | 4.86 |
| Wasm SWAR | 13.04 | 13.04 | 13.13 | 13.02 |
Takeaways:
- The SWAR implementations have the least volatility between inputs, as they are constant time. All inputs execute the same instructions, which is nice from a predictability standpoint.
- On average, the SWAR implementations are 2-3x faster, with the one exception being on x86 when the input has no whitespace, which I don’t want to discount. Know your input: if the application can bet that the input will always be minified, then using the standard implementation is a reasonable approach.
- Wasm SWAR is neck and neck with x86 SWAR, which is incredible because I believe the main value proposition of Wasm is write once[^4], so this performance parity is just icing on top.
Key Takeaways and Recommendations
While it’s unfortunate that it’s easier to target optimizations at platforms with 64-bit addressing than at platforms with 64-bit operations, I’m feeling a little good at the end of this. The foray into performance optimization yielded a nice speedup and better predictability, and discovering a SWAR implementation for a problem scratches a creative itch in my brain.
I’m not recommending everyone go out and start using this implementation to skip whitespace (especially since this “whitespace” covers only tabs and newlines). For actual inspiration, there are great blog posts and projects by Daniel Lemire that should yield even better speedups.[^5]
Optimization is hard. So many nuances and platform differences. Always profile. I have come to view projects that use usize as a heuristic for optimizations with a bit of skepticism. For instance, most (all?) Rust small string optimization libraries[^6] seem to fixate on being the same size as String, which means that on 64-bit platforms one can inline 20-24 bytes, but on 32-bit targets (e.g., Wasm), only 10-12 bytes can be inlined. My preference would be to declare upfront how many bytes can be inlined, allowing for predictable performance when all my strings tend to be 10-20 bytes! We’re out of time today, so I’ll save that adventure for another time.
[^1]: As of January 2025, one can opt into 64 bits of addressable space with Memory64; however, most shouldn’t use it, as there are some performance drawbacks. See: Is Memory64 actually worth using?
[^2]: As of March 2025, Wasmer does not support 32-bit machines, and support via Wasmtime is still nascent with Pulley, which comes with a “~10x order-of-magnitude slowdown”.
[^3]: For browsers, caniuse reports 96.1% support, while every implementation in the WebAssembly.org feature extension table supports fixed-width SIMD.
[^4]: I’ve written a lot of Wasm, and written about Wasm a lot. I don’t try to lean into the performance aspects so expectations prior to shipping aren’t inflated. Read more at: The WebAssembly value proposition is write once, not performance.
[^5]: despacer, Scan HTML even faster with SIMD instructions (C++ and C#), and of course, simdjson.