
Are SIMD gather instructions underrated?
A light-bulb went off over my head when I wrote a loop that contained repetitive lookups with no conditional logic:
// pseudocode
let data = [0u32; 1024];
let indices = [0u16; 1024 * 1024];
let mut output = [0u32; 1024 * 1024];
for i in 0..output.len() {
    output[i] = data[indices[i] as usize];
}
That for-loop is the poster child for AVX2 gather instructions1, as they allow us to look up eight 32-bit values at a time instead of one by one.
SIMD by hand
The first step: write the SIMD version by hand to know what we’re up against.
Let’s envision we are looking up 32-bit data via 16-bit indices:
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const CHUNK_SIZE: usize = 8; // AVX2 processes 8x32-bit integers

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub unsafe fn gather(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    // Widen the eight 16-bit indices into a vector of 32-bit lanes
    // (_mm256_set_epi32 takes its arguments from the highest lane to the lowest).
    let indices = _mm256_set_epi32(
        indices[7] as i32, indices[6] as i32,
        indices[5] as i32, indices[4] as i32,
        indices[3] as i32, indices[2] as i32,
        indices[1] as i32, indices[0] as i32,
    );

    // Each index is scaled by the element size (4 bytes) when forming the address.
    const SCALE: i32 = std::mem::size_of::<u32>() as i32;
    let gathered_data = _mm256_i32gather_epi32::<SCALE>(
        data.as_ptr() as *const i32,
        indices,
    );

    // Write the eight gathered values back out to the output chunk.
    _mm256_storeu_si256(
        output.as_mut_ptr() as *mut __m256i,
        gathered_data,
    );
}
When compiled with:
RUSTFLAGS="-C target-feature=+avx2"
The code yields remarkably terse assembly:
example::gather::h761ab51858afe6cb:
vpmovzxwd ymm0, xmmword ptr [rsi]
vpcmpeqd ymm1, ymm1, ymm1
vpxor xmm2, xmm2, xmm2
vpgatherdd ymm2, dword ptr [rdi + 4*ymm0], ymm1
vmovdqu ymmword ptr [rdx], ymm2
vzeroupper
ret
If only there were a SIMD instruction for gathering with 16-bit indices instead of upcasting them to 32-bit, we’d be able to process 16 elements at once instead of 8.
In an application benchmark, the SIMD gather instructions performed 3x better than the scalar version, filling 1 million elements of the output array in 500μs vs 1500μs. That’s a big enough relative improvement for me to pursue this idea further.
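For context, here is a minimal sketch of the kind of driver loop behind those numbers, assuming the gather kernel above. The timing harness and buffer handling here are stand-ins for illustration, not the actual application benchmark.
use std::time::Instant;

// Minimal driver sketch (not the real benchmark): run the 8-wide `gather`
// kernel across a large index buffer and time it. Any tail shorter than
// CHUNK_SIZE is ignored for brevity.
fn bench_gather(data: &[u32; 1024], indices: &[u16], output: &mut [u32]) {
    assert_eq!(indices.len(), output.len());
    let start = Instant::now();
    for (idx, out) in indices
        .chunks_exact(CHUNK_SIZE)
        .zip(output.chunks_exact_mut(CHUNK_SIZE))
    {
        // SAFETY: requires AVX2 and indices that all fall within `data`.
        unsafe {
            gather(*data, idx.try_into().unwrap(), out.try_into().unwrap());
        }
    }
    println!("gathered {} elements in {:?}", output.len(), start.elapsed());
}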
Auto vectorization
I like to avoid maintaining platform-specific code, so perhaps we can structure our Rust code in such a way that the compiler will emit SIMD gather instructions for us.
Attempt #1:
pub fn autovec(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = data[indices[i] as usize];
    }
}
Let’s take a look at the assembly:
Autovec assembly:
example::autovec::hdd8d5a79acffac38:
mov rax, rdi
movzx edi, word ptr [rsi]
cmp rdi, 1024
jae .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx], ecx
movzx edi, word ptr [rsi + 2]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 4], ecx
movzx edi, word ptr [rsi + 4]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 8], ecx
movzx edi, word ptr [rsi + 6]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 12], ecx
movzx edi, word ptr [rsi + 8]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 16], ecx
movzx edi, word ptr [rsi + 10]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 20], ecx
movzx edi, word ptr [rsi + 12]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 24], ecx
movzx edi, word ptr [rsi + 14]
cmp rdi, 1023
ja .LBB0_9
mov eax, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 28], eax
ret
.LBB0_9:
push rax
lea rdx, [rip + .Lanon.5ebe7e4dbf161fdbe4e81627694922b4.1]
mov esi, 1024
call qword ptr [rip + core::panicking::panic_bounds_check::h1a9bf3d94de0fc80@GOTPCREL]
It’s a lot of assembly, but that is mainly due to the unrolled loop and the conditional jumps that lead to a panic. The panics are Rust’s checked indexing at play, making sure we don’t access invalid memory. Oftentimes there are ways to write Rust such that it can elide bounds checks, but since we are indexing with a dynamic value, the compiler can’t safely elide them here.
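As an aside, one common elision trick (a hypothetical sketch, not what this post benchmarks) is to mask the index so the compiler can prove it is always in range. Note that this silently wraps out-of-range indices instead of panicking, so it changes behavior:
// Hypothetical sketch: because the table length (1024) is a power of two,
// masking with 1023 keeps the index in 0..1024, so the compiler can drop the
// bounds check. Out-of-range indices wrap around instead of panicking.
pub fn masked(
    data: &[u32; 1024],
    indices: &[u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = data[indices[i] as usize & 1023];
    }
}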
How do gather instructions handle invalid indices? From the gather instructions Wikipedia page:
faults occurring from invalid memory accesses by masked-out elements are suppressed
I don’t know exactly how masked values translate into runtime behavior, but perhaps Rust will have better luck with auto vectorization if we pinky promise to the compiler that our indices don’t need bounds checks:
Attempt #2:
pub fn unchecked(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = unsafe { *data.get_unchecked(indices[i] as usize) };
    }
}
The assembly shows that the conditional jumps have been removed, leaving only the unrolled loop:
Unchecked assembly:
example::unchecked::ha0f0c770e0e99451:
movzx eax, word ptr [rsi]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx], eax
movzx eax, word ptr [rsi + 2]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 4], eax
movzx eax, word ptr [rsi + 4]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 8], eax
movzx eax, word ptr [rsi + 6]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 12], eax
movzx eax, word ptr [rsi + 8]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 16], eax
movzx eax, word ptr [rsi + 10]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 20], eax
movzx eax, word ptr [rsi + 12]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 24], eax
movzx eax, word ptr [rsi + 14]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 28], eax
ret
The unchecked version performed slightly better than our first attempt, but still fell far short of the gather instructions.
We can pretty confidently say that Rust will not auto vectorize gather instructions.
Okay, that’s not exactly true: LLVM documents that gather operations are vectorizable2, but with the caveat that vectorizing them is rarely beneficial. There is a Stack Overflow discussion on this very topic3.
Memory latency
We can build an intuition for why the compiler won’t auto vectorize, and it starts with the application itself, which is the best-case scenario for gather instructions.
The data that we are looking up is 1024 32-bit elements – that’s 4 KiB of memory, which fits comfortably within the 32 KiB L1 data cache on one of my Ryzen 5900x cores.4
An L1 cache access takes 4 CPU cycles on Zen 3 CPUs5. Back-of-the-napkin math shows that 1 million L1 accesses, consolidated into 125k gather instructions, should require about 500k CPU cycles, or roughly 100μs when running at 4.8 GHz. This matched the timings measured when memory writes were removed from the application.
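Spelled out, that estimate looks like this (assumed numbers only: 8 lanes per gather, a 4-cycle L1 load, a 4.8 GHz clock):
// Back-of-the-napkin estimate using the assumed numbers above.
fn napkin_math() {
    let lookups = 1_000_000.0_f64;
    let lanes_per_gather = 8.0; // one AVX2 gather covers 8 lookups
    let l1_latency_cycles = 4.0; // Zen 3 L1 load-to-use latency
    let clock_hz = 4.8e9;

    let gathers = lookups / lanes_per_gather; // 125,000 gather instructions
    let cycles = gathers * l1_latency_cycles; // ~500,000 cycles
    let micros = cycles / clock_hz * 1e6; // ~104 microseconds
    println!("{gathers} gathers, {cycles} cycles, ~{micros:.0} us");
}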
If our data were bigger and the indices randomly distributed, each access could be 10-100x more expensive, depending on which cache level could hold the data or whether the CPU had to go all the way to RAM.
That would put gather instructions in a tough spot. It’s an optimization for the wrong side of the equation. The scalar and SIMD versions would be stuck for the same amount of time waiting for the same data.
This performance nuance has been well documented6 and discussed7, so it can never be assumed that SIMD gather instructions will be a performance boon (i.e., “worth the squeeze”).
Moreover, certain platforms, like Wasm SIMD, lack gather instructions altogether, which makes me think they were left on the cutting room floor due to a weak value proposition.
Conclusion
In my last post exploring DuckDB8, I was inspired to restructure my data in the vein of structs of arrays and observe the performance changes. With this restructuring, answering a query requires a lot of lookups to gather the relevant information. Hence the investigation into gather instructions.
DuckDB prides itself on performance, so imagine my surprise when they publicly document their avoidance of explicit SIMD9:
Does DuckDB use SIMD?
DuckDB does not use explicit SIMD (single instruction, multiple data) instructions because they greatly complicate portability and compilation. Instead, DuckDB uses implicit SIMD, where we go to great lengths to write our C++ code in such a way that the compiler can auto-generate SIMD instructions for the specific hardware. As an example why this is a good idea, it took 10 minutes to port DuckDB to the Apple Silicon architecture.
If DuckDB is relying on auto vectorization, and, as we’ve seen, the compiler is very reluctant to generate gather instructions, we can postulate that DuckDB doesn’t use them. A quick search for vpgatherdd through an objdump of DuckDB confirmed this.
So are gather instructions underrated? They can be! The benchmarks here showed the SIMD implementation performed the computation 3x faster than the scalar version. But it comes at a cost:
- The compiler won’t auto vectorize, so platform-specific SIMD is required
- Not all platforms support gather instructions
- The absolute differences in benchmarks are measured in microseconds. It is much more likely that profiling will show the bottleneck elsewhere.
- Most importantly, gather instructions are workload dependent. The same code can have different performance characteristics when given different data. I can see an argument for sacrificing a bit of performance for a bit more consistency.
- https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing) ↩︎
- A Ryzen 5900x has 768 KiB of L1 memory that is split per-core (12x) and is split in half again: one set for instructions and another for data. ↩︎
- https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5 ↩︎
- When vectorization hits the memory wall: investigating the AVX2 memory gather instruction ↩︎
- The DuckDB and Arrow rabbit hole: data modeling adventures ↩︎
Comments
If you'd like to leave a comment, please email [email protected]