
Are SIMD gather instructions underrated?
A light-bulb went off over my head when I wrote a loop that contained repetitive lookups with no conditional logic:
// pseudocode
let data = [0u32; 1024];
let indices = [0u16; 1024 * 1024];
let mut output = [0u32; 1024 * 1024];
for i in 0..output.len() {
    output[i] = data[indices[i] as usize];
}
That for-loop is the poster child for AVX2 gather instructions1, as they allow us to look up eight 32-bit values at a time instead of one by one.
SIMD by hand
The first step: write the SIMD version by hand to know what we’re up against.
Let’s envision we are looking up 32-bit data via 16-bit indices:
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const CHUNK_SIZE: usize = 8; // AVX2 processes 8x32-bit integers

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub unsafe fn gather(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    // Widen the eight 16-bit indices into a vector of 32-bit lanes
    // (_mm256_set_epi32 takes its arguments from the highest lane to the lowest).
    let indices = _mm256_set_epi32(
        indices[7] as i32, indices[6] as i32,
        indices[5] as i32, indices[4] as i32,
        indices[3] as i32, indices[2] as i32,
        indices[1] as i32, indices[0] as i32,
    );

    // Each index is scaled by the element size (4 bytes) when forming the address.
    const SCALE: i32 = std::mem::size_of::<u32>() as i32;
    let gathered_data = _mm256_i32gather_epi32::<SCALE>(
        data.as_ptr() as *const i32,
        indices,
    );

    // Write the eight gathered values back out to the output chunk.
    _mm256_storeu_si256(
        output.as_mut_ptr() as *mut __m256i,
        gathered_data,
    );
}
When compiled with:
RUSTFLAGS="-C target-feature=+avx2"
The code yields remarkably terse assembly:
example::gather::h761ab51858afe6cb:
vpmovzxwd ymm0, xmmword ptr [rsi]
vpcmpeqd ymm1, ymm1, ymm1
vpxor xmm2, xmm2, xmm2
vpgatherdd ymm2, dword ptr [rdi + 4*ymm0], ymm1
vmovdqu ymmword ptr [rdx], ymm2
vzeroupper
ret
If only there were a SIMD instruction for gathering with 16-bit indices instead of upcasting them to 32-bit, we’d be able to process 16 elements at once instead of 8.
In an application benchmark, the SIMD gather instructions performed 3x better than the scalar version, filling 1 million elements of the output array in 500μs vs 1500μs. That’s a big enough relative improvement for me to pursue this idea further.
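For context, here is a minimal sketch of the kind of driver loop behind those numbers, assuming the gather kernel above. The timing harness and buffer handling here are stand-ins for illustration, not the actual application benchmark.
use std::time::Instant;

// Minimal driver sketch (not the real benchmark): run the 8-wide `gather`
// kernel across a large index buffer and time it. Any tail shorter than
// CHUNK_SIZE is ignored for brevity.
fn bench_gather(data: &[u32; 1024], indices: &[u16], output: &mut [u32]) {
    assert_eq!(indices.len(), output.len());
    let start = Instant::now();
    for (idx, out) in indices
        .chunks_exact(CHUNK_SIZE)
        .zip(output.chunks_exact_mut(CHUNK_SIZE))
    {
        // SAFETY: requires AVX2 and indices that all fall within `data`.
        unsafe {
            gather(*data, idx.try_into().unwrap(), out.try_into().unwrap());
        }
    }
    println!("gathered {} elements in {:?}", output.len(), start.elapsed());
}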
Auto vectorization
I like to avoid maintaining platform-specific code, so perhaps we can structure our Rust code in such a way that the compiler will emit SIMD gather instructions for us.
Attempt #1:
pub fn autovec(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = data[indices[i] as usize];
    }
}
Let’s take a look at the assembly:
Autovec assembly:
example::autovec::hdd8d5a79acffac38:
mov rax, rdi
movzx edi, word ptr [rsi]
cmp rdi, 1024
jae .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx], ecx
movzx edi, word ptr [rsi + 2]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 4], ecx
movzx edi, word ptr [rsi + 4]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 8], ecx
movzx edi, word ptr [rsi + 6]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 12], ecx
movzx edi, word ptr [rsi + 8]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 16], ecx
movzx edi, word ptr [rsi + 10]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 20], ecx
movzx edi, word ptr [rsi + 12]
cmp rdi, 1023
ja .LBB0_9
mov ecx, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 24], ecx
movzx edi, word ptr [rsi + 14]
cmp rdi, 1023
ja .LBB0_9
mov eax, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 28], eax
ret
.LBB0_9:
push rax
lea rdx, [rip + .Lanon.5ebe7e4dbf161fdbe4e81627694922b4.1]
mov esi, 1024
call qword ptr [rip + core::panicking::panic_bounds_check::h1a9bf3d94de0fc80@GOTPCREL]
It’s a lot of assembly, but that is mainly due to the unrolled loop and the conditional jumps that lead to a panic. The panics are Rust’s checked indexing at play, making sure we don’t access invalid memory. Oftentimes there are ways to write Rust such that it can elide bounds checks, but since we are indexing with a dynamic value, the compiler can’t safely elide them here.
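As an aside, one common elision trick (a hypothetical sketch, not what this post benchmarks) is to mask the index so the compiler can prove it is always in range. Note that this silently wraps out-of-range indices instead of panicking, so it changes behavior:
// Hypothetical sketch: because the table length (1024) is a power of two,
// masking with 1023 keeps the index in 0..1024, so the compiler can drop the
// bounds check. Out-of-range indices wrap around instead of panicking.
pub fn masked(
    data: &[u32; 1024],
    indices: &[u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = data[indices[i] as usize & 1023];
    }
}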
How do gather instructions handle invalid indices? From the gather instructions Wikipedia page:
faults occurring from invalid memory accesses by masked-out elements are suppressed
I don’t know exactly how masked values translate into runtime behavior, but perhaps Rust will have better luck with auto vectorization if we pinky promise to the compiler that our indices don’t need bounds checks:
Attempt #2:
pub fn unchecked(
    data: [u32; 1024],
    indices: [u16; CHUNK_SIZE],
    output: &mut [u32; CHUNK_SIZE],
) {
    for i in 0..CHUNK_SIZE {
        output[i] = unsafe { *data.get_unchecked(indices[i] as usize) };
    }
}
The assembly shows that the conditional jumps have been removed, leaving only the unrolled loop:
Unchecked assembly:
example::unchecked::ha0f0c770e0e99451:
movzx eax, word ptr [rsi]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx], eax
movzx eax, word ptr [rsi + 2]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 4], eax
movzx eax, word ptr [rsi + 4]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 8], eax
movzx eax, word ptr [rsi + 6]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 12], eax
movzx eax, word ptr [rsi + 8]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 16], eax
movzx eax, word ptr [rsi + 10]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 20], eax
movzx eax, word ptr [rsi + 12]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 24], eax
movzx eax, word ptr [rsi + 14]
mov eax, dword ptr [rdi + 4*rax]
mov dword ptr [rdx + 28], eax
ret
The unchecked version performed slightly better than our first attempt, but still fell far short of the gather instructions.
We can pretty confidently say that Rust will not auto vectorize gather instructions.
Okay, that’s not exactly true: LLVM documents that gather operations are vectorizable2, but with the caveat that vectorizing them is rarely beneficial. There is a Stack Overflow discussion on this very topic3.
Memory latency
We can build an intuition for why the compiler won’t auto vectorize, and it starts with the application itself, which is the best-case scenario for gather instructions.
The data that we are looking up is 1024 32-bit elements – that’s 4 KiB of memory, which fits comfortably within the 32 KiB L1 data cache on one of my Ryzen 5900x cores.4
An L1 cache access takes 4 CPU cycles on Zen 3 CPUs5. Back-of-the-napkin math shows that 1 million L1 accesses, consolidated into 125k gather instructions, should require about 500k CPU cycles, or roughly 100μs when running at 4.8 GHz. This matched the timings measured when memory writes were removed from the application.
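Spelled out, that estimate looks like this (assumed numbers only: 8 lanes per gather, a 4-cycle L1 load, a 4.8 GHz clock):
// Back-of-the-napkin estimate using the assumed numbers above.
fn napkin_math() {
    let lookups = 1_000_000.0_f64;
    let lanes_per_gather = 8.0; // one AVX2 gather covers 8 lookups
    let l1_latency_cycles = 4.0; // Zen 3 L1 load-to-use latency
    let clock_hz = 4.8e9;

    let gathers = lookups / lanes_per_gather; // 125,000 gather instructions
    let cycles = gathers * l1_latency_cycles; // ~500,000 cycles
    let micros = cycles / clock_hz * 1e6; // ~104 microseconds
    println!("{gathers} gathers, {cycles} cycles, ~{micros:.0} us");
}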
If our data were bigger and the indices randomly distributed, each access could be 10-100x more expensive, depending on which cache level could hold the data or whether the CPU had to go all the way to RAM.
That would put gather instructions in a tough spot. It’s an optimization for the wrong side of the equation. The scalar and SIMD versions would be stuck for the same amount of time waiting for the same data.
This performance nuance has been well documented6 and discussed7, so it can never be assumed that SIMD gather instructions will be a performance boon (i.e., “worth the squeeze”).
Moreover, certain platforms, like Wasm SIMD, lack gather instructions altogether, which makes me think they were left on the cutting room floor due to a weak value proposition.
Conclusion
In my last post exploring DuckDB8, I was inspired to restructure my data in the vein of structs of arrays and observe the performance changes. With this restructuring, answering a query requires a lot of lookups to gather the relevant information. Hence the investigation into gather instructions.
DuckDB prides itself on performance, so imagine my surprise when they publicly document their avoidance of explicit SIMD9:
Does DuckDB use SIMD?
DuckDB does not use explicit SIMD (single instruction, multiple data) instructions because they greatly complicate portability and compilation. Instead, DuckDB uses implicit SIMD, where we go to great lengths to write our C++ code in such a way that the compiler can auto-generate SIMD instructions for the specific hardware. As an example why this is a good idea, it took 10 minutes to port DuckDB to the Apple Silicon architecture.
If DuckDB is relying on auto vectorization, and, as we’ve seen, the compiler is very reluctant to generate gather instructions, we can postulate that DuckDB doesn’t use them. A quick search for vpgatherdd through an objdump of DuckDB confirmed this.
So are gather instructions underrated? They can be! The benchmarks here showed the SIMD implementation performed the computation 3x faster than the scalar version. But it comes at a cost:
- The compiler won’t auto vectorize, so platform-specific SIMD is required
- Not all platforms support gather instructions
- The absolute differences in benchmarks are measured in microseconds. It is much more likely that profiling will show the bottleneck elsewhere.
- Most importantly, gather instructions are workload dependent. The same code can have different performance characteristics when given different data. I can see an argument for sacrificing a bit of performance for a bit more consistency.
- https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing) ↩︎
- A Ryzen 5900x has 768 KiB of L1 memory that is split per-core (12x) and is split in half again: one set for instructions and another for data. ↩︎
- https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5 ↩︎
- When vectorization hits the memory wall: investigating the AVX2 memory gather instruction ↩︎
- The DuckDB and Arrow rabbit hole: data modeling adventures ↩︎
Comments
If you'd like to leave a comment, please email [email protected]