Authoring a SIMD-enhanced Wasm library with Rust

Chrome, Firefox, and Node LTS have all stabilized the SIMD extension to Wasm in the last few months (Safari is lagging at the time of writing; see the updated roadmap for changes). Rust, too, has recently stabilized its Wasm SIMD intrinsics. All the pieces are in place, and now is the time to start authoring libraries that take advantage of the performance that SIMD promises.

I’m biased, but I believe that Rust is the best language for authoring Wasm libraries, as Rust tends to produce the smallest and fastest Wasm payloads in the shortest amount of development time. In this post, we’ll be enhancing the Rust port of Google’s HighwayHash, which already uses SIMD on x86, by leveraging SIMD instructions on Wasm too.

Phase 1: Add Wasm Instructions

The Wasm SIMD extension adds instructions that operate on 128 bits at once. Luckily for me, the x86 SSE 4.1 implementation of HighwayHash also operates on 128-bit registers, so a good chunk of the work was simple translation. For instance, to add two 64-bit integers, the x86 _mm_add_epi64 intrinsic is equivalent to Rust’s wasm32::u64x2_add (and Wasm’s i64x2.add).
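
As a rough sketch of what such a direct translation looks like (the function name is illustrative; wasm32 refers to the core::arch::wasm32 module used throughout this post):

use core::arch::wasm32;

/// Illustrative 1:1 translation of x86's _mm_add_epi64.
#[target_feature(enable = "simd128")]
fn add_u64_lanes(a: wasm32::v128, b: wasm32::v128) -> wasm32::v128 {
    // Lane-wise wrapping addition of the two 64-bit lanes; lowers to Wasm's i64x2.add.
    wasm32::u64x2_add(a, b)
}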

But not every instruction has a 1:1 mapping between these architectures, so the first step was to compile a list of x86 instructions that I’d need to emulate. A good example of this is _mm_mul_epu32. Below is the Wasm equivalent:

/// Emulates x86's _mm_mul_epu32: multiply the low unsigned 32 bits of each
/// 64-bit lane, producing the full 64-bit products.
#[target_feature(enable = "simd128")]
fn _mm_mul_epu32(a: wasm32::v128, b: wasm32::v128) -> wasm32::v128 {
    // Mask off the high 32 bits of each 64-bit lane so that...
    let mask = wasm32::u32x4(0xFFFF_FFFF, 0, 0xFFFF_FFFF, 0);
    let lo_a_0 = wasm32::v128_and(a, mask);
    let lo_b_0 = wasm32::v128_and(b, mask);
    // ...the lane-wise 64-bit multiply sees only the low halves.
    wasm32::u64x2_mul(lo_a_0, lo_b_0)
}

No worries if your eyes glazed over while reading; I know I pored over Intel’s intrinsics guide and Rust’s wasm32 module to arrive at the above solution. Unit tests helped immensely in understanding the desired behavior, as the documentation around each instruction can be lacking. Below is one such Wasm unit test that I created after verifying results on x86.

#[wasm_bindgen_test]
fn test_mm_mul_epu32() {
    let x = wasm32::u64x2(0x0264_432C_CD8A_70E0, 0x0B28_E3EF_EBB3_172D);
    let y = wasm32::u64x2(0x0B28_E3EF_EBB3_172D, 0x0264_432C_CD8A_70E0);
    let z = _mm_mul_epu32(x, y);
    assert_eq!(as_arr(z), [0xBD3D_E006_1E19_F760, 0xBD3D_E006_1E19_F760]);
}
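
The as_arr helper in that test isn’t shown; it’s just a small convenience that pulls both 64-bit lanes out of a v128 so they can be compared. A sketch of what it might look like:

fn as_arr(v: wasm32::v128) -> [u64; 2] {
    // Extract each unsigned 64-bit lane so the value can be asserted or printed.
    [
        wasm32::u64x2_extract_lane::<0>(v),
        wasm32::u64x2_extract_lane::<1>(v),
    ]
}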

Toolchain Considerations

Due to Wasm SIMD not being ubiquitous, one must plan on supporting environments where SIMD is either enabled or disabled. A major roadblock, however, is that one can’t embed Wasm logic to selectively run SIMD workloads, as Wasm lacks dynamic feature detection. The issue goes even deeper, as the mere presence of SIMD instructions (even if never executed) can cause headaches. Trying to compile our Wasm module on Node 14 will result in failure:

failed: invalid value type 'Simd128', enable with --experimental-wasm-simd

Executing node with --experimental-wasm-simd is easy, but the message is clear: we must give downstream users the choice of whether to include these Wasm SIMD instructions. My solution is to confine usage of Wasm SIMD to a module that is conditionally compiled.

#[cfg(all(
    target_family = "wasm", 
    target_feature = "simd128"
))]
mod wasm;

The good news from this conditional compilation is that we no longer need the target_feature annotation on functions, so those will be omitted in subsequent examples. The tradeoff is that downstream users will need to remember to add a compiler flag if they want Wasm SIMD:

FLAGS="-C target-feature=+simd128"
RUSTFLAGS="$FLAGS" cargo build --target wasm32-unknown-unknown

On the testing front, tests are written with wasm-bindgen-test, and are executed with wasm-pack:

FLAGS="-C target-feature=+simd128"
RUSTFLAGS="$FLAGS" wasm-pack test --node

The biggest caveat with writing and running Wasm tests is that printing via panic! is often the most practical way to debug.
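
For instance, a throwaway test like the following (purely illustrative) is often the quickest way to inspect an intermediate value, since there’s no debugger and no convenient stdout in this harness:

#[wasm_bindgen_test]
fn dump_lanes() {
    let v = wasm32::u64x2(1, 2);
    // Crude but effective: fail the test with the value we want to inspect.
    panic!("lanes: {:x?}", as_arr(v));
}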

More Translation Examples

Before moving on to the next part, I’ve included a few more examples of how x86 and Wasm SIMD compare. A large disclaimer here is that I am not well versed in SIMD (much less Wasm SIMD), so the translations are almost certainly sub-optimal. So if something catches your eye, feel free to email me (at bottom) or, better yet, send a pull request.

One thing I’ve noticed is that I use shuffle in Wasm like a hammer and everything is a nail:

/// Shift left 8 bytes (aka _mm_bslli_si128 8)
fn _mm_slli_si128_8(a: wasm32::v128) -> wasm32::v128 {
    let zero = wasm32::u64x2(0, 0);
    wasm32::u64x2_shuffle::<1, 2>(a, zero)
}

Here comes a function for rotating bytes, which contains a couple of oddities.

/// Rotate 4 bytes to the left
fn rotate_4(a: wasm32::v128) -> wasm32::v128 {
    let ignored = wasm32::u64x2(0, 0);
    wasm32::u32x4_shuffle::<1, 0, 3, 2>(a, ignored)
}

The first oddity is that we’re allocating an unused variable, as shuffle requires two inputs even if only one is used. Perhaps swizzle would be better here; I’m not sure, as I’ve yet to set up a Wasm benchmark harness.
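
For reference, here’s a sketch of what a swizzle-based rotation might look like (untested and unbenchmarked; the name rotate_4_swizzle is illustrative, and the byte indices assume the usual little-endian lane layout):

/// Same rotation expressed as a byte swizzle: only one data input is needed,
/// at the cost of materializing a vector of byte indices.
fn rotate_4_swizzle(a: wasm32::v128) -> wasm32::v128 {
    let idx = wasm32::u8x16(4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11);
    wasm32::u8x16_swizzle(a, idx)
}

Whether that would actually beat the shuffle version is exactly the kind of question a benchmark harness would answer.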

Another oddity, and one that isn’t immediately obvious, is that the lane ordering of Wasm and x86 shuffles is reversed. The equivalent x86 function would be:

#[target_feature(enable = "sse4.1")]
pub unsafe fn rotate_4(a: __m128i) -> __m128i {
    _mm_shuffle_epi32(a, _mm_shuffle!(2, 3, 0, 1))
}

So the lane positioning went from [1, 0, 3, 2] to [2, 3, 0, 1]: the x86 shuffle macro lists lanes from highest to lowest, while Wasm’s shuffle lists them from lowest to highest. Bizarre, but something to be aware of.

Phase 2: Ergonomic JS Wrapper

Our Rust library can be published, but it’s not ergonomic in a JS environment, so the next step is to wrap it up in an npm package. If you are interested in the end result and the nitty gritty build details, you can examine highwayhasher, which has its own explanatory post. Be warned: it’s even more complicated, as that package is an isomorphic hashing API over Wasm and Node native modules, the latter of which abstracts over x86 SIMD and a portable implementation. And we’re about to make it even more complicated by adding Wasm SIMD to the mix.

Phew. Let’s break down what we need to do.

The first step is to bundle two Wasm implementations into the distributable, one with SIMD enabled and one without. We can’t have a single implementation, else older Wasm compilers can choke even if the SIMD instruction is unused. One way to achieve this is to have two separate Wasm crates, but my preference is to consolidate the logic in one place and use conditional compilation.

#[cfg(target_feature = "simd128")]
type MyHasher = highway::WasmHash;

#[cfg(not(target_feature = "simd128"))]
type MyHasher = highway::PortableHash;

Now we use wasm-pack to generate both packages, outputting them to separate directories.

wasm-pack build -t web --out-dir ../src/web web

FLAGS="-C target-feature=+simd128"
RUSTFLAGS="$FLAGS" wasm-pack build -t web --out-dir ../src/web-simd web

Except the above will fail, as wasm-pack invokes an old version of wasm-opt, released a couple of years ago, that does not understand the new SIMD extension:

[parse exception: bad local.get index (at 0:769)]
Fatal: error in parsing input

The solution is to disable the wasm-opt portion of wasm-pack:

[package.metadata.wasm-pack.profile.release]
wasm-opt = false

Newer versions of binaryen, and by extension wasm-opt, support the SIMD extension, so one can elect to invoke their own wasm-opt installed out of band.

If I can get on a soapbox for a moment: wasm-pack should have corporate sponsors. There’s a significant number of issues and updates that need tending to. Since no alternative has presented itself, every time I revisit wasm-pack I’m pained by new potholes that have arisen from neglect. It’s not a great experience, and it makes me feel silly extolling Rust for Wasm when the toolchain is suffering. Even wasm-bindgen, the glue that binds everything together, could have a looming maintenance crisis. Rust and Wasm have proven themselves to be an invaluable combination, and I believe they should get the support they deserve.

Back to our code: we now have two Wasm modules and need to choose the appropriate one to load. The check for whether SIMD is enabled is intuitive: Wasm SIMD is available if it is possible to compile the smallest Wasm program containing SIMD instructions. This is exactly what wasm-feature-detect does, and we can inline that logic with a couple of embellishments to cache the result.

let simdEnabled: boolean | undefined;
const hasSimd = () =>
  simdEnabled ??
  (simdEnabled = WebAssembly.validate(
    new Uint8Array([
      0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 10, 10,
      1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
    ])
  ));

And finally we need to import our JS-Wasm glue code. This may look different depending on the build step of your library.

import init, { WasmHighway } from "./wasm/highwayhasher_wasm";
import simdInit, {
  WasmHighway as WasmSimdHighway,
} from "./wasm-simd/highwayhasher_wasm";
import wasm from "./wasm/highwayhasher_wasm_bg.wasm";
import wasmSimd from "./wasm-simd/highwayhasher_wasm_bg.wasm";

And for completeness’ sake, here is how everything comes together:

let wasmInitialized = false;
let wasmSimdInitialized = false;
const loadWasmSimd = async () => {
  if (!wasmSimdInitialized) {
    await simdInit(wasmSimd());
    wasmSimdInitialized = true;
  }
};

const loadWasm = async () => {
  if (!wasmInitialized) {
    await init(wasm());
    wasmInitialized = true;
  }
};

export const loadModule = async () => {
  if (!hasSimd()) {
    await loadWasm();
    return WasmHighway;
  } else {
    await loadWasmSimd();
    return WasmSimdHighway;
  }
};

Now the fun stuff: benchmarking

Phase 3: Benchmarks

The highwayhasher repo contains a crude benchmark that pits the new Wasm SIMD against the scalar implementation and also against x86 SIMD via node native modules.

The result is that the performance of the SIMD implementation varies depending on the payload size:

A 3x speedup for Wasm use cases is a significant improvement, but it still falls well short of native performance. Further improvement could come from better translations of x86 instructions, or maybe all that is needed is time for browsers to further optimize their Wasm runtimes.

Cloudflare Workers vs Fastly C@E

Time to add fuel to the fire. I kid, but both Cloudflare Workers and Fastly’s Compute@Edge allow Wasm to be executed on the edge, and I’m curious whether Wasm SIMD is available on these platforms and what kind of performance one can expect.

Our goal is to have an endpoint that will ingest a request body and respond with the 64-bit hash.

On Cloudflare, my intuition led me to first write a JS worker so that I could use the Streams API to incrementally hash the body without worrying about buffering. Regrettably, while Cloudflare’s platform is detected as Wasm SIMD enabled (our hasSimd function from earlier returns true), one is not allowed to compile Wasm. The following error is returned:

Wasm code generation disallowed by embedder

This error highlights another reality check for the platform. To be fair, the fault isn’t solely Cloudflare’s. One can deploy and run Wasm on Workers in a standards compliant fashion (see my example repo of running brotli compression on JS). The first step is to upload the Wasm as separate files configured as CompiledWasm. The trick, however, is that importing Wasm on a worker returns a WebAssembly.Module, meaning that in the background, Cloudflare did the heavy lifting by compiling the Wasm and handing us the result. This is theoretically great, as it saves us from recompiling the Wasm on every request, but the developer experience needs improving, as I’m not familiar with any JS bundler that understands this behavior. I hope to create a separate post to dive into the details and see if there’s a way we can ergonomically solve these problems.

The alternative is to target their Rust SDK, which still compiles to Wasm at the end of the day, but has fewer developer experience kinks. The main tradeoff is that the entire request body is buffered in memory.

#[event(fetch)]
pub async fn main(mut req: Request, _env: Env) -> Result<Response> {
    let data = req.bytes().await?;
    let hash = WasmHash::force_new(Key::default());
    let result = hash.hash64(data.as_slice());
    Ok(Response::ok(format!("{}", result)).unwrap())
}

Since Cloudflare uses wasm-pack, we need to disable the bundled and outdated wasm-opt in our Cargo.toml:

[package.metadata.wasm-pack.profile.release]
wasm-opt = false

A deploy and a quick test later and we’re off to evaluate Fastly.

With Fastly, their JS compute platform is in beta, so we’ll use their Wasm platform and compile our Rust function to Wasm. Testing showed that, unfortunately for Fastly, their platform does not support Wasm SIMD, as publishing anything with the simd128 feature enabled results in an error:

ERROR: Function translation error
// ...
Warning: No default director found

So we have to go with the portable implementation, which at the very least is beautifully succinct.

#[fastly::main]
fn main(mut req: Request) -> Result<Response, Error> {
    let mut body = req.take_body();
    let mut hasher = PortableHash::new(Key::default());
    std::io::copy(&mut body, &mut hasher).unwrap();
    let result = hasher.finalize64();
    let resp = Response::from_status(StatusCode::OK)
        .with_body_text_plain(&format!("{}", result));
    Ok(resp)
}

After confirming Cloudflare and Fastly give the same results, I decided to post a 5 MB file and time the duration between when the request’s first byte is sent and the response is finished downloading. The informal benchmark was run several times, averaged, and then rounded. The results:

Fastly ekes out a small win, though Cloudflare has the disadvantage of buffering everything, so I’d hazard a guess that optimal deployments will result in equivalent performance.

Though if I am to be completely honest, I see room for performance improvements on both platforms. I’ve done other compute-heavy benchmarks on both, and every time I’ve been left wanting. Of course, a bit of latency is a small price to pay for globally distributed, infinitely scalable compute.

Neon through Wasm

Here’s a fun experiment. ARM Neon instructions aren’t yet stable in Rust, but when our Wasm SIMD instructions are executed on ARM, they will be translated into Neon. This raises the question: which is faster, the native module that isn’t optimized for Neon, or Wasm SIMD?

To find out, I took an AWS Graviton2 instance for a spin. And wow, I always forget how uncommon it is to see prebuilt binaries for aarch64, or how some projects flat out don’t support ARM. After several minutes of compilation I was able to rerun the benchmark and got the following results:

hashing data of size: 100000000
native 897.46 MB/s
wasm simd 1481.42 MB/s
wasm scalar 671.55 MB/s

The Wasm SIMD implementation is 65% faster than native! But what is perhaps more interesting is that the Wasm scalar implementation is only half as fast as the Wasm SIMD version instead of the 3x seen on x86. Perhaps v8 doesn’t have enough optimizations on the Wasm SIMD to Neon front.

Even though I benchmarked multiple times, take the results with a grain of salt, as benchmarking on a VPS can be unreliable.

Conclusion

Main takeaways:

- Wasm SIMD is now stable enough in Chrome, Firefox, and Node to start shipping libraries that use it.
- Conditional compilation plus runtime feature detection lets a single package ship both SIMD and scalar Wasm builds.
- On x86, the Wasm SIMD HighwayHash is roughly 3x faster than scalar Wasm, but still well short of native SIMD.
- On ARM, Wasm SIMD even outperformed the native module that lacked Neon optimizations.
- The tooling (wasm-pack and its bundled wasm-opt) still has rough edges that require workarounds.

Discussion (thank you to those who’ve shared this article):

Comments

If you'd like to leave a comment, please email [email protected]