Default musl allocator considered harmful (to performance)

TLDR: In a real world benchmark, the default musl allocator caused a 7x slowdown compared to other allocators. I recommend all Rust projects immediately add the following lines to their application’s main.rs:

// Avoid musl's default allocator due to lackluster performance
// https://nickb.dev/blog/default-musl-allocator-considered-harmful-to-performance
#[cfg(target_env = "musl")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

And in Cargo.toml:

[target.'cfg(target_env = "musl")'.dependencies]
mimalloc = "0.1.43"

The root cause is the contention between multiple threads when allocating memory, so the problem worsens as more threads or allocations are created.

I recommend swapping the allocator even if musl is not a compilation target today or if the program is single threaded. This is something you simply don’t want to forget.

Reader’s choice on what allocator they want to sub in. The code snippets use mimalloc, but jemalloc is also good.
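For example, if jemalloc is more your speed, one way to wire it up is through the tikv-jemallocator crate (a sketch; the crate choice and version are my assumption, so adjust to taste):

// Alternative sketch: jemalloc via the tikv-jemallocator crate
#[cfg(target_env = "musl")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

with a matching tikv-jemallocator entry under the same [target.'cfg(target_env = "musl")'.dependencies] table in Cargo.toml.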

Also reader’s choice whether allocator substitution should be restricted to musl environments (as shown) or done globally. I don’t mind conditionally compiling dependencies here, as it serves as another form of documentation.

Why musl?

If I feel so strongly about avoiding the default musl allocator, why even use musl in the first place?

Well, when an important customer is running a version of Red Hat Linux initially released before I started high school, you can bet that there will be glibc issues if you don’t have a build machine with the same version of Red Hat Linux. Corollary: hats off to Red Hat for supporting their distro releases for such a lengthy period of time.

So this is a love-hate relationship with musl. I love cross compiling and creating static executables that I can scp anywhere and just have everything work. I will continue to use musl and respect the hard work of the team behind it.

And while Docker image size should never be a deciding factor, it can be tantalizing to leverage a 2MB distroless image with a static musl build to minimize any possibility of cold starts for scalable workloads (though if the executable is much larger than 2MB, then the difference between container images is negligible, so your mileage may vary).

REPOSITORY                              IMAGE ID       SIZE
gcr.io/distroless/cc-debian12           6f09ff5d0af8   23.4MB
gcr.io/distroless/base-nossl-debian12   ae4cc24e698d   14.8MB
gcr.io/distroless/base-debian12         fab58a7ef52e   20.7MB
gcr.io/distroless/static-debian12       5d7d2b425607   1.99MB

Scene of the crime

I first noticed performance issues when a server was processing data slower than my host machine, which needed to fetch the data over a 1 Gbit/s connection.

Here’s my thought process as I homed in:

  • The server has an older CPU, and older CPUs are typically slower (true, but not 7x slower).
  • Maybe the server’s Proxmox VM has a CPU type of kvm64 (which I’ve been guilty of before), as it excludes SIMD instructions.
  • Maybe musl’s CPU feature detection is wonky and not selecting the SIMD-enabled code? (Nope, not true.) I wish there was a more ergonomic way to see if SIMD is being executed (a rough runtime feature check is sketched after this list).
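As a rough sanity check, something along these lines prints what Rust’s runtime feature detection reports. It only tells you what is detected, not whether SIMD code paths are actually executed, so treat it as a sketch rather than proof:

// Rough sanity check: print which x86 features Rust's runtime detection reports.
// Detection only; this says nothing about whether SIMD code paths actually run.
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("sse4.2: {}", std::arch::is_x86_feature_detected!("sse4.2"));
        println!("avx2:   {}", std::arch::is_x86_feature_detected!("avx2"));
    }
}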

Embarrassingly, it took me an hour before I ran the glibc and musl versions side by side on the host machine and on the server, and found that the glibc build was much faster in both cases. To be fair, 45 minutes of that hour was spent fiddling with a C++ dependency’s build system to test if the CPU feature detection was different on musl vs glibc (I never did figure it out).

I created a reduced benchmark on the host machine and compared the following:

# glibc
/usr/bin/time -v ./target/release/compare

# musl
/usr/bin/time -v ./target/x86_64-unknown-linux-musl/release/compare

Below is a comparison table of just the important metrics:

                                 glibc     musl
User time (seconds)              1.31      2.72
System time (seconds)            0.37      6.13
Percent of CPU this job got      943%      745%
Elapsed time (seconds)           0.17      1.18
Voluntary context switches       1196      199786
Involuntary context switches     191       794

The musl version took 7x longer! But I didn’t know why. All I knew was that the difference in voluntary context switches stood out: the musl build had 167x more of them! At 200k switches per second, we’re in thrashing territory.

My first instinct was to profile the executables with callgrind and visualize them with kcachegrind.
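For reference, that workflow is roughly the following (the output file name includes the process id, so the one below is a placeholder):

# Emit a callgrind profile (expect the run to be much slower than native)
valgrind --tool=callgrind ./target/release/compare

# Open the resulting profile in kcachegrind
kcachegrind callgrind.out.<pid>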

I was disappointed to see the cycle estimate results were within the margin of error of each other. How can that be when the run times were so different? Perhaps this is a case of observer effect due to how much slower apps run under valgrind. Or maybe this information is just not captured in callgrind. Looks like I have some homework to hone my profiling tools.

Next, I looked at syscalls, a common source of context switching.

strace -c ./target/release/compare

Both versions had the same number of syscalls. The main difference is that the musl version spent 6.7 seconds in futex calls while glibc spent only 0.5 seconds, a 13x penalty! This means there must be some contention for a shared lock in musl when allocating or de-allocating memory from multiple threads.
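To build an intuition for what that contention feels like, here is a toy sketch of an allocator that funnels every allocation and deallocation through a single lock. This is not how musl is actually implemented, but it captures the shape of the problem: with many threads, most of the time is spent waiting on the lock instead of doing work.

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::Mutex;

// Toy illustration only: every alloc/dealloc takes the same global lock,
// serializing all threads the way a heavily contended malloc does.
struct SerializedAlloc(Mutex<()>);

unsafe impl GlobalAlloc for SerializedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let _guard = self.0.lock().unwrap();
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let _guard = self.0.lock().unwrap();
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: SerializedAlloc = SerializedAlloc(Mutex::new(()));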

And to be honest, this is where I reached the end of my investigative skills; I simply verified that swapping the allocator fixed the performance issues.

This is not news

Just searching for “musl performance issues” will yield a myriad of results. It must be a rite of passage for programmers to stumble upon this performance pitfall.

  • Chimera Linux uses musl with mimalloc, noting

    the stock allocator is the primary reason for nearly all performance issues people generally have with musl

  • The folks at tweag.io documented that musl’s malloc contention results in a 20x slowdown
  • Projects like Ripgrep and Apache DataFusion needed to swap out the allocator (or abandon musl).
  • Chainguard reported a 2-4x slowdown with musl
  • Lots of reddit threads and github issues

There is a wide range of reported slowdowns, from 2x to 20x. Such a wide range is due to how many threads are contending for the memory allocator in the application. For applications that don’t allocate much or have fewer parallel threads, the slowdown is not as drastic.

The 7x slowdown I observed was on a 6 core machine. Since the slowdown is correlated with the amount of contention, I decided to do an experiment.

I ran down to the local VPS provider, nabbed a 48 core machine, and crushed its dreams of running anything meaningful with the following benchmark.

fn main() {
    // One worker thread per available core (falling back to 8 if detection fails)
    let num_threads = std::thread::available_parallelism().map_or(8, |x| x.get());

    let mut handles = vec![];
    for _ in 0..num_threads {
        let handle = std::thread::spawn(move || {
            let mut counter = 0;
            for _ in 0..100000 {
                // Allocate a fresh, ever-growing buffer each iteration to hammer the allocator
                let data = vec![1u8; counter];
                // Touch the buffer so the allocation isn't optimized away
                counter += usize::from(data.get(100).copied().unwrap_or(1));
            }
            println!("counter: {}", counter);
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }
}

Using the not-so-scientific measurement of time, the glibc build (with its default allocator) yielded:

real    0m0.169s
user    0m7.680s
sys     0m0.025s

While the musl build yielded:

real    1m56.890s
user    1m0.522s
sys     7m3.542s

That’s nearly a 700x slowdown. Going from a blink of an eye to having time to get up and stretch, and we’re just getting started. 48 cores is now only a mid-tier instance at AWS, as one can rent instances with 192 cores.

I also ran the benchmark on an 8 core machine from the same VPS provider and came away with the following learnings:

  • Running the musl benchmark across 6x more cores resulted in a 4x slowdown.
  • The stock GNU allocator took the same amount of time whether on 8 cores or 48 cores. Nicely done!

I find a 700x difference in synthetic workloads and a 7x difference in application performance stemming from the memory allocator to be mind boggling.

A skill issue?

Andrew Kelley, Zig’s creator, brings up a great discussion point that the underperformance might not be such a big issue to experienced programmers:

[musl’s allocator] under performs a lot

It’s true. The funny thing is when you’re a beginner Zig programmer you need a good GPA [General Purpose Allocator] for performance reasons but you lack the skills to write an allocator implementation.

However, as you become an advanced programmer you start to learn about better memory management techniques that makes GPA performance irrelevant. Ironically, at this point you become capable of implementing a better GPA, but it’s low-key kind of useful that GPA’s poor performance helps highlight where you’re not batch allocating objects.

That’s the Zig Malloc Paradox.

Andrew is, of course, completely right. If I had used musl’s allocator from the start, the poor performance would have caused me to structure the code base to minimize allocations by flattening data structures and facilitating object reuse.
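A minimal sketch of what that restructuring can look like (the function and data here are hypothetical, purely to illustrate reusing a buffer instead of allocating per item):

// Hypothetical example: reuse one scratch buffer across the loop instead of
// allocating a fresh Vec for every record.
fn count_newlines(records: &[&[u8]]) -> usize {
    let mut scratch = Vec::new();
    let mut total = 0;
    for record in records {
        scratch.clear(); // keeps the existing capacity
        scratch.extend_from_slice(record); // allocates only when capacity must grow
        total += scratch.iter().filter(|&&b| b == b'\n').count();
    }
    total
}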

Or would I have abandoned the attempt and switched to another language or project? Unfortunately, I can’t rewind time, and at this point I’m reluctant to introduce breaking changes, as the code is already fast enough under most circumstances and relied upon by others.

While there is probably a correlation between the number of allocations and experience, I would not use this as a measure of success.

And in an ironic twist, I’m working on a Rust library in a domain that differentiates itself by minimizing allocations through ergonomic sacrifices. Is it worth it? To me it is, as the performance benefits are tangible, but to others where the library may slot in as a tertiary afterthought, ergonomics would be more highly valued.

Musl’s new mallocng

I got excited when I read a reddit post from May 2020 about how an upcoming redesign of musl’s allocator should bring improvements.

Later that same year, musl released it in v1.2.1, but the release notes didn’t spell out any performance implications. Perhaps some of them are insinuating a performance improvement?

In May 2023, Rust bumped the musl target to 1.2.x, and cross followed suit in October.

There hasn’t (yet) been a cross release that includes this change (much to the chagrin of some), but no worries, I can cargo install from git.
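Something along the lines of:

cargo install cross --git https://github.com/cross-rs/cross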

The excitement had built to a crescendo.

But the benchmark results remained unchanged.

I’m not the only one to have noticed this.

the new ng allocator in MUSL doesn’t make a dime of a difference

In the end, no matter what musl allocator you are using, I recommend switching to a different one as shown at the start.

I love musl, but now I know it needs a little extra something to go with it, and that would be another allocator.
