Strange bad performance with [u8; 16] and union feature #379


Open
JonathanWilbur opened this issue May 24, 2025 · 6 comments

@JonathanWilbur

Forgive me if I'm making some sort of amateur mistake here. I know that when it comes to profiling, subtle mistakes can completely bias a result, but I ran these two tests with criterion, and it reports that SmallVec<[u8; 16]> is about 100x slower than a normal Vec at push().

These are my tests:

fn vec_test() {
    let mut v = Vec::with_capacity(13);
    for i in 0..12 {
        v.push(i);
    }
}

fn smallvec_test() {
    let mut v: SmallVec<[u8; 16]> = SmallVec::new();
    for i in 0..15 {
        v.push(i);
    }
}

fn bench_vec(c: &mut Criterion) {
    let mut group = c.benchmark_group("Vec");
    group.bench_function("Normal", |b| b.iter(|| vec_test()));
    group.bench_function("SmallVec", |b| b.iter(|| smallvec_test()));
    group.finish();
}

However, when I change it to SmallVec<[u8; 15]>, the smallvec is 3x faster than Vec (which is a result that I would expect).

Why would adding that single additional byte make this so slow? I am using the union feature, so I would think that 16 bytes could fit in SmallVec while keeping it the same size as a Vec. Is this a known problem? Am I making some obvious mistake?

@JonathanWilbur (Author)

Huh, it seems like disabling union fixed it (not completely, since the non-union implementation is probably just going to be a little slower):

Vec/SmallVec            time:   [20.751 ns 20.783 ns 20.823 ns]
                        change: [−72.229% −71.968% −71.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  1 (1.00%) high severe

@JonathanWilbur (Author)

JonathanWilbur commented May 24, 2025

Actually, even that appears to be slower than Vec, even though it is a lot better than the catastrophic performance when union is on.

Vec/Normal              time:   [694.64 ps 696.88 ps 699.61 ps]
                        change: [+4.2862% +4.9232% +5.7627%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Vec/SmallVec            time:   [20.751 ns 20.783 ns 20.823 ns]
                        change: [−72.229% −71.968% −71.731%] (p = 0.00 < 0.05)
                        Performance has improved.

@mbrubeck (Collaborator)

mbrubeck commented May 24, 2025

Because your vec_test function does not return anything, the compiler is optimizing it away completely. Try changing it to something like this:

fn vec_test() -> Vec<u8> {
    let mut v = Vec::with_capacity(16);
    for i in 0..12 {
        v.push(i);
    }
    v
}

fn smallvec_test() -> SmallVec<[u8; 16]> {
    let mut v: SmallVec<[u8; 16]> = SmallVec::new();
    for i in 0..12 {
        v.push(i);
    }
    v
}
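Returning the vector is one fix. As an alternative sketch using only the standard library, the result can also be routed through `std::hint::black_box` (stable since Rust 1.66), which tells the optimizer the value is observed and so keeps the loop from being eliminated as dead code:

```rust
use std::hint::black_box;

// Passing the vector through `black_box` prevents the compiler from
// deleting the whole function as dead code, even when the vector is
// never returned to the caller.
fn vec_test() -> usize {
    let mut v: Vec<u8> = Vec::with_capacity(16);
    for i in 0..12 {
        v.push(i);
    }
    black_box(&v);
    v.len()
}

fn main() {
    assert_eq!(vec_test(), 12);
}
```

criterion's own `criterion::black_box` (or `Bencher::iter` returning the value, as in the snippet above) accomplishes the same thing inside a benchmark.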

Using the above code, I get the following timings (Rust 1.87, aarch64-apple-darwin, M4):

Vec/SmallVec            time:   [481.83 ps 481.96 ps 482.07 ps] (with `union` feature)
Vec/SmallVec            time:   [619.08 ps 619.57 ps 620.06 ps] (without `union` feature)
Vec/Normal              time:   [11.728 ns 11.734 ns 11.739 ns]

@mbrubeck (Collaborator)

mbrubeck commented May 24, 2025

I can reproduce a performance cliff when pushing more than 15 items into a SmallVec<[u8; N]> (where N > 15) with the union feature enabled.

For example, pushing 14 or 15 items into a SmallVec<[u8; 17]> is fast, but pushing 16 or 17 items into the same vector is slow.

I haven't looked at the codegen to figure out why this happens, but I would guess it is hitting a threshold that prevents some optimization like loop unrolling, and perhaps this has a cascading effect on other optimizations.

@mbrubeck changed the title from "Strange bad performance with [u8; 16]" to "Strange bad performance with [u8; 16] and union feature" on May 24, 2025
@JonathanWilbur (Author)

I have a case where it seems to be slower by a really consistent 12% or so.

This benchmark

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/benches/oid.rs#L12-L14

calls

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/src/oid.rs#L530-L582

As you can see above, there is hardly any difference in the smallvec case.

The write_oid_arc macro is just this, which means that .push() should be called every time.

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/src/utils.rs#L111-L132

I started looking into this issue because of this. I just don't get why this would be 12% slower in this real-world case when we are able to make the microbenchmarks look good. SmallVec can push a few items into its inline storage in a matter of picoseconds in the benchmarks, but in these real-world tests it takes longer than the normal Vec.

I'm sorry to hit you with a "fix my crate," but what the heck is causing this? Is this some other compiler optimization? I have spent almost the entire day trying to get one profiling tool after another to work, and when they do work, I can never "drill down" deep enough to see exactly where it is held up.

@mbrubeck (Collaborator)

mbrubeck commented May 25, 2025

On my machine (M4 Macbook Air), I get the following timings. It's not actually slower than Vec on this particular machine, but it is still weirdly slow with the union feature enabled:

create oid2             time:   [16.747 ns 16.821 ns 16.879 ns] ("smallvec" disabled)
create oid2             time:   [14.893 ns 14.896 ns 14.899 ns] ("smallvec" enabled, "smallvec/union" enabled)
create oid2             time:   [7.1507 ns 7.1611 ns 7.1724 ns] ("smallvec" enabled, "smallvec/union" disabled)

We should definitely look into the performance problems with union, but for now do you think disabling the union feature would be better for your use case?
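For reference, in smallvec 1.x the `union` feature is opt-in, so "disabling" it just means not listing it in Cargo.toml (the exact version pin here is illustrative):

```toml
[dependencies]
smallvec = "1"

# versus the variant that showed the slowdown:
# smallvec = { version = "1", features = ["union"] }
```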

As a general note, the most significant speed advantage of SmallVec is that it avoids allocation and deallocation (as long as the inline capacity is not exceeded). This means it has a big advantage in code where vecs are created and destroyed frequently. I expect the microbenchmarks we're looking at are dominated by initial allocation cost, not push speed. Or at least they should be when things are working as expected.

SmallVec::push on its own can be slower than Vec::push, since it involves an additional branch (spilled vs. non-spilled) compared to the standard Vec::push. I wrote more about possible SmallVec downsides in this old forum thread: https://users.rust-lang.org/t/when-is-it-morally-correct-to-use-smallvec/46375
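The extra branch is easiest to see in a toy sketch. This is not smallvec's actual implementation, just an illustration of the shape of the check: every push must first decide whether the data is still inline or has already spilled to the heap.

```rust
// A toy small-vector type, *not* smallvec's real code: it only
// illustrates why `push` needs one more branch than `Vec::push`.
enum TinyVec<const N: usize> {
    Inline { buf: [u8; N], len: usize },
    Spilled(Vec<u8>),
}

impl<const N: usize> TinyVec<N> {
    fn new() -> Self {
        TinyVec::Inline { buf: [0; N], len: 0 }
    }

    fn push(&mut self, x: u8) {
        match self {
            // This inline-vs-spilled check runs on every single push.
            TinyVec::Inline { buf, len } if *len < N => {
                buf[*len] = x;
                *len += 1;
            }
            TinyVec::Inline { buf, len } => {
                // Inline capacity exhausted: spill to the heap.
                let mut v = buf[..*len].to_vec();
                v.push(x);
                *self = TinyVec::Spilled(v);
            }
            TinyVec::Spilled(v) => v.push(x),
        }
    }

    fn spilled(&self) -> bool {
        matches!(self, TinyVec::Spilled(_))
    }

    fn len(&self) -> usize {
        match self {
            TinyVec::Inline { len, .. } => *len,
            TinyVec::Spilled(v) => v.len(),
        }
    }
}

fn main() {
    let mut v: TinyVec<4> = TinyVec::new();
    for i in 0..6 {
        v.push(i); // spills to the heap on the fifth push
    }
    assert_eq!(v.len(), 6);
    assert!(v.spilled());
}
```

When vecs are created and destroyed frequently and stay within inline capacity, skipping the allocation dominates; when they are long-lived and pushed to in hot loops, that per-push branch is what you pay instead.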
