Strange bad performance with [u8; 16] and union feature #379


Open
JonathanWilbur opened this issue May 24, 2025 · 6 comments

@JonathanWilbur

Forgive me if I'm making some sort of amateur mistake here. I know that when it comes to profiling, subtle mistakes can completely bias a result, but I ran these two tests with criterion, and it reports that SmallVec<[u8; 16]> is about 100x slower than a normal Vec at push().

These are my tests:

fn vec_test() {
    let mut v = Vec::with_capacity(13);
    for i in 0..12 {
        v.push(i);
    }
}

fn smallvec_test() {
    let mut v: SmallVec<[u8; 16]> = SmallVec::new();
    for i in 0..15 {
        v.push(i);
    }
}

fn bench_vec(c: &mut Criterion) {
    let mut group = c.benchmark_group("Vec");
    group.bench_function("Normal", |b| b.iter(|| vec_test()));
    group.bench_function("SmallVec", |b| b.iter(|| smallvec_test()));
    group.finish();
}

However, when I change it to SmallVec<[u8; 15]>, the smallvec is 3x faster than Vec (which is a result that I would expect).

Why would adding that single additional byte make this so slow? I am using the union feature, so I would think that 16 bytes could fit in SmallVec while keeping it the same size as a Vec. Is this a known problem? Am I making some obvious mistake?

@JonathanWilbur (Author)

Huh, it seems like disabling union fixed it (not completely, since the non-union implementation is probably just going to be a little slower):

Vec/SmallVec            time:   [20.751 ns 20.783 ns 20.823 ns]
                        change: [−72.229% −71.968% −71.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  1 (1.00%) high severe

@JonathanWilbur (Author)

JonathanWilbur commented May 24, 2025

Actually, even that appears to be slower than Vec, even though it is a lot better than the catastrophic performance when union is on.

Vec/Normal              time:   [694.64 ps 696.88 ps 699.61 ps]
                        change: [+4.2862% +4.9232% +5.7627%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Vec/SmallVec            time:   [20.751 ns 20.783 ns 20.823 ns]
                        change: [−72.229% −71.968% −71.731%] (p = 0.00 < 0.05)
                        Performance has improved.

@mbrubeck (Collaborator)

mbrubeck commented May 24, 2025

Because your vec_test function does not return anything, the compiler is optimizing it away completely. Try changing it to something like this:

fn vec_test() -> Vec<u8> {
    let mut v = Vec::with_capacity(16);
    for i in 0..12 {
        v.push(i);
    }
    v
}

fn smallvec_test() -> SmallVec<[u8; 16]> {
    let mut v: SmallVec<[u8; 16]> = SmallVec::new();
    for i in 0..12 {
        v.push(i);
    }
    v
}
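Returning the vector is one fix. As an alternative sketch using only the standard library, the result can also be routed through `std::hint::black_box` (stable since Rust 1.66), which tells the optimizer the value is observed and so keeps the loop from being eliminated as dead code:

```rust
use std::hint::black_box;

// Passing the vector through `black_box` prevents the compiler from
// deleting the whole function as dead code, even when the vector is
// never returned to the caller.
fn vec_test() -> usize {
    let mut v: Vec<u8> = Vec::with_capacity(16);
    for i in 0..12 {
        v.push(i);
    }
    black_box(&v);
    v.len()
}

fn main() {
    assert_eq!(vec_test(), 12);
}
```

criterion's own `criterion::black_box` (or `Bencher::iter` returning the value, as in the snippet above) accomplishes the same thing inside a benchmark.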

Using the above code, I get the following timings (Rust 1.87, aarch64-apple-darwin, M4):

Vec/SmallVec            time:   [481.83 ps 481.96 ps 482.07 ps] (with `union` feature)
Vec/SmallVec            time:   [619.08 ps 619.57 ps 620.06 ps] (without `union` feature)
Vec/Normal              time:   [11.728 ns 11.734 ns 11.739 ns]

@mbrubeck (Collaborator)

mbrubeck commented May 24, 2025

I can reproduce a performance cliff when pushing more than 15 items into a SmallVec<[u8; N]> (where N > 15) with the union feature enabled.

For example, pushing 14 or 15 items into a SmallVec<[u8; 17]> is fast, but pushing 16 or 17 items into the same vector is slow.

I haven't looked at the codegen to figure out why this happens, but I would guess it is hitting a threshold that prevents some optimization like loop unrolling, and perhaps this has a cascading effect on other optimizations.

@mbrubeck changed the title from "Strange bad performance with [u8; 16]" to "Strange bad performance with [u8; 16] and union feature" on May 24, 2025
@JonathanWilbur (Author)

I have a case where it seems to be slower by a really consistent 12% or so.

This benchmark

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/benches/oid.rs#L12-L14

calls

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/src/oid.rs#L530-L582

As you can see above, there is hardly any difference in the smallvec case.

The write_oid_arc macro is just this, which means that .push() should be called every time.

https://github.com/JonathanWilbur/asn1.rs/blob/0b7d210053281579e94dd8839c46f87dbfffc17b/asn1/src/utils.rs#L111-L132

I started looking into this issue because of this. I just don't get why this would be 12% slower in this real-world case when we are able to make the microbenchmarks look good. SmallVec can push a few items into its inline storage in a matter of picoseconds in the benchmarks, but in these real-world tests it takes longer than the normal Vec.

I'm sorry to hit you with a "fix my crate," but what the heck is causing this? Is this some other compiler optimization? I have spent almost the entire day trying to get one profiling tool after another to work, and when they do work, I can never "drill down" deep enough to see exactly where it is held up.

@mbrubeck (Collaborator)

mbrubeck commented May 25, 2025

On my machine (M4 Macbook Air), I get the following timings. It's not actually slower than Vec on this particular machine, but it is still weirdly slow with the union feature enabled:

create oid2             time:   [16.747 ns 16.821 ns 16.879 ns] ("smallvec" disabled)
create oid2             time:   [14.893 ns 14.896 ns 14.899 ns] ("smallvec" enabled, "smallvec/union" enabled)
create oid2             time:   [7.1507 ns 7.1611 ns 7.1724 ns] ("smallvec" enabled, "smallvec/union" disabled)

We should definitely look into the performance problems with union, but for now do you think disabling the union feature would be better for your use case?
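For reference, in smallvec 1.x the `union` feature is opt-in, so "disabling" it just means not listing it in Cargo.toml (the exact version pin here is illustrative):

```toml
[dependencies]
smallvec = "1"

# versus the variant that showed the slowdown:
# smallvec = { version = "1", features = ["union"] }
```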

As a general note, the most significant speed advantage of SmallVec is that it avoids allocation and deallocation (as long as the inline capacity is not exceeded). This means it has a big advantage in code where vecs are created and destroyed frequently. I expect the microbenchmarks we're looking at are dominated by initial allocation cost, not push speed. Or at least they should be when things are working as expected.

SmallVec::push on its own can be slower than Vec::push, since it involves an additional branch (spilled vs. non-spilled) compared to the standard Vec::push. I wrote more about possible SmallVec downsides in this old forum thread: https://users.rust-lang.org/t/when-is-it-morally-correct-to-use-smallvec/46375
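The extra branch is easiest to see in a toy sketch. This is not smallvec's actual implementation, just an illustration of the shape of the check: every push must first decide whether the data is still inline or has already spilled to the heap.

```rust
// A toy small-vector type, *not* smallvec's real code: it only
// illustrates why `push` needs one more branch than `Vec::push`.
enum TinyVec<const N: usize> {
    Inline { buf: [u8; N], len: usize },
    Spilled(Vec<u8>),
}

impl<const N: usize> TinyVec<N> {
    fn new() -> Self {
        TinyVec::Inline { buf: [0; N], len: 0 }
    }

    fn push(&mut self, x: u8) {
        match self {
            // This inline-vs-spilled check runs on every single push.
            TinyVec::Inline { buf, len } if *len < N => {
                buf[*len] = x;
                *len += 1;
            }
            TinyVec::Inline { buf, len } => {
                // Inline capacity exhausted: spill to the heap.
                let mut v = buf[..*len].to_vec();
                v.push(x);
                *self = TinyVec::Spilled(v);
            }
            TinyVec::Spilled(v) => v.push(x),
        }
    }

    fn spilled(&self) -> bool {
        matches!(self, TinyVec::Spilled(_))
    }

    fn len(&self) -> usize {
        match self {
            TinyVec::Inline { len, .. } => *len,
            TinyVec::Spilled(v) => v.len(),
        }
    }
}

fn main() {
    let mut v: TinyVec<4> = TinyVec::new();
    for i in 0..6 {
        v.push(i); // spills to the heap on the fifth push
    }
    assert_eq!(v.len(), 6);
    assert!(v.spilled());
}
```

When vecs are created and destroyed frequently and stay within inline capacity, skipping the allocation dominates; when they are long-lived and pushed to in hot loops, that per-push branch is what you pay instead.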
