
std.Random.shuffle: optimize for cache utilizing @prefetch input queue #24705


Open: wants to merge 4 commits into master

Conversation

@Olvilock commented Aug 5, 2025

I've been doing a simulation in which one of the steps was shuffling a large table of particles (~10^7 elements). The std.Random.shuffle method turned out to be the major contributor to overall runtime. The main reason is the inherent cache-unfriendliness of the random-access swaps.

The solution that worked for me: precompute the swap indices 32 steps ahead of time, @prefetch them, and put them into a ring buffer. Since the array was cold, prefetching achieved over a 3x speedup.

This pull request presents a hybrid approach that works well on both hot and cold arrays (with the threshold tuned for hot memory).

Benchmark results for the worst case (hot memory) are attached as screenshots (ReleaseFast; CPU: Intel i5-8300H; memory freq. 2400 MHz):
Shuffle 10000000 8-byte
Shuffle 1000000 8-byte
Shuffle 400000 8-byte
Shuffle 150000 8-byte
Shuffle 100000 8-byte
Shuffle 20000 8-byte
Shuffle 5000 8-byte
Shuffle 1000 8-byte

@Rexicon226 (Contributor)

Nice - I wonder if setting the prefetch locality to 0 would improve it further.

@Olvilock (Author) commented Aug 5, 2025

@Rexicon226 Strangely, it did not (on my machine).

@Rexicon226 (Contributor)

Not too surprising. I'm not aware of x86 having a less-temporal prefetch, and I doubt LLVM is smart enough to do anything with the information. Thanks for checking!

@andrewrk andrewrk self-requested a review August 6, 2025 18:42
@Olvilock (Author)

Is there, or might there be, a standard way of querying the cache sizes of the target machine, where applicable? In this pull request I hardcoded a constant to switch between implementations; a function of the (L3) cache size would be better suited as the choice for that constant.

@Olvilock Olvilock changed the title std.Random: optimize for cache utilizing @prefetch input queue std.Random.shuffle: optimize for cache utilizing @prefetch input queue Aug 18, 2025