Don't keep box muller transform state between kernel launches #649
This builds on the new hybrid HIP/CUDA backend from #647 so review that first!
As described in #648, beyond its (arguably excessive) 192 bits of state, the curand XORWOW RNG used to provide randomness for neuron and custom connectivity updates also stores 160 bits of Box-Muller transform state in the `curandState` struct (BM draws two uniform numbers and produces two normally distributed values, so this state is used to cache one of those results for subsequent calls to `curand_normal`).

In this PR, when using the CUDA and HIP backends, we create our own `XORWowStateInternal` struct in definitions.h without the BM state and store this in memory. At the start of the neuron and custom connectivity update kernels, we copy the fields from the `XORWowStateInternal` struct into a local `curandState` and, at the end, we copy them back.

Excitingly, because we are very memory bandwidth bound, this makes the neuron kernel on the cortical microcircuit model about 60% faster. On my A5000 (running for 1 second):