
Parallelize effect application in heterogeneous delay mode over additional CUDA blocks #269

Open
denisalevi opened this issue Feb 15, 2022 · 1 comment

Comments

@denisalevi
Member

The effect application in our Brunel Hakim benchmark with heterogeneous delays is very inefficient when the connectivity matrix is not partitioned (blocks = 1). But spike propagation is very efficient in that case (for large network sizes with many spikes per dt).

Here is a figure from the benchmark. Yellow: effect application, red: spike propagation, blue: neurons. The top bar for Brian2CUDA shows the 1-block setting.

[figure: fig3-brunel]

So it would be great if effect application performed better without connectivity matrix partitioning; then we could just use 1 block and be done with it.

The reason effect application is inefficient is that we use only 1 CUDA block per connectivity matrix partition to apply the synaptic effects in the current spike queue. For the blocks = 1 setting, that means we use only 1 CUDA block in total. But this effect application could easily be parallelized further.

We should choose the number of CUDA blocks per connectivity matrix partition based on the total number of partitions, such that the total number of CUDA blocks is as high as possible while staying below the maximal number of active CUDA blocks per SM. Since the number of synapses/bundles in the current spike queue varies, it might make sense to read the spike queue sizes (just as we read the number of spiking neurons) and choose the kernel dimensions accordingly.

This would likely also be beneficial for smaller networks with variable bundle sizes, since we could ignore idle threads per bundle (for smaller bundles) if everything is executed in parallel anyway. A sketch of what the launch logic could look like follows below.
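A minimal sketch of reading the queue sizes and choosing the kernel dimensions from them. All names here (`apply_effects_kernel`, `launch_effect_application`, `d_queue_sizes`, `max_active_blocks`) are hypothetical placeholders, not existing Brian2CUDA identifiers:

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Effect application with several CUDA blocks per connectivity matrix
// partition instead of one (sketch).
__global__ void apply_effects_kernel(int blocks_per_partition /*, ... */)
{
    // Map each CUDA block to a (partition, sub-block) pair.
    int partition = blockIdx.x / blocks_per_partition;
    int sub_block = blockIdx.x % blocks_per_partition;
    // The sub-blocks of one partition would stride over the
    // synapses/bundles queued for that partition, e.g.:
    // for (int i = sub_block * blockDim.x + threadIdx.x; i < queue_size;
    //      i += blocks_per_partition * blockDim.x) { /* apply effect */ }
}

void launch_effect_application(const int* d_queue_sizes, int num_partitions,
                               int max_active_blocks, int threads)
{
    // Small per-time-step transfer: one queue size per partition.
    std::vector<int> h_sizes(num_partitions);
    cudaMemcpy(h_sizes.data(), d_queue_sizes, num_partitions * sizeof(int),
               cudaMemcpyDeviceToHost);
    int max_queue_size = *std::max_element(h_sizes.begin(), h_sizes.end());

    // As many blocks per partition as the active-block limit allows, but
    // no more than needed to give every queued item its own thread.
    int blocks_per_partition = std::max(1, max_active_blocks / num_partitions);
    int needed = (max_queue_size + threads - 1) / threads;
    blocks_per_partition = std::min(blocks_per_partition, std::max(1, needed));

    apply_effects_kernel<<<blocks_per_partition * num_partitions, threads>>>(
        blocks_per_partition);
}
```

The per-step device-to-host copy is one integer per partition, so its latency should be negligible compared to the effect-application kernel itself.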

@denisalevi
Member Author

I had a quick look; this shouldn't be too hard. No time for it right now, instructions are here:

```cpp
__device__ void set_size_address(volatile size_type* size)
{
    // TODO:
    // 1. Modify m_size to be a pointer.
    // 2. Change its address in here (not sure if the method needs to be
    //    'volatile'?).
    // 3. In spikequeue.h, after initializing the queues, declare a
    //    volatile array with one entry per queue and change each queue's
    //    m_size parameter with this function here.
    // 4. In synapses.cu, get the queue size by indexing this new array
    //    with current_offset and Memcpy it to host to set the correct
    //    kernel dimensions.
    m_size = size;  // store the new address (assumes step 1 is done)
}
```
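A hedged sketch of what steps 3 and 4 could look like. `CudaSpikeQueue`, `size_type`, and the variable names here are assumptions about the surrounding code (the queues are assumed to live in device memory, so the rewiring happens in a kernel):

```cpp
#include <cuda_runtime.h>
// CudaSpikeQueue and size_type as declared in spikequeue.h (assumed).

// Step 3: set_size_address() is a __device__ method, so each queue's
// m_size pointer has to be rewired from a kernel after queue init.
__global__ void redirect_queue_sizes(CudaSpikeQueue* queues, int num_queues,
                                     volatile size_type* queue_sizes)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int i = 0; i < num_queues; i++)
            queues[i].set_size_address(&queue_sizes[i]);
}

// Called once after the queues are initialized (spikequeue.h).
volatile size_type* setup_queue_sizes(CudaSpikeQueue* d_queues, int num_queues)
{
    size_type* d_sizes;
    cudaMalloc((void**)&d_sizes, num_queues * sizeof(size_type));
    redirect_queue_sizes<<<1, 1>>>(d_queues, num_queues, d_sizes);
    return d_sizes;
}

// Step 4: in synapses.cu, read the current queue's size back to the host
// before each launch and use it to set the kernel dimensions.
size_type current_queue_size(volatile size_type* d_sizes, int current_offset)
{
    size_type h_size;
    cudaMemcpy(&h_size, const_cast<size_type*>(d_sizes + current_offset),
               sizeof(size_type), cudaMemcpyDeviceToHost);
    return h_size;
}
```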

Started working on this in the heterog-delays-parallel-effects branch.
