Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) #963
Comments
The profiling of x_to_f_arg is done as follows
Just to give more details, a specific example is in #943 (comment).
This shows that sampling takes almost three times as long as the MEs for cpp512z (and it will be much worse for cuda). This is a clear bottleneck, so it is important to vectorise it. Also, the relevant function x_to_f_arg should in principle be vectorisable with lockstep processing: there does not seem to be too much if/then/else branching. (This is different from setclscales, see #964.)
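As an illustration of the lockstep idea only (this is not the actual x_to_f_arg code; the event layout and the transform below are made up), a branchless loop over a batch of events, with the random numbers stored contiguously per event, is exactly the kind of loop a compiler can auto-vectorise with SIMD or that maps to one GPU thread per event:

```cpp
// Minimal lockstep sketch (hypothetical layout and transform, not madgraph code):
// every event applies the same branchless [0,1] -> [0,1] remapping, so the loop
// can be auto-vectorised with SIMD or run with one GPU thread per event.
#include <cstdio>
#include <vector>

void mapBatch( const double* rIn, double* xOut, int nevt )
{
  for( int ievt = 0; ievt < nevt; ievt++ ) // no branches, unit stride: vectorisable
  {
    const double r = rIn[ievt];
    xOut[ievt] = r * r * ( 3. - 2. * r ); // smoothstep warp as a stand-in transform
  }
}

int main()
{
  const int nevt = 8;
  std::vector<double> r( nevt ), x( nevt );
  for( int i = 0; i < nevt; i++ ) r[i] = ( i + 0.5 ) / nevt;
  mapBatch( r.data(), x.data(), nevt );
  for( int i = 0; i < nevt; i++ ) printf( "r=%5.3f -> x=%5.3f\n", r[i], x[i] );
  return 0;
}
```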
I assign this to myself, at least for some further analysis and reverse engineering. I had a closer look with more fine-grained profiling. These profiles show that, in one subprocess within DY+3j where x_to_f_arg is the bottleneck (75% for cuda):
Overall, these two observations suggest that vectorising sample_get_x is the highest priority within the vectorisation of phase space sampling. I then had a look by reverse engineering the code. I may be wrong, but it seems that
The good thing about this is that sample_get_x seems to be relatively easily vectorisable.
One idea for porting this could be the following:
Note also that, in a second step, one could easily imagine replacing ntuple by curand or other SIMD/GPU random generators.

Note finally that a very low hanging fruit could be to reengineer the xbin function in Fortran: as mentioned above, this uses a binary tree now. From a very quick look, there is nothing obviously wrong/slow done there, but maybe it can be optimised even without SIMD/GPU (see the sketch after this comment).

(Ah, and also note: Stefan asked yesterday if Madnis may help. I thought it would not, if the problem were the [0,1] to momenta mapping; but sample_get_x is the [0,1] to [0,1] mapping, which maybe is exactly what Madnis does? In that case Madnis would replace the whole sample_get_x, and it would be less urgent to try and optimise it, assuming Madnis is fast.)

Finally, for completeness, a few diagrams about this possible work (assuming I understood it correctly, or for discussion with Olivier).
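For the xbin discussion, here is a hypothetical sketch (not the actual Fortran xbin; the grid, values and helper name are made up) of the underlying operation: mapping a uniform random number in [0,1] to a bin of a monotonic cumulative grid is a binary search, and the per-event work is branch-light enough to be done in lockstep over a batch of events.

```cpp
// Hypothetical xbin-like lookup: map a uniform random r in [0,1] to a grid bin via
// binary search over a monotonic cumulative grid. NOT the actual madgraph xbin;
// it only illustrates that the per-event work is a simple search.
#include <algorithm>
#include <cstdio>
#include <vector>

int xbinSketch( const std::vector<double>& grid, double r ) // grid[0]=0 ... grid[n]=1
{
  auto it = std::upper_bound( grid.begin(), grid.end(), r );  // binary search
  return static_cast<int>( it - grid.begin() ) - 1;           // bin i with grid[i] <= r < grid[i+1]
}

int main()
{
  const std::vector<double> grid{ 0.0, 0.1, 0.35, 0.7, 1.0 }; // made-up cumulative grid
  for( double r : { 0.05, 0.2, 0.5, 0.9 } )
    printf( "r=%4.2f -> bin %d\n", r, xbinSketch( grid, r ) );
  return 0;
}
```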
PS if I understand correctly, ntuple in htuple.f is:
My question here (still assuming I got it right that ntuple is used - it sounds too strange, Olivier?) is whether something like CURAND would be suitable at all. Quasi-random and pseudo-random are very different; maybe there was a reason why this quasi-random generator was chosen for madgraph?
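For reference, cuRAND does provide quasi-random generators (Sobol sequences) in addition to the pseudo-random ones, so the quasi vs pseudo distinction would not by itself rule it out. A minimal host-API sketch follows; the batch size and dimensionality are made up, and how the numbers would actually be consumed by the sampler is not addressed here.

```cpp
// Minimal cuRAND host-API sketch (batch size and dimensionality are hypothetical):
// fill a device buffer with quasi-random (Sobol) uniform numbers in the unit interval.
// Link with -lcurand -lcudart.
#include <cuda_runtime.h>
#include <curand.h>
#include <cstdio>
#include <vector>

int main()
{
  const int nevt = 16384; // hypothetical number of events per batch
  const int ndim = 10;    // hypothetical number of sampling dimensions per event

  double* dRnd = nullptr;
  cudaMalloc( &dRnd, nevt * ndim * sizeof( double ) );

  curandGenerator_t gen;
  curandCreateGenerator( &gen, CURAND_RNG_QUASI_SOBOL64 );       // quasi-random Sobol sequence
  curandSetQuasiRandomGeneratorDimensions( gen, ndim );          // one Sobol dimension per sampling dimension
  curandGenerateUniformDouble( gen, dRnd, (size_t)nevt * ndim ); // fill the device buffer

  std::vector<double> hRnd( nevt * ndim );
  cudaMemcpy( hRnd.data(), dRnd, nevt * ndim * sizeof( double ), cudaMemcpyDeviceToHost );
  printf( "first quasi-random numbers: %f %f %f\n", hRnd[0], hRnd[1], hRnd[2] );

  curandDestroyGenerator( gen );
  cudaFree( dRnd );
  return 0;
}
```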
I realised I wasted a lot of time: there are TWO versions of htuple.f. I created #967 to remove htuple.f from the generated code; this is very confusing.
Some ideas related to this are in #968 and #969... some speedups without SIMD/GPU may indeed be possible.
Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support). Just a placeholder.
This clearly seems to be a bottleneck in DY+3jets; see #943.