
Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) #963

Open
valassi opened this issue Aug 12, 2024 · 7 comments · May be fixed by #970

valassi commented Aug 12, 2024

Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support). Just a placeholder.

This clearly seems to be a bottleneck in DY+3jets, see #943

valassi commented Aug 12, 2024

The profiling of x_to_f_arg is done as follows

https://github.com/valassi/madgraph4gpu/blob/56404b34aa66632befa8795c0d11715a22a3025f/epochX/cudacpp/gg_tt.mad/Source/dsample.f#L172

            CALL COUNTERS_START_COUNTER( 3, 1 ) ! FortranRandom2Momenta=3
            call x_to_f_arg(ndim,ipole,mincfig,maxcfig,ninvar,wgt,x,p)
            CALL COUNTERS_STOP_COUNTER( 3 ) ! FortranRandom2Momenta=3

https://github.com/valassi/madgraph4gpu/blob/56404b34aa66632befa8795c0d11715a22a3025f/epochX/cudacpp/gg_tt.mad/Source/dsample.f#L414

            CALL COUNTERS_START_COUNTER( 3, 1 ) ! FortranRandom2Momenta=3
            call x_to_f_arg(ndim,ipole,mincfig,maxcfig,ninvar,wgt,x,p)
            CALL COUNTERS_STOP_COUNTER( 3 ) ! FortranRandom2Momenta=3

valassi commented Aug 13, 2024

Just to give more details, a specific example is

PS Specific example #943 (comment)
https://github.com/valassi/madgraph4gpu/blob/d29e75be7fbea5f1283af60538ab2f190674bf02/epochX/cudacpp/tlau/fromgridpacks/pp_dy3j.mad/summary.txt

--------------------------------------------------------------------------------
pp_dy3j.mad//cpp512z/output.txt
[GridPackCmd.launch] GRIDPCK TOTAL 176.8891 seconds
[madevent COUNTERS]  PROGRAM TOTAL 172.637
[madevent COUNTERS]  Fortran Other 6.5768
[madevent COUNTERS]  Fortran Initialise(I/O) 4.486
[madevent COUNTERS]  Fortran Random2Momenta 93.2907
[madevent COUNTERS]  Fortran PDFs 8.2998
[madevent COUNTERS]  Fortran UpdateScaleCouplings 7.2827
[madevent COUNTERS]  Fortran Reweight 3.7045
[madevent COUNTERS]  Fortran Unweight(LHE-I/O) 4.8719
[madevent COUNTERS]  Fortran SamplePutPoint 8.2892
[madevent COUNTERS]  CudaCpp Initialise 0.3619
[madevent COUNTERS]  CudaCpp Finalise 0.0221
[madevent COUNTERS]  CudaCpp MEs 35.4557
[madevent COUNTERS]  OVERALL NON-MEs 137.181
[madevent COUNTERS]  OVERALL MEs 35.4557
--------------------------------------------------------------------------------

This shows that, for cpp512z, phase space sampling (Fortran Random2Momenta, 93.3s) takes almost three times as long as the MEs (35.5s), and the ratio will be much worse for cuda. This is a clear bottleneck, so it is important to vectorize it.

Also, the relevant function x_to_f_arg should in principle be vectorisable with lockstep processing: there does not seem to be too much if/then/else branching. (This is different from setclscales, see #964.)

valassi self-assigned this Aug 14, 2024
valassi commented Aug 14, 2024

I assign this to myself at least for some further analysis and reverse engineering.

I had a closer look with more fine-grained profiling.
See 1927c68
See 57b4e08

These profiles show that, in one subprocess within DY+3j where x_to_f_arg is the bottleneck (75% for cuda):

  • perf gives the function xbin (called within sample_get_x) at 20% and the function sample_get_x itself at 5%, so together at least 25%, i.e. 1/3 of x_to_f_arg
  • a detailed counter profile of sample_get_x (which increases the profiling overhead enormously due to the large number of calls) seems to put sample_get_x at more than half of x_to_f_arg

Overall, these two observations suggest that vectorising sample_get_x is the highest priority within the vectorisation of phase space sampling.

I then had a look by reverse engineering the code. I may be wrong, but it seems that

  • uniform ('quasi-uniform' according to code comments) random numbers in [0,1] for phase space sampling are actually generated by 'ntuple' (htuple.f) rather than ranmar as I originally thought (to do: check how randinit seeds propagate to this?)
  • function sample_get_x, i.e. the bottleneck, seems to be more about the vegas mapping from [0,1] to [0,1] (as Olivier had once described), rather than the actual random-number-to-momenta mapping (the 'rambo' physics step)
  • the actual mapping of [0,1] to momenta is done in the rest of gen_mom, which internally calls sample_get_x and is called by x_to_f_arg (I profiled gen_mom to be the largest component of x_to_f_arg anyway)

The good thing about this is that sample_get_x seems to be relatively easy to vectorize:

  • it has a moderate number of commons (the vegas grid mainly?) which can be passed as arguments
  • it internally calls only a few functions, like xbin, ntuple, transpole and untranspole
  • so all in all this is a relatively limited number of fortran lines to port
  • (one catch is that the xbin function is presently implemented with a binary tree search, which may heavily break lockstep processing... but other solutions may be found, or maybe the divergence is acceptable)
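To illustrate the lockstep concern, here is a minimal C++ sketch (hypothetical names and grid, not the actual cudacpp or dsample.f code) of a fixed-iteration, branchless bin search that could stand in for a branchy binary search in xbin: every lane performs the same number of steps regardless of the input value, so SIMD lanes and GPU threads stay in lockstep.

```cpp
#include <vector>

// Hypothetical stand-in for the vegas grid searched by xbin:
// monotonically increasing bin edges in [0,1].
// A classic binary search takes a data-dependent branch at each step,
// which diverges across SIMD lanes / GPU threads. This variant runs a
// number of iterations that depends only on the grid size, with a
// branchless update, so all lanes execute the same instruction stream.
int xbin_branchless( const std::vector<double>& edges, double x )
{
  int lo = 0;
  int len = (int)edges.size();
  while( len > 1 )
  {
    int half = len / 2;
    // Conditional increment rather than a branch: compilers typically
    // lower this to a select/cmov, which vectorises cleanly.
    lo += ( edges[lo + half] <= x ) ? half : 0;
    len -= half;
  }
  return lo; // index of the bin containing x
}
```

The data-dependent decision is reduced to a conditional increment, so the loop trip count is identical for every input and divergence disappears by construction.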

One idea for porting this could be the following:

  • sample_get_x actually internally calls ntuple, so it is a generator of numbers in [0,1] (not uniformly distributed)
  • one could then simply write a vectorized SIMD/GPU version of sample_get_x, which precomputes a large number of values into a preallocated buffer
  • the API of the sample_get_x replacement would then simply be a frontend iterator over the preallocated values (exactly like curand does internally)
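As a sketch of this frontend-iterator idea (all names here are hypothetical; the real batched backend would be the SIMD/GPU sample_get_x replacement), a scalar API can serve values one at a time from a buffer that a vectorised backend refills only when exhausted, much like curand's internal buffering:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of the proposed API: a vectorised (SIMD/GPU)
// backend fills a large buffer of sampled values in one go, and a
// scalar frontend then serves them one at a time, so existing scalar
// callers of sample_get_x would not need to change their call pattern.
class SampleBuffer
{
public:
  // 'fillBatch' stands in for the vectorised sample_get_x replacement:
  // it computes 'n' values at once into the output array.
  SampleBuffer( std::function<void( double*, std::size_t )> fillBatch, std::size_t batchSize )
    : m_fill( fillBatch ), m_buf( batchSize ), m_next( batchSize ) {}

  // Scalar frontend: refill the buffer only when it is exhausted.
  double next()
  {
    if( m_next == m_buf.size() )
    {
      m_fill( m_buf.data(), m_buf.size() );
      m_next = 0;
    }
    return m_buf[m_next++];
  }

private:
  std::function<void( double*, std::size_t )> m_fill;
  std::vector<double> m_buf;
  std::size_t m_next;
};
```

The batch size trades memory for amortisation: one vectorised backend call serves many scalar frontend calls, which is the same pattern curand uses to hide the cost of device-side generation.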

Note also, in a second step one could easily imagine replacing ntuple by curand or other SIMD/GPU random generators

Note finally, maybe a very low hanging fruit could be to reengineer the xbin function in fortran. As mentioned above, this uses a binary tree now. From a very quick look, there is nothing obviously wrong/slow done there. But maybe it can be optimized without SIMD/GPU.

(Ah, and also note: Stefan asked yesterday if Madnis may help. I thought it would not, if the problem is the [0,1] to momenta mapping; but sample_get_x is the [0,1] to [0,1] mapping, which maybe is exactly what Madnis does? So maybe Madnis would replace the whole sample_get_x, and in that case it is less urgent to try and optimise it, assuming Madnis is fast?)

Finally for completeness, a few diagrams about this possible work (assuming I understood it correctly, or for discussion with Olivier)

  • top two diagrams are old pre-GPU case, and what I thought we had done so far (mentioning ranmar and a generic madevent)
  • bottom left is a zoomed and more precise (?) description of what exists now: ntuple not ranmar, an extra [0,1] to [0,1] step in between, and all of these things embedded in one another (ntuple within sample_get_x within gen_mom)
  • bottom middle is what we could aim for initially, with a vector version of sample_get_x, whose scalar frontend is still called by gen_mom
  • bottom right is a possible next step with curand on gpu/simd instead of ntuple (and this might be a better random generator anyway)

[image: diagrams as described above]

valassi changed the title Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support) → Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) Aug 14, 2024
valassi commented Aug 14, 2024

PS if I understand correctly, ntuple in htuple.f is:

My question here (still assuming I got it right that ntuple is used; it sounds too strange, Olivier?) is whether something like CURAND would be suitable at all. Quasi-random and pseudo-random numbers are very different; maybe there was a reason why this quasi-random generator was chosen for madgraph?

valassi commented Aug 15, 2024

PS if I understand correctly, ntuple in htuple.f is:

I realised I wasted a lot of time. There are TWO versions of ntuple. One is in htuple.f, but there is also a version in ranmar.f, and I think that the ranmar version is the one actually used. So indeed everything is based on ranmar (and it correctly uses randinit, I think).

I created #967 to remove htuple.f from the generated code, as this is very confusing.

valassi commented Aug 15, 2024

I have updated the previous diagram; the explanations remain the same. But it then becomes much more conceivable to replace RANMAR by another pseudo-random generator like CURAND (though this may not necessarily be needed from a speed point of view).

[image: updated diagram as described above]

valassi commented Aug 15, 2024

Note finally, maybe a very low hanging fruit could be to reengineer the xbin function in fortran. As mentioned above, this uses a binary tree now. From a very quick look, there is nothing obviously wrong/slow done there. But maybe it can be optimized without SIMD/GPU.

Some ideas related to this are in #968 #969... some speedups without SIMD/GPU may indeed be possible
