
Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) #963

Open
valassi opened this issue Aug 12, 2024 · 7 comments · May be fixed by #970

valassi commented Aug 12, 2024

Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support). Just a placeholder.

This clearly seems to be a bottleneck in DY+3jets, see #943

valassi commented Aug 12, 2024

The profiling of x_to_f_arg is done as follows

https://github.com/valassi/madgraph4gpu/blob/56404b34aa66632befa8795c0d11715a22a3025f/epochX/cudacpp/gg_tt.mad/Source/dsample.f#L172

            CALL COUNTERS_START_COUNTER( 3, 1 ) ! FortranRandom2Momenta=3
            call x_to_f_arg(ndim,ipole,mincfig,maxcfig,ninvar,wgt,x,p)
            CALL COUNTERS_STOP_COUNTER( 3 ) ! FortranRandom2Momenta=3

https://github.com/valassi/madgraph4gpu/blob/56404b34aa66632befa8795c0d11715a22a3025f/epochX/cudacpp/gg_tt.mad/Source/dsample.f#L414

            CALL COUNTERS_START_COUNTER( 3, 1 ) ! FortranRandom2Momenta=3
            call x_to_f_arg(ndim,ipole,mincfig,maxcfig,ninvar,wgt,x,p)
            CALL COUNTERS_STOP_COUNTER( 3 ) ! FortranRandom2Momenta=3

valassi commented Aug 13, 2024

Just to give more details, a specific example is

PS Specific example #943 (comment)
https://github.com/valassi/madgraph4gpu/blob/d29e75be7fbea5f1283af60538ab2f190674bf02/epochX/cudacpp/tlau/fromgridpacks/pp_dy3j.mad/summary.txt

--------------------------------------------------------------------------------
pp_dy3j.mad//cpp512z/output.txt
[GridPackCmd.launch] GRIDPCK TOTAL 176.8891 seconds
[madevent COUNTERS]  PROGRAM TOTAL 172.637
[madevent COUNTERS]  Fortran Other 6.5768
[madevent COUNTERS]  Fortran Initialise(I/O) 4.486
[madevent COUNTERS]  Fortran Random2Momenta 93.2907
[madevent COUNTERS]  Fortran PDFs 8.2998
[madevent COUNTERS]  Fortran UpdateScaleCouplings 7.2827
[madevent COUNTERS]  Fortran Reweight 3.7045
[madevent COUNTERS]  Fortran Unweight(LHE-I/O) 4.8719
[madevent COUNTERS]  Fortran SamplePutPoint 8.2892
[madevent COUNTERS]  CudaCpp Initialise 0.3619
[madevent COUNTERS]  CudaCpp Finalise 0.0221
[madevent COUNTERS]  CudaCpp MEs 35.4557
[madevent COUNTERS]  OVERALL NON-MEs 137.181
[madevent COUNTERS]  OVERALL MEs 35.4557
--------------------------------------------------------------------------------

This shows that, for cpp512z, phase space sampling (Fortran Random2Momenta, 93.3s) takes almost three times as long as the MEs (35.5s), and the ratio will be much worse for cuda. This is a clear bottleneck, so it is important to vectorize it.

Also, the relevant function x_to_f_arg should in principle be vectorisable with lockstep processing: there does not seem to be too much if/then/else branching. (This is different from setclscales, see #964.)

valassi self-assigned this Aug 14, 2024
valassi commented Aug 14, 2024

I assign this to myself at least for some further analysis and reverse engineering.

I had a closer look with more fine-grained profiling.
See 1927c68
See 57b4e08

These profiles show that, in one subprocess within DY+3j where x_to_f_arg is the bottleneck (75% for cuda):

  • perf gives the function xbin (called within sample_get_x) at 20% and the function sample_get_x itself at 5%, so together at least 25%, i.e. 1/3 of x_to_f_arg
  • a detailed counter profile of sample_get_x (which increases the profiling overhead enormously due to the large number of calls) seems to put sample_get_x at more than half of x_to_f_arg

Overall, these two observations suggest that vectorising sample_get_x is the highest priority within the vectorisation of phase space sampling.

I then had a look by reverse engineering the code. I may be wrong, but it seems that

  • uniform ('quasi-uniform' according to code comments) random numbers in [0,1] for phase space sampling are actually generated by 'ntuple' (htuple.f) rather than ranmar as I originally thought (to do: check how randinit seeds propagate to this?)
  • function sample_get_x, i.e. the bottleneck, seems to be more about the vegas mapping from [0,1] to [0,1] (as Olivier had once described), rather than the actual random-number-to-momenta mapping (the 'rambo' physics step)
  • the actual mapping of [0,1] to momenta is done in the rest of gen_mom, which internally calls sample_get_x and is called by x_to_f_arg (I profiled gen_mom to be the largest component of x_to_f_arg anyway)

The good thing about this is that sample_get_x seems to be relatively easy to vectorize:

  • it has a moderate number of commons (the vegas grid mainly?) which can be passed as arguments
  • it internally calls only a few functions, like xbin, ntuple, transpole and untranspole
  • so all in all this is a relatively limited number of fortran lines to port
  • (one catch is that the xbin function is presently implemented with a binary tree search, which may heavily break lockstep processing... but other solutions may be found, or maybe the divergence is acceptable)
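To illustrate the lockstep concern, here is a minimal C++ sketch (hypothetical names and grid, not the actual cudacpp or dsample.f code) of a fixed-iteration, branchless bin search that could stand in for a branchy binary search in xbin: every lane performs the same number of steps regardless of the input value, so SIMD lanes and GPU threads stay in lockstep.

```cpp
#include <vector>

// Hypothetical stand-in for the vegas grid searched by xbin:
// monotonically increasing bin edges in [0,1].
// A classic binary search takes a data-dependent branch at each step,
// which diverges across SIMD lanes / GPU threads. This variant runs a
// number of iterations that depends only on the grid size, with a
// branchless update, so all lanes execute the same instruction stream.
int xbin_branchless( const std::vector<double>& edges, double x )
{
  int lo = 0;
  int len = (int)edges.size();
  while( len > 1 )
  {
    int half = len / 2;
    // Conditional increment rather than a branch: compilers typically
    // lower this to a select/cmov, which vectorises cleanly.
    lo += ( edges[lo + half] <= x ) ? half : 0;
    len -= half;
  }
  return lo; // index of the bin containing x
}
```

The data-dependent decision is reduced to a conditional increment, so the loop trip count is identical for every input and divergence disappears by construction.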

One idea for porting this could be the following:

  • sample_get_x actually internally calls ntuple, so it is a generator of numbers in [0,1] (not uniformly distributed)
  • one could then simply write a vectorized SIMD/GPU version of sample_get_x, which precomputes a large number of values into a preallocated buffer
  • the API of the sample_get_x replacement would then simply be a frontend iterator over the preallocated values (exactly like curand does internally)
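As a sketch of this frontend-iterator idea (all names here are hypothetical; the real batched backend would be the SIMD/GPU sample_get_x replacement), a scalar API can serve values one at a time from a buffer that a vectorised backend refills only when exhausted, much like curand's internal buffering:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of the proposed API: a vectorised (SIMD/GPU)
// backend fills a large buffer of sampled values in one go, and a
// scalar frontend then serves them one at a time, so existing scalar
// callers of sample_get_x would not need to change their call pattern.
class SampleBuffer
{
public:
  // 'fillBatch' stands in for the vectorised sample_get_x replacement:
  // it computes 'n' values at once into the output array.
  SampleBuffer( std::function<void( double*, std::size_t )> fillBatch, std::size_t batchSize )
    : m_fill( fillBatch ), m_buf( batchSize ), m_next( batchSize ) {}

  // Scalar frontend: refill the buffer only when it is exhausted.
  double next()
  {
    if( m_next == m_buf.size() )
    {
      m_fill( m_buf.data(), m_buf.size() );
      m_next = 0;
    }
    return m_buf[m_next++];
  }

private:
  std::function<void( double*, std::size_t )> m_fill;
  std::vector<double> m_buf;
  std::size_t m_next;
};
```

The batch size trades memory for amortisation: one vectorised backend call serves many scalar frontend calls, which is the same pattern curand uses to hide the cost of device-side generation.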

Note also, in a second step one could easily imagine replacing ntuple by curand or other SIMD/GPU random generators

Note finally, maybe a very low hanging fruit could be to reengineer the xbin function in fortran. As mentioned above, this uses a binary tree now. From a very quick look, there is nothing obviously wrong/slow done there. But maybe it can be optimized without SIMD/GPU.

(Ah, and also note: Stefan asked yesterday if Madnis may help. I thought it would not, if the problem is the [0,1] to momenta mapping; but sample_get_x is the [0,1] to [0,1] mapping, which maybe is exactly what Madnis does? So maybe Madnis would replace the whole sample_get_x, and in that case it is less urgent to try and optimise it, assuming Madnis is fast?)

Finally for completeness, a few diagrams about this possible work (assuming I understood it correctly, or for discussion with Olivier)

  • top two diagrams are old pre-GPU case, and what I thought we had done so far (mentioning ranmar and a generic madevent)
  • bottom left is a zoomed and more precise (?) description of what exists now: ntuple not ranmar, an extra [0,1] to [0,1] step in between, and all of these things embedded in one another (ntuple within sample_get_x within gen_mom)
  • bottom middle is what we could aim for initially, with a vector version of sample_get_x, whose scalar frontend is still called by gen_mom
  • bottom right is a possible next step with curand on gpu/simd instead of ntuple (and this might be a better random generator anyway)

[image: diagrams as described above]

valassi changed the title Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support) → Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) Aug 14, 2024
valassi commented Aug 14, 2024

PS if I understand correctly, ntuple in htuple.f is:

My question here (still assuming I got it right that ntuple is used; it sounds too strange, Olivier?) is whether something like CURAND would be suitable at all. Quasi-random and pseudo-random numbers are very different; maybe there was a reason why this quasi-random generator was chosen for madgraph?

valassi commented Aug 15, 2024

PS if I understand correctly, ntuple in htuple.f is:

I realised I wasted a lot of time. There are TWO versions of ntuple. One is in htuple.f, but there is also a version in ranmar.f, and I think that the ranmar version is the one actually used. So indeed everything is based on ranmar (and it correctly uses randinit, I think).

I created #967 to remove htuple.f from the generated code, as this is very confusing.

valassi commented Aug 15, 2024

I have updated the previous diagram; the explanations remain the same. But it then becomes much more conceivable to replace RANMAR by another pseudo-random generator like CURAND (though this may not necessarily be needed from a speed point of view).

[image: updated diagram as described above]

valassi commented Aug 15, 2024

Note finally, maybe a very low hanging fruit could be to reengineer the xbin function in fortran. As mentioned above, this uses a binary tree now. From a very quick look, there is nothing obviously wrong/slow done there. But maybe it can be optimized without SIMD/GPU.

Some ideas related to this are in #968 #969... some speedups without SIMD/GPU may indeed be possible
