Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance slowdown in sample_get_x from the checks for warnings at the end #968

Open
valassi opened this issue Aug 15, 2024 · 0 comments · May be fixed by #970 or #946
Open

Performance slowdown in sample_get_x from the checks for warnings at the end #968

valassi opened this issue Aug 15, 2024 · 0 comments · May be fixed by #970 or #946
Assignees

Comments

@valassi
Copy link
Member

valassi commented Aug 15, 2024

I am doing a few tests with sample_get_x towards vectorising it, see #963

I realised that there is a relatively large overhead there from some checks for warnings at the end. In a cuda only build of DY+3jets (CMS #943 from @choij1589) this seems to take around 15 to 20% for some subprocesses.

The question is whether these checks must be kept and in which form. I actually realised becase I removed them in my vectorization tests. This type of checks is likely to be difficult to vectorize. I agree some checks are needed, but maybe this could be reviewed.

This is these checks here (by the way there may be a minor bug as icount is not SAVEd)

c
c     Simple checks to see if we got the right point note 1e-3 corresponds
c     to the fact that the grids are required to be separated by 1e-14. Since
c     double precision is about 18 digits, we expect things to agree to
c     3 digit accuracy.
c
      if (abs(ddum(j)-xbin(x,ij))/(ddum(j)+1d-22) .gt. 1e-3) then
         if (icount .lt. 5) then
            write(*,'(a,i4,2e14.6,1e12.4)')
     &           'Warning xbin not returning correct x', ij,
     &           ddum(j),xbin(x,ij),xo
         elseif (icount .eq. 5) then
            write(*,'(a,a)')'Warning xbin still not working well. ',
     &           'Last message this iteration.'
         endif
         icount=icount+1
      endif

@oliviermattelaer any feedback? Thanks Andrea

@valassi valassi self-assigned this Aug 15, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 15, 2024
…for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 Found          997  events.
 Wrote           59  events.
 Actual xsec    5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                         :    6.0989s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1664s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0691s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    4.8443s for  1170103 events => throughput is 4.14E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1004s for    49152 events => throughput is 2.04E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1328s for    16384 events => throughput is 8.11E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.07E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0645s for    16384 events => throughput is 3.94E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1420s for  1170103 events => throughput is 1.21E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    3.5352s for 15211307 events => throughput is 2.32E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4681s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0261s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0348s for    16384 events => throughput is 2.12E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    6.0641s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0348s for    16384 events => throughput is 2.12E-06 events/s

CUDACPP_RUNTIME_DISABLECOUNTERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.5460s
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 15, 2024
…arnings in sample_get_x

Revert "[cmsdyps] in pp_dy3j.mad newgenps, (temporarely) add back the checks for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968"
This reverts commit 2aebfba.
@valassi valassi linked a pull request Aug 15, 2024 that will close this issue
@valassi valassi linked a pull request Aug 15, 2024 that will close this issue
@valassi valassi linked a pull request Aug 19, 2024 that will close this issue
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
… CUDACPP_RUNTIME_SKIPXBINCHECKS patch madgraph5#968 (on top of madgraph5#969)

This includes
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad is ok)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…cluding madgraph5#968 CUDACPP_RUNTIME_SKIPXBINCHECKS

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack

ELAPSED: 740 seconds
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…UDACPP_RUNTIME_SKIPXBINCHECKS set madgraph5#968 : big improvement!)

For the cuda backend is now, skipping xbin checks madgraph5#968
Phase space sampling in dy+3j has decreased from 78s to 53s (down by 30%) thanks to removal of xbin checks
> [GridPackCmd.launch] GRIDPCK TOTAL                    135.1144
> [madevent COUNTERS]  PROGRAM TOTAL                    130.8140s
> [madevent COUNTERS]  Fortran PhaseSpaceSampling        53.0338s for   44652395 events
> ...
> [madevent COUNTERS]  CudaCpp MEs                       35.4908s for    1769472 events
> [madevent COUNTERS]  OVERALL NON-MEs                   95.3232s
> [madevent COUNTERS]  OVERALL MEs                       35.4908s for    1769472 events

For the cuda backend was, including xbin checks but including trivial improvements madgraph5#969
Phase space sampling in dy+3j has decreased from 93s to 78s (down by 15%) thanks to removal of xbin checks
< [GridPackCmd.launch] GRIDPCK TOTAL                    160.1718
< [madevent COUNTERS]  PROGRAM TOTAL                    155.8605s
< [madevent COUNTERS]  Fortran PhaseSpaceSampling        78.1023s for   44652395 events
< ...
< [madevent COUNTERS]  CudaCpp MEs                       35.4320s for    1769472 events
< [madevent COUNTERS]  OVERALL NON-MEs                  120.4290s
< [madevent COUNTERS]  OVERALL MEs                       35.4320s for    1769472 events

For the cuda backend was in 2e59eca, without trivial improvements
< [GridPackCmd.launch] GRIDPCK TOTAL                    176.8891
< [madevent COUNTERS]  PROGRAM TOTAL                    172.6370s
< [madevent COUNTERS]  Fortran Random2Momenta            93.2907s for   44651014 events
< ...
< [madevent COUNTERS]  CudaCpp MEs                       35.4557s for    1769472 events
< [madevent COUNTERS]  OVERALL NON-MEs                  137.1806s
< [madevent COUNTERS]  OVERALL MEs                       35.4557s for    1769472 events
@valassi valassi linked a pull request Aug 22, 2024 that will close this issue
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.P1 and patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant