Performance slowdown in sample_get_x from the checks for warnings at the end #968
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 15, 2024:
…for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
[COUNTERS] PROGRAM TOTAL                      : 6.0989s
[COUNTERS] Fortran Other ( 0 )                : 0.1664s
[COUNTERS] Fortran Initialise(I/O) ( 1 )      : 0.0691s
[COUNTERS] Fortran Random2Momenta ( 3 )       : 4.8443s for 1170103 events => throughput is 4.14E-06 events/s
[COUNTERS] Fortran PDFs ( 4 )                 : 0.1004s for 49152 events => throughput is 2.04E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1328s for 16384 events => throughput is 8.11E-06 events/s
[COUNTERS] Fortran Reweight ( 6 )             : 0.0504s for 16384 events => throughput is 3.07E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 )    : 0.0645s for 16384 events => throughput is 3.94E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 )       : 0.1420s for 1170103 events => throughput is 1.21E-07 events/s
[COUNTERS] PROGRAM SampleGetX ( 10 )          : 3.5352s for 15211307 events => throughput is 2.32E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 )          : 0.4681s
[COUNTERS] CudaCpp Finalise ( 12 )            : 0.0261s
[COUNTERS] CudaCpp MEs ( 19 )                 : 0.0348s for 16384 events => throughput is 2.12E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 )             : 6.0641s
[COUNTERS] OVERALL MEs ( 22 )                 : 0.0348s for 16384 events => throughput is 2.12E-06 events/s
```

```
CUDACPP_RUNTIME_DISABLECOUNTERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.5460s
```
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 15, 2024:
…arnings in sample_get_x

Revert "[cmsdyps] in pp_dy3j.mad newgenps, (temporarely) add back the checks for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968"

This reverts commit 2aebfba.
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
… CUDACPP_RUNTIME_SKIPXBINCHECKS patch madgraph5#968 (on top of madgraph5#969)

This includes:
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are:
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

```
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
```

(Later checked that regenerating gg_tt.mad is ok.)
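For readers who want the mechanics: a runtime gate of this kind could look roughly like the Fortran sketch below. This is an illustration only; the subroutine and variable names are hypothetical, not the actual dsample.f patch. The only assumption is that the flag is read once with the standard GET_ENVIRONMENT_VARIABLE intrinsic and cached across calls.

```fortran
      SUBROUTINE XBIN_GATE_SKETCH()
C     Hypothetical illustration (not the actual madgraph4gpu patch):
C     read CUDACPP_RUNTIME_SKIPXBINCHECKS once, cache the result, and
C     use it to bypass the end-of-routine xbin consistency checks.
      IMPLICIT NONE
      LOGICAL SKIPXBINCHECKS, FIRST
      CHARACTER*32 ENVVAL
      INTEGER ENVLEN, ENVSTAT
      SAVE SKIPXBINCHECKS, FIRST
      DATA FIRST/.TRUE./
      IF (FIRST) THEN
C       STATUS=0 means the variable exists in the environment
        CALL GET_ENVIRONMENT_VARIABLE(
     &    'CUDACPP_RUNTIME_SKIPXBINCHECKS', ENVVAL, ENVLEN, ENVSTAT)
        SKIPXBINCHECKS = (ENVSTAT.EQ.0 .AND. ENVLEN.GT.0)
        FIRST = .FALSE.
      ENDIF
      IF (.NOT.SKIPXBINCHECKS) THEN
C       ... perform the xbin consistency checks here ...
        CONTINUE
      ENDIF
      END
```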
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…cluding madgraph5#968 CUDACPP_RUNTIME_SKIPXBINCHECKS

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack
ELAPSED: 740 seconds
```
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…UDACPP_RUNTIME_SKIPXBINCHECKS set madgraph5#968 : big improvement!)

For the cuda backend, the result is now (skipping xbin checks, madgraph5#968): phase space sampling in dy+3j has decreased from 78s to 53s (down by 30%) thanks to the removal of the xbin checks.

```
> [GridPackCmd.launch] GRIDPCK TOTAL 135.1144
> [madevent COUNTERS] PROGRAM TOTAL 130.8140s
> [madevent COUNTERS] Fortran PhaseSpaceSampling 53.0338s for 44652395 events
> ...
> [madevent COUNTERS] CudaCpp MEs 35.4908s for 1769472 events
> [madevent COUNTERS] OVERALL NON-MEs 95.3232s
> [madevent COUNTERS] OVERALL MEs 35.4908s for 1769472 events
```

For the cuda backend, the result was previously (still including the xbin checks, but with the trivial improvements of madgraph5#969): phase space sampling in dy+3j had decreased from 93s to 78s (down by 15%) thanks to those trivial improvements.

```
< [GridPackCmd.launch] GRIDPCK TOTAL 160.1718
< [madevent COUNTERS] PROGRAM TOTAL 155.8605s
< [madevent COUNTERS] Fortran PhaseSpaceSampling 78.1023s for 44652395 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4320s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 120.4290s
< [madevent COUNTERS] OVERALL MEs 35.4320s for 1769472 events
```

For the cuda backend in 2e59eca, without the trivial improvements, the result was:

```
< [GridPackCmd.launch] GRIDPCK TOTAL 176.8891
< [madevent COUNTERS] PROGRAM TOTAL 172.6370s
< [madevent COUNTERS] Fortran Random2Momenta 93.2907s for 44651014 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4557s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 137.1806s
< [madevent COUNTERS] OVERALL MEs 35.4557s for 1769472 events
```
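As a quick arithmetic check of the two quoted percentages, computed from the PhaseSpaceSampling timings above (both figures are rounded):

```latex
% Savings from skipping the xbin checks (78.1023s -> 53.0338s),
% quoted as "down by 30%"; and savings from the trivial madgraph5#969
% improvements (93.2907s -> 78.1023s), quoted as "down by 15%".
\[
  1 - \frac{53.0338}{78.1023} \approx 0.32, \qquad
  1 - \frac{78.1023}{93.2907} \approx 0.16
\]
```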
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.P1 and patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
I am doing a few tests with sample_get_x towards vectorising it, see #963.

I realised that there is a relatively large overhead there from some checks for warnings at the end. In a cuda-only build of DY+3jets (CMS #943 from @choij1589) these checks seem to take around 15 to 20% of the time for some subprocesses.

The question is whether these checks must be kept, and in which form. I actually noticed this because I had removed them in my vectorisation tests: this type of check is likely to be difficult to vectorise. I agree that some checks are needed, but maybe this could be reviewed.

These are the checks here (by the way, there may be a minor bug, as icount is not SAVEd).
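On the icount remark: under the older Fortran 77 rules, a DATA-initialised local variable that is later modified is not guaranteed to retain its value between calls unless it is SAVEd (more recent standards, and most compilers in practice, do imply SAVE for DATA-initialised variables). Below is a schematic sketch of the kind of throttled warning check being discussed, with the SAVE fix applied; the names and the check shape are illustrative, not the actual dsample.f code.

```fortran
      SUBROUTINE WARN_SKETCH(X, XLOW, XHIGH)
C     Schematic sketch (illustrative, not the actual dsample.f code) of
C     a throttled end-of-routine warning check, with the SAVE fix for
C     ICOUNT applied.
      IMPLICIT NONE
      DOUBLE PRECISION X, XLOW, XHIGH
      INTEGER ICOUNT
      SAVE ICOUNT
      DATA ICOUNT/0/
C     Without SAVE, ICOUNT (modified after its DATA initialisation) is
C     not guaranteed by Fortran 77 rules to persist between calls.
      IF (X.LT.XLOW .OR. X.GT.XHIGH) THEN
C       Print at most a few warnings, then stay silent
        IF (ICOUNT.LT.5) THEN
          ICOUNT = ICOUNT + 1
          WRITE(*,*) 'WARNING: x outside its bin boundaries', X
        ENDIF
      ENDIF
      END
```

Stateful, branchy bookkeeping of this kind is also exactly the sort of pattern that is hard to vectorise, which is consistent with the point above.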
@oliviermattelaer, any feedback? Thanks, Andrea