Performance slowdown in sample_get_x from the checks for warnings at the end #968
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 15, 2024:
…for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
[COUNTERS] PROGRAM TOTAL                      : 6.0989s
[COUNTERS] Fortran Other ( 0 )                : 0.1664s
[COUNTERS] Fortran Initialise(I/O) ( 1 )      : 0.0691s
[COUNTERS] Fortran Random2Momenta ( 3 )       : 4.8443s for 1170103 events => throughput is 4.14E-06 events/s
[COUNTERS] Fortran PDFs ( 4 )                 : 0.1004s for 49152 events => throughput is 2.04E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1328s for 16384 events => throughput is 8.11E-06 events/s
[COUNTERS] Fortran Reweight ( 6 )             : 0.0504s for 16384 events => throughput is 3.07E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 )    : 0.0645s for 16384 events => throughput is 3.94E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 )       : 0.1420s for 1170103 events => throughput is 1.21E-07 events/s
[COUNTERS] PROGRAM SampleGetX ( 10 )          : 3.5352s for 15211307 events => throughput is 2.32E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 )          : 0.4681s
[COUNTERS] CudaCpp Finalise ( 12 )            : 0.0261s
[COUNTERS] CudaCpp MEs ( 19 )                 : 0.0348s for 16384 events => throughput is 2.12E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 )             : 6.0641s
[COUNTERS] OVERALL MEs ( 22 )                 : 0.0348s for 16384 events => throughput is 2.12E-06 events/s
```

```
CUDACPP_RUNTIME_DISABLECOUNTERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.5460s
```
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 15, 2024:
…arnings in sample_get_x

Revert "[cmsdyps] in pp_dy3j.mad newgenps, (temporarely) add back the checks for warnings in sample_get_x: indeed this gives a large slowdown madgraph5#968"

This reverts commit 2aebfba.
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
… CUDACPP_RUNTIME_SKIPXBINCHECKS patch madgraph5#968 (on top of madgraph5#969)

This includes:
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are:
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

```
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
```

(Later checked that regenerating gg_tt.mad is ok.)
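For readers who want the mechanics: a runtime gate of this kind could look roughly like the Fortran sketch below. This is an illustration only; the subroutine and variable names are hypothetical, not the actual dsample.f patch. The only assumption is that the flag is read once with the standard GET_ENVIRONMENT_VARIABLE intrinsic and cached across calls.

```fortran
      SUBROUTINE XBIN_GATE_SKETCH()
C     Hypothetical illustration (not the actual madgraph4gpu patch):
C     read CUDACPP_RUNTIME_SKIPXBINCHECKS once, cache the result, and
C     use it to bypass the end-of-routine xbin consistency checks.
      IMPLICIT NONE
      LOGICAL SKIPXBINCHECKS, FIRST
      CHARACTER*32 ENVVAL
      INTEGER ENVLEN, ENVSTAT
      SAVE SKIPXBINCHECKS, FIRST
      DATA FIRST/.TRUE./
      IF (FIRST) THEN
C       STATUS=0 means the variable exists in the environment
        CALL GET_ENVIRONMENT_VARIABLE(
     &    'CUDACPP_RUNTIME_SKIPXBINCHECKS', ENVVAL, ENVLEN, ENVSTAT)
        SKIPXBINCHECKS = (ENVSTAT.EQ.0 .AND. ENVLEN.GT.0)
        FIRST = .FALSE.
      ENDIF
      IF (.NOT.SKIPXBINCHECKS) THEN
C       ... perform the xbin consistency checks here ...
        CONTINUE
      ENDIF
      END
```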
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…cluding madgraph5#968 CUDACPP_RUNTIME_SKIPXBINCHECKS

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack
ELAPSED: 740 seconds
```
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…UDACPP_RUNTIME_SKIPXBINCHECKS set madgraph5#968 : big improvement!)

For the cuda backend, the result is now (skipping xbin checks, madgraph5#968): phase space sampling in dy+3j has decreased from 78s to 53s (down by 30%) thanks to the removal of the xbin checks.

```
> [GridPackCmd.launch] GRIDPCK TOTAL 135.1144
> [madevent COUNTERS] PROGRAM TOTAL 130.8140s
> [madevent COUNTERS] Fortran PhaseSpaceSampling 53.0338s for 44652395 events
> ...
> [madevent COUNTERS] CudaCpp MEs 35.4908s for 1769472 events
> [madevent COUNTERS] OVERALL NON-MEs 95.3232s
> [madevent COUNTERS] OVERALL MEs 35.4908s for 1769472 events
```

For the cuda backend, the result was previously (still including the xbin checks, but with the trivial improvements of madgraph5#969): phase space sampling in dy+3j had decreased from 93s to 78s (down by 15%) thanks to those trivial improvements.

```
< [GridPackCmd.launch] GRIDPCK TOTAL 160.1718
< [madevent COUNTERS] PROGRAM TOTAL 155.8605s
< [madevent COUNTERS] Fortran PhaseSpaceSampling 78.1023s for 44652395 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4320s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 120.4290s
< [madevent COUNTERS] OVERALL MEs 35.4320s for 1769472 events
```

For the cuda backend in 2e59eca, without the trivial improvements, the result was:

```
< [GridPackCmd.launch] GRIDPCK TOTAL 176.8891
< [madevent COUNTERS] PROGRAM TOTAL 172.6370s
< [madevent COUNTERS] Fortran Random2Momenta 93.2907s for 44651014 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4557s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 137.1806s
< [madevent COUNTERS] OVERALL MEs 35.4557s for 1769472 events
```
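As a quick arithmetic check of the two quoted percentages, computed from the PhaseSpaceSampling timings above (both figures are rounded):

```latex
% Savings from skipping the xbin checks (78.1023s -> 53.0338s),
% quoted as "down by 30%"; and savings from the trivial madgraph5#969
% improvements (93.2907s -> 78.1023s), quoted as "down by 15%".
\[
  1 - \frac{53.0338}{78.1023} \approx 0.32, \qquad
  1 - \frac{78.1023}{93.2907} \approx 0.16
\]
```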
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Aug 22, 2024:
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.P1 and patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
I am doing a few tests with sample_get_x towards vectorising it, see #963.

I realised that there is a relatively large overhead there from some checks for warnings at the end. In a cuda-only build of DY+3jets (CMS #943 from @choij1589) these checks seem to take around 15 to 20% of the time for some subprocesses.

The question is whether these checks must be kept, and in which form. I actually noticed this because I had removed them in my vectorisation tests: this type of check is likely to be difficult to vectorise. I agree that some checks are needed, but maybe this could be reviewed.

These are the checks here (by the way, there may be a minor bug, as icount is not SAVEd).
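On the icount remark: under the older Fortran 77 rules, a DATA-initialised local variable that is later modified is not guaranteed to retain its value between calls unless it is SAVEd (more recent standards, and most compilers in practice, do imply SAVE for DATA-initialised variables). Below is a schematic sketch of the kind of throttled warning check being discussed, with the SAVE fix applied; the names and the check shape are illustrative, not the actual dsample.f code.

```fortran
      SUBROUTINE WARN_SKETCH(X, XLOW, XHIGH)
C     Schematic sketch (illustrative, not the actual dsample.f code) of
C     a throttled end-of-routine warning check, with the SAVE fix for
C     ICOUNT applied.
      IMPLICIT NONE
      DOUBLE PRECISION X, XLOW, XHIGH
      INTEGER ICOUNT
      SAVE ICOUNT
      DATA ICOUNT/0/
C     Without SAVE, ICOUNT (modified after its DATA initialisation) is
C     not guaranteed by Fortran 77 rules to persist between calls.
      IF (X.LT.XLOW .OR. X.GT.XHIGH) THEN
C       Print at most a few warnings, then stay silent
        IF (ICOUNT.LT.5) THEN
          ICOUNT = ICOUNT + 1
          WRITE(*,*) 'WARNING: x outside its bin boundaries', X
        ENDIF
      ENDIF
      END
```

Stateful, branchy bookkeeping of this kind is also exactly the sort of pattern that is hard to vectorise, which is consistent with the point above.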
@oliviermattelaer, any feedback? Thanks, Andrea