WIP: gg to ttgggg (2->6 process) #601
Conversation
Later on, however (i.e. by now), clang has also increased in size.
And cuda too.
Update: the cuda build is still running after more than one day.
The clang++ build with inlining has been killed by the OOM killer.
The clang++ build without inlining was still running but seemed stuck: high memory but 0% CPU. I saw that the AFS token had expired in the meantime, so I stopped the process with ctrl-z, renewed the token and resumed it with fg, but this caused a crash immediately afterwards.
I will restart this one...
The clang++ build without inlining finally completed! It took 32 hours to compile CPPProcess.o on lxplus.
I will relaunch the build with inlining. Note that, by contrast, the cuda build is still ongoing...
The clang build with inlining never completed successfully (on lxplus, my interactive process was logged out every time within one or two days, which I suspect is a symptom of running out of memory). As for cuda, the build is still running after one week! I will kill the process; it is unreasonable to keep it going any longer.
…ocess.cc which is 32MB)
Note: the generation of gg_ttgggg.mad failed, killed by the out-of-memory (OOM) killer after ~1h30:
dmesg -T | egrep -i 'killed process'
[Fri Feb 24 21:45:56 2023] Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
…d (-makej -inl)
[root@itscrd90 cudacpp]# grep -i 'killed process' /var/log/messages
Feb 24 21:45:56 itscrd90.cern.ch kernel: Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 25738 (dbus-broker-lau) total-vm:19644kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:60kB oom_score_adj:200
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 2859216 (cudafe++) total-vm:4439180kB, anon-rss:2533172kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:8728kB oom_score_adj:0
Feb 25 12:09:59 itscrd90.cern.ch kernel: Out of memory: Killed process 2859218 (cudafe++) total-vm:4830956kB, anon-rss:2404060kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9504kB oom_score_adj:0
Feb 25 12:12:26 itscrd90.cern.ch kernel: Out of memory: Killed process 2859211 (cudafe++) total-vm:4830956kB, anon-rss:1651848kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9496kB oom_score_adj:0
Feb 25 12:17:51 itscrd90.cern.ch kernel: Out of memory: Killed process 2859172 (cc1plus) total-vm:5225996kB, anon-rss:3906132kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9800kB oom_score_adj:0
The first line is the failed generation of ggttgggg.mad yesterday. The next lines are the failed builds.
NB: the builds failed already with inl0. I only have gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.*hrd0 and none has a complete CPPProcess.o.
Will retry one by one as ./tput/throughputX.sh -ggttgggg -sa -512yonly -makeclean
…FLAGS+= -freport-bug" to prepare bug reports for internal compiler errors
I have rebased over upstream/master... I will probably close this MR as unmerged, but at least it's updated now. And I will cherry-pick a few commits elsewhere.
…FVs and for compiling them as separate object files (related to splitting kernels)
…d MemoryAccessMomenta.h
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda:
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
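Side note (editorial, not from the PR): "ptxas fatal : Unresolved extern function" is typically what nvcc reports when a __device__ function is called in one translation unit but defined in another and relocatable device code is not enabled. A minimal sketch of that separate-compilation pattern, with hypothetical file and function names:

// HelAmpsDef.cu (hypothetical): device function defined in its own translation unit
__device__ double helas_sum( double a, double b ) { return a + b; }

// Process.cu (hypothetical): the caller only sees a declaration of the device function
extern __device__ double helas_sum( double a, double b );
__global__ void sigmaKinSketch( const double* in, double* out )
{
  out[threadIdx.x] = helas_sum( in[threadIdx.x], 1. );
}

// Both files need relocatable device code plus a device-link step, e.g.
//   nvcc -dc HelAmpsDef.cu Process.cu
//   nvcc -dlink HelAmpsDef.o Process.o -o devlink.o
// Without -dc (or -rdc=true), ptxas cannot resolve the extern device symbol.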
…cuda tests succeed. The build issues some warnings however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
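My reading of how this switch could work (a hedged sketch with an illustrative signature and placeholder body, not the actual generated HelAmps.h): with MGONGPU_LINKER_HELAMPS defined (HELINL=L), the header only declares the helas routines and their definitions are compiled once in a separate HelAmps translation unit that is then linked in; otherwise the definitions remain inline in the header.

#ifdef MGONGPU_LINKER_HELAMPS
// HELINL=L: declaration only; the definition lives in a separate HelAmps.cc/HelAmps_cuda.o
__device__ void helas_VVV1P0_1( const double* allV2, const double* allV3, const double coup, double* allV1 );
#else
// HELINL=0: the full definition is visible (and inlinable) wherever the header is included
__device__ inline void helas_VVV1P0_1( const double* allV2, const double* allV3, const double coup, double* allV1 )
{
  for( int i = 0; i < 6; i++ ) allV1[i] = coup * ( allV2[i] + allV3[i] ); // placeholder body, not the real VVV1P0_1 kinematics
}
#endif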
…nd -inlLonly options
… to ease code generation
…y in the HELINL=L mode
…c++, a factor 3 slower for cuda...
./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly
diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt
-Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.589473e+07 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08 ) sec^-1
-MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
-TOTAL : 0.528239 sec
-INFO: No Floating Point Exceptions have been reported
- 2,222,057,027 cycles # 2.887 GHz
- 3,171,868,018 instructions # 1.43 insn per cycle
- 0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME] (23) = ( 2.667135e+07 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07 ) sec^-1
+MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
+TOTAL : 0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+ 2,272,219,097 cycles # 2.889 GHz
+ 3,361,475,195 instructions # 1.48 insn per cycle
+ 0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…lates in HELINL=L mode
…t.mad of HelAmps.h in HELINL=L mode
…t.mad of CPPProcess.cc in HELINL=L mode
…P* (the source is the same but it must be compiled in each P* separately)
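For illustration only (hypothetical header and constant names, my guess at the reason): the shared source picks up process-specific compile-time constants such as the number of external particles, so the identical .cc text produces different object code in each P* directory and cannot be built just once.

// Shared HelAmps-like source (sketch): identical text in every P*, different objects
#include "processConstants.h" // hypothetical per-P* header, e.g. defining "constexpr int npar = 7;" for gg_ttxggg
double momentaBytesPerEvent()
{
  return npar * 4 * sizeof( double ); // depends on the per-process npar, hence the per-P* compilation
}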
… complete its backport
git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache
./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)
./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly
diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.338149e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02 ) sec^-1
-MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 2.242693 sec
-INFO: No Floating Point Exceptions have been reported
- 7,348,976,543 cycles # 2.902 GHz
- 16,466,315,526 instructions # 2.24 insn per cycle
- 2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME] (23) = ( 4.063038e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02 ) sec^-1
+MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
+TOTAL : 2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+ 7,969,059,552 cycles # 2.893 GHz
+ 17,401,037,642 instructions # 2.18 insn per cycle
+ 2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
Workflow summary = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME] (23) = ( 3.459662e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02 ) sec^-1
+EvtsPerSec[Rmb+ME] (23) = ( 3.835352e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02 ) sec^-1
MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 1.528240 sec
+TOTAL : 1.378567 sec
INFO: No Floating Point Exceptions have been reported
- 4,140,408,789 cycles # 2.703 GHz
- 9,072,597,595 instructions # 2.19 insn per cycle
- 1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:94048) (512y: 91) (512z: 0)
+ 3,738,350,469 cycles # 2.705 GHz
+ 8,514,195,736 instructions # 2.28 insn per cycle
+ 1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:80619) (512y: 89) (512z: 0)
-------------------------------------------------------------------------
…10-15% slower in both C++ and cuda
diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 128/128
[XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
[XSECTION] ChannelId = 1
[XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
[UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 320.6913s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.5138s
- [COUNTERS] CudaCpp MEs ( 2 ) : 316.1312s for 90112 events => throughput is 2.85E+02 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 0.0463s
+ [COUNTERS] PROGRAM TOTAL : 288.3304s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.4909s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 283.7968s for 90112 events => throughput is 3.18E+02 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 0.0426s
-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 128/128
[XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
[XSECTION] ChannelId = 1
[XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
[UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 19.6663s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.9649s
- [COUNTERS] CudaCpp MEs ( 2 ) : 13.4667s for 90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 1.2347s
+ [COUNTERS] PROGRAM TOTAL : 18.0242s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.9891s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 11.9530s for 90112 events => throughput is 7.54E+03 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 1.0821s
…arnings and runtime test failures in HELINL=0. There are still build failures in HELINL=L.
…allCOUP2 instead of allCOUP) to FFV2_4_0 and FFV2_4_3, fixing build failures in HELINL=L
…d CI access, to fix the issues observed in ee_mumu. I did not find an easier way to do this, because the model is known in the aloha caller but not at the time of aloha codegen.
…one, COUP1/COUP2 instead of COUP; two, CI/CD instead of CD)
Fix conflicts: epochX/cudacpp/tput/teeThroughputX.sh epochX/cudacpp/tput/throughputX.sh
Fix conflicts: epochX/cudacpp/tput/teeThroughputX.sh epochX/cudacpp/tput/throughputX.sh
I regenerated gg_ttgggg with the helas codegen of PR #978. Using the HELINL=L option this still fails compilation on gcc. I guess it must be the color algebra that does not follow?
Also clang fails, with a different error (255).
Following the discussion at the last meeting, I started doing a few tests of gg to ttgggg. Here's a first WIP MR with some changes.
Note on codegen
NB: CPPProcess.cc is 32MB in size and contains 15495 Feynman diagrams and a 720x720 color matrix.
Note on builds of ggttgggg.sa:
PS1: the cuda build is currently running on itscrd90.
PS2: the clang build is currently running on lxplus9.