
WIP: gg to ttgggg (2->6 process) #601

Draft · valassi wants to merge 42 commits into master

Conversation

@valassi commented Feb 26, 2023

Following the discussion at the last meeting, I started doing a few tests of gg to ttgggg. Here's a first WIP MR with some changes.

Note on codegen

  • generating the standalone ggttgggg.sa succeeded (it took 10-20 minutes? should check again)
  • generating the madevent ggttgggg.mad failed with out-of-memory errors (generating madevent always takes longer than standalone; maybe it is the color index mapping, or the fortran helamps?)

NB: CPPProcess.cc is 32MB in size and contains 15495 Feynman diagrams and a 720x720 color matrix (720 = 6! color flows, one per permutation of the six gluons)

Note on builds of ggttgggg.sa:

  • building with cuda has been running for 20 hours
  • building with gcc failed with an internal compiler error, but this may be memory-related, as I see top going to around 8GB (I might eventually submit a bug report to gcc; see the note after this list)
  • building with clang is proceeding, both without and with inlining: unlike gcc, the clang build seems limited to 1.5GB RES memory in top all the time... but it may take many hours (days?)
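
On the gcc bug report: gcc can collect a self-contained reproducer automatically when an internal compiler error occurs, via -freport-bug (the same flag that one of the later commits below adds to the build flags). A minimal sketch, where the compile flags are just illustrative placeholders for the real CPPProcess.cc build line:

# with -freport-bug, gcc dumps the preprocessed source and a reproducer script on an ICE
g++ -freport-bug -O3 -std=c++17 -fPIC -c CPPProcess.cc -o CPPProcess.o
# gcc then prints the location of the generated files, to be attached to the bug report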

PS1: the cuda build currently on itscrd90:

top - 11:36:46 up 18:48,  5 users,  load average: 1.02, 1.03, 1.04
Tasks: 209 total,   2 running, 207 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.9 us,  0.0 sy,  0.0 ni, 74.8 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  15337.7 total,   4797.8 free,   7358.4 used,   3582.6 buff/cache
MiB Swap:  16000.0 total,  15991.5 free,      8.4 used.   7979.4 avail Mem 
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
   7726 avalassi  20   0 6394712   6.1g    768 R  99.3  40.4   1114:37 cicc    

PS2: the clang builds currently on lxplus9:

top - 11:37:58 up 38 days, 17:59, 11 users,  load average: 2.00, 2.02, 2.00
Tasks:  24 total,   3 running,  21 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.0 us,  0.1 sy,  0.0 ni, 79.7 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :  29099.6 total,   9461.0 free,   5657.3 used,  14473.0 buff/cache
MiB Swap:  10240.0 total,  10203.0 free,     37.0 used.  23442.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
2130639 avalassi  20   0 1667836   1.5g  84304 R  99.3   5.4  46:23.36 clang++ 
2130714 avalassi  20   0 1667972   1.5g  84440 R  99.3   5.4  46:14.50 clang++ 

valassi marked this pull request as draft on February 26, 2023 10:30
valassi self-assigned this on Feb 26, 2023
@valassi commented Feb 26, 2023

Later on, however (now), the clang builds have also increased in size:

top - 20:20:30 up 39 days,  2:42,  9 users,  load average: 2.34, 2.15, 2.11
Tasks:  24 total,   3 running,  21 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.1 us,  0.1 sy,  0.0 ni, 79.6 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem :  29099.6 total,   1742.8 free,  14135.6 used,  13712.6 buff/cache
MiB Swap:  10240.0 total,  10199.5 free,     40.5 used.  14963.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
2130639 avalassi  20   0 5086720   4.7g  36636 R  99.3  16.7 566:02.26 clang++ 
2130714 avalassi  20   0 6615896   6.2g  43004 R  99.3  21.8 565:50.69 clang++ 

And the cuda build too:

top - 20:21:59 up 4 days,  5:35,  3 users,  load average: 1.07, 1.10, 1.14
Tasks:  12 total,   2 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.8 us,  0.2 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 164532.5 total, 149810.8 free,   9596.9 used,   6663.6 buff/cache
MiB Swap: 249920.0 total, 249920.0 free,      0.0 used. 154935.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
 673484 avalassi  20   0 5943184   5.6g  15616 R 100.0   3.5 512:05.93 cicc    

@valassi commented Feb 27, 2023

Update: the cuda build is still running after more than one day.

top - 08:28:54 up 1 day, 15:40,  5 users,  load average: 1.03, 1.03, 1.00
Tasks: 212 total,   4 running, 208 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.6 us,  0.7 sy,  0.0 ni, 72.6 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :  15337.7 total,   3677.4 free,   7897.6 used,   4211.7 buff/cache
MiB Swap:  16000.0 total,  15991.5 free,      8.4 used.   7440.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
   7726 avalassi  20   0 6853192   6.5g    768 R  99.7  43.4   2361:36 cicc    

The clang++ build with inlining has been killed by the oom killer:

[avalassi@lxplus9s07 bash] ~> dmesg | grep kill | grep clang
[3381857.623536] clang++ invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[3381857.767210] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/user.slice/user-14546.slice,task_memcg=/user.slice/user-14546.slice/session-31008.scope,task=clang++,pid=2130714,uid=14546

The clang++ build without inlining was still running but seemed stuck (high memory but 0% CPU?). I saw that the AFS token had expired in the meantime, so I stopped the build with ctrl-z, renewed the token and resumed it with fg, but this caused a crash immediately afterwards:

[avalassi@lxplus9s07 clang14.0.6/cvmfs] /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp> fg
./tput/throughputX.sh -sa -ggttgggg -makej -avx2only
fatal error: error in backend: IO failure on output stream: Input/output error
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang++ --gcc-toolchain=/cvmfs/sft.cern.ch/lcg/releases/gcc/12.1.0-57c96/x86_64-centos9 -O3 -std=c++17 -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=haswell -fPIC -I. -I../../src -I../../../../../tools -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HAS_NO_CURAND -c -fcolor-diagnostics -o build.avx2_d_inl0_hrd0/CPPProcess.o CPPProcess.cc
1.      <eof> parser at end of file
 #0 0x0000000001f3d704 (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1f3d704)
 #1 0x0000000001f3b4a4 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1f3b4a4)
 #2 0x0000000001e920b4 llvm::CrashRecoveryContext::HandleExit(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1e920b4)
 #3 0x0000000001f333ce llvm::sys::Process::Exit(int, bool) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1f333ce)
 #4 0x0000000000a293d3 (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0xa293d3)
 #5 0x0000000001e98799 llvm::report_fatal_error(llvm::Twine const&, bool) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1e98799)
 #6 0x0000000001f0e7be (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1f0e7be)
 #7 0x0000000002272525 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2272525)
 #8 0x0000000002f3763d (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2f3763d)
 #9 0x0000000003c65f79 clang::ParseAST(clang::Sema&, bool, bool) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x3c65f79)
#10 0x0000000002929a69 clang::FrontendAction::Execute() (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2929a69)
#11 0x00000000028b7dcb clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x28b7dcb)
#12 0x00000000029d8cf3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x29d8cf3)
#13 0x0000000000a2a575 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0xa2a575)
#14 0x0000000000a27bfc (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0xa27bfc)
#15 0x0000000002741695 (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2741695)
#16 0x0000000001e91f43 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x1e91f43)
#17 0x0000000002741a29 (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2741a29)
#18 0x0000000002714d06 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&) const (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2714d06)
#19 0x0000000002715729 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*> >&) const (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2715729)
#20 0x0000000002724699 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*> >&) (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x2724699)
#21 0x0000000000993ea1 main (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0x993ea1)
#22 0x00007f930143feb0 __libc_start_call_main (/lib64/libc.so.6+0x3feb0)
#23 0x00007f930143ff60 __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff60)
#24 0x0000000000a26fe5 _start (/cvmfs/sft.cern.ch/lcg/releases/clang/14.0.6-14bdb/x86_64-centos9/bin/clang+++0xa26fe5)
make: *** [makefile:432: build.avx2_d_inl0_hrd0/CPPProcess.o] Error 1

I will restart this one...
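
For next time, a way to avoid the ctrl-z/fg dance might be to renew the token from a second shell while the build keeps running. A minimal sketch, assuming the standard Kerberos/AFS setup on lxplus:

# from another terminal on the same node, as the same user:
kinit      # refresh the Kerberos ticket (prompts for the password)
aklog      # derive a fresh AFS token from the Kerberos ticket
tokens     # verify the new AFS token and its expiry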

@valassi commented Feb 27, 2023

The clang++ build without inlining finally completed! It took 32 hours to compile CPPProcess.o on lxplus

[avalassi@lxplus9s07 clang14.0.6/cvmfs] /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp> ./tput/throughputX.sh -sa -ggttgggg -makej -avx2only -nocuda
...
DATE: 2023-02-27_18:15:02
On lxplus9s07.cern.ch [CPU: Intel Core Processor (Broadwell, IBRS)] [GPU: none]:
=========================================================================
runExe /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp/gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/check.exe -p 1 256 2 OMP=
Process                     = SIGMA_SM_GG_TTXGGGG_CPP [clang 14.0.6 (gcc 12.1.0)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CPP:DBL+CXS:COMMON+RMBHST+MESHST/avx2+NOVBRK
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 10
EvtsPerSec[Rmb+ME]     (23) = ( 2.824191e+00                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 2.824196e+00                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 2.824196e+00                 )  sec^-1
MeanMatrixElemValue         = ( 6.408665e-09 +- 2.650516e-09 )  GeV^-8
TOTAL       :   186.927826 sec
real    3m6.988s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:3823790) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
cmpExe /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp/gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/check.exe --common -p 2 64 2
cmpExe /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp/gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/fcheck.exe 2 64 2
Avg ME (C++/C++)    = 4.579798e-09
Avg ME (F77/C++)    = 4.5797964756172243E-009
Relative difference = 3.3284934734463e-07
OK (relative difference <= 5E-3)
=========================================================================
TEST COMPLETED

[avalassi@lxplus9s07 clang14.0.6/cvmfs] /afs/cern.ch/work/a/avalassi/GPU2023/madgraph4gpuClang14/epochX/cudacpp> ls -l gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/check*
-rwxr-xr-x. 1 avalassi zg 130832 Feb 27 18:15 gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/check.exe*
-rw-r--r--. 1 avalassi zg 163224 Feb 26 10:36 gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.avx2_d_inl0_hrd0/check_sa.o

I will relaunch the build with inlining.

Meanwhile, note that the cuda build is still ongoing...

@valassi commented Mar 4, 2023

The clang build with inlining never completed successfully (on lxplus, my interactive session was logged out every time within one or two days, which I suspect is a symptom of running out of memory).

As for cuda, the build is still running after one week! I will kill the process; it is unreasonable to keep it going any longer.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                        
   7726 avalassi  20   0 8794336   8.2g  22676 R  99.3  55.0  10127:59 cicc                                                                                           
   1297 root      20   0  260488  59356  16460 S   0.3   0.4   8:46.66 collectd                                                                                       

…ocess.cc which is 32MB)

Note: the generation of gg_ttgggg.mad failed, killed by the out-of-memory (oom) killer after ~1h30:
  dmesg -T | egrep -i 'killed process'
  [Fri Feb 24 21:45:56 2023] Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
…d (-makej -inl)

[root@itscrd90 cudacpp]# grep -i 'killed process' /var/log/messages
Feb 24 21:45:56 itscrd90.cern.ch kernel: Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 25738 (dbus-broker-lau) total-vm:19644kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:60kB oom_score_adj:200
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 2859216 (cudafe++) total-vm:4439180kB, anon-rss:2533172kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:8728kB oom_score_adj:0
Feb 25 12:09:59 itscrd90.cern.ch kernel: Out of memory: Killed process 2859218 (cudafe++) total-vm:4830956kB, anon-rss:2404060kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9504kB oom_score_adj:0
Feb 25 12:12:26 itscrd90.cern.ch kernel: Out of memory: Killed process 2859211 (cudafe++) total-vm:4830956kB, anon-rss:1651848kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9496kB oom_score_adj:0
Feb 25 12:17:51 itscrd90.cern.ch kernel: Out of memory: Killed process 2859172 (cc1plus) total-vm:5225996kB, anon-rss:3906132kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9800kB oom_score_adj:0

The first line is the failed generation of ggttgggg.mad yesterday.
The next lines are the failed builds.

NB: the builds failed already with inl0.
I only have gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.*hrd0 directories, and none contains a complete CPPProcess.o.

I will retry them one by one with:
./tput/throughputX.sh -ggttgggg -sa -512yonly -makeclean
…FLAGS+= -freport-bug" to prepare bug reports for internal compiler errors
@valassi commented Nov 25, 2023

I have rebased over upstream/master... I will probably close this MR as unmerged, but at least it's updated now. And I will cherry-pick a few commits elsewhere.

…FVs and for compiling them as separate object files (related to splitting kernels)
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
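
For reference, the unresolved symbol can be demangled with c++filt to see exactly which helas function the device link is missing:

echo '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd' | c++filt
# -> mg5amcGpu::helas_VVV1P0_1(double const*, double const*, double const*, double, double, double, double*)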
…cuda tests succeed

The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
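
My guess (to be verified) is that these warnings come from the device-link step falling back to nvcc's default architecture (sm_52 in CUDA 12.0), while the objects only contain compute_70/sm_70 code. If so, passing the same -gencode flags at link time should silence them; a hypothetical sketch (the output library name is a placeholder):

# pass the same arch flags to the nvcc link step as to the compile steps
nvcc -shared -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 \
  CPPProcess_cuda.o HelAmps_cuda.o -o libmg5amc_cuda.so   # placeholder output name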
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
…c++, a factor 3 slower for cuda...

./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly

diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt

-Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.589473e+07                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08                 )  sec^-1
-MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
-TOTAL       :     0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027      cycles                           #    2.887 GHz
-     3,171,868,018      instructions                     #    1.43  insn per cycle
-       0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME]     (23) = ( 2.667135e+07                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07                 )  sec^-1
+MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
+TOTAL       :     0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097      cycles                           #    2.889 GHz
+     3,361,475,195      instructions                     #    1.48  insn per cycle
+       0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…P* (the source is the same but it must be compiled in each P* separately)
git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
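
A sketch of how one could double-check whether a given rebuild actually came from the ccache cache (and hence whether its build time is a real compile time), by zeroing and then inspecting the cache statistics around the build, assuming the ccache in PATH is the one used by the makefiles:

ccache -z                                             # zero the ccache statistics
./tput/teeThroughputX.sh -ggttggg -makej -makeclean   # rebuild
ccache -s                                             # cache hits here mean the reported build time came from the cache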
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
…10-15% slower in both C++ and cuda

diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt

-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :  320.6913s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.5138s
- [COUNTERS] CudaCpp MEs      ( 2 ) :  316.1312s for    90112 events => throughput is 2.85E+02 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0463s
+ [COUNTERS] PROGRAM TOTAL          :  288.3304s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.4909s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :  283.7968s for    90112 events => throughput is 3.18E+02 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0426s

-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :   19.6663s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.9649s
- [COUNTERS] CudaCpp MEs      ( 2 ) :   13.4667s for    90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    1.2347s
+ [COUNTERS] PROGRAM TOTAL          :   18.0242s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9891s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :   11.9530s for    90112 events => throughput is 7.54E+03 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    1.0821s
…arnings and runtime test failures in HELINL=0

There are still build failures in HELINL=L
…allCOUP2 instead of allCOUP) to FFV2_4_0 and FFV2_4_3, fixing build failures in HELINL=L
…d CI access, to fix the issues observed in ee_mumu

I did not find an easier way to do this, because the model is known in the aloha caller but not at the time of aloha codegen.
…one, COUP1/COUP2 instead of COUP; two, CI/CD instead of CD)
Fix conflicts:
	epochX/cudacpp/tput/teeThroughputX.sh
	epochX/cudacpp/tput/throughputX.sh
Fix conflicts:
	epochX/cudacpp/tput/teeThroughputX.sh
	epochX/cudacpp/tput/throughputX.sh
@valassi commented Aug 29, 2024

I regenerated gg_ttgggg with the helas codegen of PR #978.

Using the HELINL=L option, this still fails to compile on gcc. I guess it must be the color algebra, which does not follow the helamps out of CPPProcess.cc?
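
To pin down whether it is really memory that kills cc1plus, one could wrap the CPPProcess.cc compile in GNU time and read off the peak RSS, instead of watching top. A sketch, assuming /usr/bin/time is GNU time; the flags mirror the CPPProcess_cpp.o line in the log below:

/usr/bin/time -v g++ -I. -I../../src -O3 -std=c++17 -Wall -Wshadow -Wextra -ffast-math \
  -march=x86-64 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS \
  -fPIC -c CPPProcess.cc -o CPPProcess_cpp.o
# 'Maximum resident set size (kbytes)' in the report is the compiler's peak memory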

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> date; make -j BACKEND=cppnone HELINL=L; date
Thu Aug 29 06:07:46 PM CEST 2024
BACKEND='cppnone'
OMPFLAGS=
FPTYPE='d'
HELINL='L'
HRDCOD='0'
HASCURAND=hasCurand
HASHIPRAND=hasNoHiprand
Building in BUILDDIR=. for tag=none_d_inlL_hrd0_hasCurand_hasNoHiprand (USEBUILDDIR != 1)
gfortran -I. -fPIC -c fcheck_sa.f -o fcheck_sa_fortran.o
make -C ../../src  -f cudacpp_src.mk
make[1]: Entering directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgggg.sa/src'
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -DUSE_NVTX -I/usr/local/cuda-12.0/include/ -DMGONGPU_HAS_NO_HIPRAND -c check_sa.cc -o check_sa_cpp.o
mkdir -p ../lib
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c CPPProcess.cc -o CPPProcess_cpp.o
ccache g++  -I. -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c read_slha.cc -o read_slha_cpp.o
ccache g++  -I. -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c Parameters_sm.cc -o Parameters_sm_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c MatrixElementKernels.cc -o MatrixElementKernels_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c BridgeKernels.cc -o BridgeKernels_cpp.o
ccache g++  -I. -I../../src -O3 -std=c++17 -Wall -Wshadow -Wextra -march=x86-64 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -fno-fast-math -c CrossSectionKernels.cc -o CrossSectionKernels_cpp.o
ccache g++ -shared -o ../lib/libmg5amc_common_cpp.so ./read_slha_cpp.o ./Parameters_sm_cpp.o 
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c HelAmps.cc -o HelAmps_cpp.o
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgggg.sa/src'
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c fbridge.cc -o fbridge_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c CommonRandomNumberKernel.cc -o CommonRandomNumberKernel_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c RamboSamplingKernels.cc -o RamboSamplingKernels_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -DMGONGPU_HAS_NO_HIPRAND -I/usr/local/cuda-12.0/include/ -c CurandRandomNumberKernel.cc -o CurandRandomNumberKernel_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -DMGONGPU_HAS_NO_HIPRAND -c HiprandRandomNumberKernel.cc -o HiprandRandomNumberKernel_cpp.o
ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c fsampler.cc -o fsampler_cpp.o
ccache g++  -I. -I../../src -I../../../../../test/googletest/install_gcc11.3.1/include -I../../../../../test/googletest/install_gcc11.3.1/include -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c testxxx.cc -o testxxx_cpp.o
ccache g++  -I. -I../../src -I../../../../../test/googletest/install_gcc11.3.1/include -I../../../../../test/googletest/install_gcc11.3.1/include -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c testmisc.cc -o testmisc_cpp.o
ccache g++  -I. -I../../src -I../../../../../test/googletest/install_gcc11.3.1/include -I../../../../../test/googletest/install_gcc11.3.1/include -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c runTest.cc -o runTest_cpp.o
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugs.almalinux.org/> for instructions.
make: *** [makefile:748: CPPProcess_cpp.o] Error 4
Thu Aug 29 06:25:23 PM CEST 2024

@valassi commented Aug 29, 2024

clang also fails, with a different error (255):

make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgggg.sa/src'
make: *** [makefile:751: CPPProcess_cpp.o] Error 255
Thu Aug 29 08:55:11 PM CEST 2024
[avalassi@itscrd90 clang17.0.1/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> 
