Fixes in xxxxx for IEEE_DIVIDE_BY_ZERO FPE; separate cpu/gpu namespaces and fix runtest segfault #723
…_ZERO (see firemodels/fds/issues/5638 on gh) with -ffpe flags

However, the build gives this warning
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -O3 -std=c++17 -I. -I../../src -I../../../../../test/googletest/install/include -I../../../../../test/googletest/install/include -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -ffpe-trap=invalid,zero,overflow -ffpe-summary=none -fPIC -c testxxx.cc -o testxxx.o
cc1plus: warning: command-line option ‘-ffpe-trap=invalid,zero,overflow’ is valid for Fortran but not for C++
cc1plus: warning: command-line option ‘-ffpe-summary=none’ is valid for Fortran but not for C++

I will revert
Revert "[fpe] in ggttsa cudacpp.mk, try to debug madgraph5#701 IEEE_DIVIDE_BY_ZERO (see firemodels/fds/issues/5638 on gh) with -ffpe flags" This reverts commit d75e426.
…als to debug madgraph5#701 (see https://stackoverflow.com/a/17473528)

This works as expected:
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe --gtest_filter=*xxx
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *xxx
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
Floating point exception (core dumped)
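For context, the C++ counterpart of gfortran's -ffpe-trap is a runtime call rather than a compiler flag. The following is a minimal sketch, assuming glibc (feenableexcept and fegetexcept are GNU extensions and are not available on every platform). The helper names enableFpeTraps and currentFpeTrapMask are hypothetical and are not taken from this PR.

```cpp
#include <cfenv>

// Hypothetical helper (not from this PR): ask the FPU to raise SIGFPE for
// division by zero, invalid operations and overflow, mirroring
// -ffpe-trap=invalid,zero,overflow. Returns true if the traps were enabled.
inline bool enableFpeTraps()
{
#if defined( __GLIBC__ )
  std::feclearexcept( FE_ALL_EXCEPT ); // drop stale flags raised by earlier operations
  return feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW ) != -1;
#else
  return false; // assumption: no equivalent call is used on other platforms
#endif
}

// Hypothetical helper: the set of exceptions that currently trap (0 if unsupported).
inline int currentFpeTrapMask()
{
#if defined( __GLIBC__ )
  return fegetexcept();
#else
  return 0;
#endif
}
```

With the traps enabled and no handler installed, the first trapping operation terminates the process with SIGFPE, which is what the "Floating point exception (core dumped)" message above reports.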
…signal handler for madgraph5#701

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> make -j AVX=512y
...
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe --gtest_filter=*xxx
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *xxx
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
Floating Point Exception (CPU neppV=4): 'ipzxxx'
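A SIGFPE handler of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses POSIX sigaction, and the names g_fpeContext, fpeSignalHandler, installFpeHandler and isFpeHandlerInstalled are hypothetical, not the ones used in this PR. The message format is simplified with respect to the output above.

```cpp
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Hypothetical label for the code currently under test (e.g. "ipzxxx").
static const char* volatile g_fpeContext = "unknown";

// Write to stderr using only the async-signal-safe write(); give up on error.
static void safeWrite( const char* s, std::size_t n )
{
  if( write( STDERR_FILENO, s, n ) < 0 ) _exit( 2 );
}

// SIGFPE handler: print a short message and exit. It never returns, because
// resuming after a hardware FPE trap would re-execute the faulting instruction.
extern "C" void fpeSignalHandler( int /*sig*/ )
{
  const char prefix[] = "Floating Point Exception: '";
  const char suffix[] = "'\n";
  const char* ctx = g_fpeContext;
  std::size_t len = 0;
  while( ctx[len] != '\0' ) len++; // manual strlen, to stay signal-safe
  safeWrite( prefix, sizeof( prefix ) - 1 );
  safeWrite( ctx, len );
  safeWrite( suffix, sizeof( suffix ) - 1 );
  _exit( 1 );
}

// Install the handler and record which function is about to be tested.
inline bool installFpeHandler( const char* context )
{
  g_fpeContext = context;
  struct sigaction sa = {};
  sa.sa_handler = fpeSignalHandler;
  sigemptyset( &sa.sa_mask );
  return sigaction( SIGFPE, &sa, nullptr ) == 0;
}

// Query whether the handler above is the current SIGFPE disposition.
inline bool isFpeHandlerInstalled()
{
  struct sigaction cur = {};
  return sigaction( SIGFPE, nullptr, &cur ) == 0 && cur.sa_handler == fpeSignalHandler;
}
```

Updating the context label before each tested function is what lets the message name the function that trapped, instead of only reporting a core dump.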
…CPP_RUNTIME_DISABLEFPE is set

Note: as observed last week, a debug build triggers an FPE exception already in ixxxxx
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
Floating Point Exception (CPU neppV=4): 'ixxxxx'

Conversely, in the same debug build, disabling FPEs with the env variable gives a successful test
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> CUDACPP_RUNTIME_DISABLEFPE=1 ./runTest.exe
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX (0 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC (0 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_CPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
[ OK ] SIGMA_SM_GG_TTX_CPU/MadgraphTest.CompareMomentaAndME/0 (34 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU/MadgraphTest (34 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (35 ms total)
[ PASSED ] 3 tests.
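The runtime switch used above can be sketched as a simple environment lookup. This is a minimal sketch, assuming that setting CUDACPP_RUNTIME_DISABLEFPE to any value (even an empty one) disables the traps; the function name fpeTrapsRequested is hypothetical.

```cpp
#include <cstdlib>

// True if FPE traps should be enabled, i.e. CUDACPP_RUNTIME_DISABLEFPE is not set.
// Assumption: any value, even an empty string, counts as "set" and disables the traps.
inline bool fpeTrapsRequested()
{
  return std::getenv( "CUDACPP_RUNTIME_DISABLEFPE" ) == nullptr;
}
```

Start-up code would then enable the SIGFPE traps only when fpeTrapsRequested() returns true, which matches the two runs above: an FPE by default, and a passing test with CUDACPP_RUNTIME_DISABLEFPE=1.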
No change in runTest behaviour, FPEs by default, succeeds if FPEs disabled
…et cast) No change in runTest behaviour, FPEs by default, succeeds if FPEs disabled
…handler). This also includes a resetHstMomentaToPar0, which is commented out for the moment. The idea was to modify the momenta before each xxx call, to ensure that they are all consistent. But I will instead implement a more solid fix. No change in runTest behaviour, FPEs by default, succeeds if FPEs disabled
…dgraph5#701 in function ixxxxx This builds ok
In debug mode this fails like this
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
nsp=-1 ievt=0: 500, 0, 0, 500,
IXXXXX: sqp0p3={ -0, -0, -0, -0 }
Floating Point Exception (CPU neppV=4): 'ixxxxx' ievt=0

Note: last week the sqp0p3 were not all 0. I am not sure what I was doing (I was using hstReset?). Anyway: I will revert this commit and the previous one. We need a much more solid fix in all xxx functions.
…l start from scratch

Revert "[fpe] in ggtt.sa HelAmps_sm.h, add some debugging printouts for ixxxxx"
This reverts commit fdacc5e

Revert "[fpe] in ggtt.sa HelAmps_sm.h, first (OLD!) attempt of BUG FIX FOR madgraph5#701 in function ixxxxx"
This reverts commit 7674824.
The build fails because maskand is also defined in testmisc.cc
… mgOnGpuVectors.h now
This now shows (in debug builds) that the first test executed is ixxxxx and that it immediately fails with an FPE
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
nsp=-1 ievt=0: 500, 0, 0, 500,
Prepare test ixxxxx ievt=0
Floating Point Exception (CPU neppV=4): 'ixxxxx' ievt=0
…ptype& r )" to create cx vectors from fp scalars
…ion ixxxxx

This builds and runs ok. The FPE (always in debug mode) is now moved from ixxxxx to the next ipzxxx
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
nsp=-1 ievt=0: 500, 0, 0, 500,
Prepare test ixxxxx ievt=0
Prepare test ipzxxx ievt=0
Floating Point Exception (CPU neppV=4): 'ipzxxx' ievt=0
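For readers unfamiliar with why vectorized code traps here: branch-free SIMD code evaluates both sides of a conditional in every lane, so a division by a zero denominator is executed even in lanes whose result is then discarded. The sketch below illustrates one common way to avoid this. It is not necessarily the fix applied in this commit, and the name safeDivide and the std::array stand-ins for the SIMD vector types are assumptions made for illustration.

```cpp
#include <array>

constexpr int neppV = 4; // SIMD vector size, as in the "neppV=4" messages above

using fptype_v = std::array<double, neppV>; // stand-in for a SIMD vector of doubles

// Lane-wise num/den that never executes a division by zero: lanes with a
// zero denominator divide by 1 instead and then take the fallback value.
inline fptype_v safeDivide( const fptype_v& num, const fptype_v& den, double fallback )
{
  fptype_v out{};
  for( int i = 0; i < neppV; i++ )
  {
    const bool ok = ( den[i] != 0. );        // false for both +0 and -0
    const double safeDen = ok ? den[i] : 1.; // harmless denominator in masked lanes
    const double q = num[i] / safeDen;       // always a well-defined division
    out[i] = ok ? q : fallback;
  }
  return out;
}
```

The point of the design is that the denominator is patched before the division rather than the result after it: selecting the result afterwards is too late, because the trap fires when the division is executed.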
…ginning of each test (prepare to modify momenta for ipzxxx) No change in runTest behaviour, FPEs by default in ipzxxx, succeeds if FPEs disabled
…respecting the relevant assumptions

Assumption example for ipzxxx: (FMASS == 0) and (PX == PY == 0 and E == +PZ > 0)
This is done by testing one ievt and copying all momenta to that ievt
NB: after adding the workaround for ipzxxx, now the test fails in vxxxxx, which is the real issue in madgraph5#701
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
nsp=-1 ievt=0: 500, 0, 0, 500,
Prepare test ixxxxx ievt=0
Prepare test ipzxxx ievt=0
Prepare test vxxxxx ievt=0
Floating Point Exception (CPU neppV=4): 'vxxxxx' ievt=0
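The ipzxxx assumption quoted above can be written as a predicate, together with a helper that fills a whole SIMD page with one event so that every lane respects the same assumption. This is only a sketch of the idea as read from the commit message: the Momentum struct, its (E, PX, PY, PZ) component order and the function names are hypothetical.

```cpp
#include <array>

constexpr int neppV = 4; // SIMD vector size, as in the "neppV=4" messages above

// Hypothetical momentum type; the (E, PX, PY, PZ) order is an assumption.
struct Momentum
{
  double e, px, py, pz;
};

// Assumption for ipzxxx: (FMASS == 0) and (PX == PY == 0 and E == +PZ > 0).
inline bool satisfiesIpzxxxAssumptions( const Momentum& p, double fmass )
{
  return fmass == 0. && p.px == 0. && p.py == 0. && p.e == p.pz && p.pz > 0.;
}

// Copy the momentum of event ievt into every lane of the page, so that all
// events processed together with ievt satisfy the same assumptions.
inline std::array<Momentum, neppV> replicateEvent( const std::array<Momentum, neppV>& page, int ievt )
{
  std::array<Momentum, neppV> out;
  out.fill( page[ievt] );
  return out;
}
```

If the values "500, 0, 0, 500" printed for ievt=0 are read in (E, PX, PY, PZ) order, they satisfy this predicate for a massless particle, whereas an event with negative PZ would not and should not be passed to ipzxxx.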
…ion vxxxxx

This builds and runs ok. The FPE (always in debug mode) is now moved from vxxxxx to the next oxxxxx
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
nsp=-1 ievt=0: 500, 0, 0, 500,
Prepare test ixxxxx ievt=0
Prepare test ipzxxx ievt=0
Prepare test vxxxxx ievt=0
Prepare test sxxxxx ievt=0
Prepare test oxxxxx ievt=0
Floating Point Exception (CPU neppV=4): 'oxxxxx' ievt=0
NB1: This also adds LIBFLAGS to link command for shared libraries This is needed to avoid "hidden symbol `__gcov_init' in ...libgcov.a(_gcov.o) is referenced by DSO" errors NB2: I will not add a gcov target to .mad makefiles (they have no debug target either yet)
…make clean' Revert "[fpe] in ggt.sa .gitignore, add gcov suffixes to gitignore" This reverts commit eb5594d.
…rocesses

for f in `gitls */SubProcesses/MemoryAccessDenominators.h`; do \cp gg_tt.mad/SubProcesses/MemoryAccessDenominators.h $f; done
for f in `gitls */SubProcesses/MemoryAccessNumerators.h`; do \cp gg_tt.mad/SubProcesses/MemoryAccessNumerators.h $f; done
Note: the performance is very similar to that of upstream/master. Maybe only the simplest 2->2 processes are a bit slower, but that's acceptable. The number of SIMD instructions has changed, but not in all builds, which is a bit surprising. All in all, things look ok.

STARTED AT Thu Jul 20 18:19:00 CEST 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Jul 20 21:22:53 CEST 2023 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Jul 20 21:48:39 CEST 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Jul 20 21:58:10 CEST 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Jul 20 22:01:12 CEST 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Jul 20 22:04:11 CEST 2023 [Status=0]
Note: performance remains very similar to upstream/master

STARTED AT Thu Jul 20 22:07:15 CEST 2023
ENDED AT Fri Jul 21 02:16:03 CEST 2023
Status=0
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
… easier merging

This ~completes the fpe and namespace patches, addressing madgraph5#701 and madgraph5#725, respectively. (HOWEVER, the CI on MacOS failed for this with madgraph5#730 - still a few things to change before merging).

Unfortunately, my tests show that this patch only fixes the IEEE_DIVIDE_BY_ZERO part of madgraph5#701: other issues remain (being debugged in branch nobm).

Revert "[fpe] rerun 15 tmad - ggttgg tests fail again madgraph5#655 as expected"
This reverts commit 9212960.

Revert "[fpe] rerun 78 tput alltees, all ok"
This reverts commit 9a68868.
…esses towards src - this fixes HRDCOD=1 builds on non-SM processes madgraph5#731
…da of non-SM) to CODEGEN from heft_gg_h.sa
…madgraph5#730 and madgraph5#731

This completes the fpe and namespace patches, addressing madgraph5#701 and madgraph5#725, respectively.

Unfortunately, my tests show that this patch only fixes the IEEE_DIVIDE_BY_ZERO part of madgraph5#701: other issues remain (being debugged in branch nobm and in madgraph5#733), namely IEEE_INVALID_FLAG, IEEE_UNDERFLOW_FLAG and IEEE_DENORMAL.
This is finally complete - as good as it gets - and passes the CI tests. I am self merging. I will then document it a posteriori.
(this is the merge of fpe as of commit 49f9d3f, which will be merged to master in madgraph5#723)
This is some documentation for this MR #723. It addresses two rather large and complex issues and fixes 8 GitHub issues in total.
I think this should be more or less all for this MR. Next steps on this line of work would be
cc @roiser @oliviermattelaer @hageboeck @zeniheisser @Jooorgen
This is a WIP MR with comprehensive fixes in xxxxxx functions for floating point exceptions (FPEs)
It is motivated by and is meant to fix bug #701