FPE in vxxxxx during runTest.exe (testxxx) for HIP on LUMI #1011
Comments
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
… 72h) for release v1.00.00 - one new issue madgraph5#1011 (FPEs in vxxxxx for LUMI)
(NB: this was run in parallel - a posteriori I reverted itscrd90 tput logs, except for 6 curhst logs, then squashed)
(To revert the curhst logs: "git checkout 4865525 tput/logs_*curhst*")
(1) Note, I had initially done a build and test without the -hip option, with some failures
STARTED AT Wed 02 Oct 2024 09:48:45 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed 02 Oct 2024 10:14:30 PM EEST [Status=1]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Wed 02 Oct 2024 10:45:14 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Wed 02 Oct 2024 10:48:26 PM EEST [Status=1]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Wed 02 Oct 2024 10:50:27 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Wed 02 Oct 2024 10:50:58 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Wed 02 Oct 2024 10:52:58 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Wed 02 Oct 2024 11:13:57 PM EEST [Status=0]
(2) This commit is the result of the second test, where I repeated using the -hip option (./tput/allTees.sh -hip)
STARTED AT Thu 03 Oct 2024 12:57:14 AM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Thu 03 Oct 2024 01:29:36 AM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(2) AT Thu 03 Oct 2024 01:38:03 AM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -nocuda
ENDED(3) AT Thu 03 Oct 2024 01:47:01 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -nocuda
ENDED(4) AT Thu 03 Oct 2024 01:49:00 AM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda'
ENDED(5) AT Thu 03 Oct 2024 01:49:00 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda
ENDED(6) AT Thu 03 Oct 2024 01:50:58 AM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(7) AT Thu 03 Oct 2024 02:00:26 AM EEST [Status=0]
NB: the results below come from an improved version of checklogs in tput/allTees.sh, from a later commit
No errors found in logs
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x74b3d0 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x728930 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_common.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x7618d0 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x74b3d0 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x117f910 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x77c170 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_rmbhst.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0x119a3d0 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0xc33230 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0xc32660 processed 0 events across 2 channels { }
tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x7809a0 processed 0 events across 2 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_common.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x8d9670 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x8c5930 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_rmbhst.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0x8d9670 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0x8c5930 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x8ec7f0 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x8978e0 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x8d9670 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x8c5930 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x1262600 processed 0 events across 123 channels { }
tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x94e8a0 processed 0 events across 123 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x75eb20 processed 0 events across 16 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x11bd0d0 processed 0 events across 16 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:DEBUG: MEK 0xd82780 processed 0 events across 16 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x73e480 processed 0 events across 16 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:DEBUG: MEK 0xb9ace0 processed 0 events across 16 channels { }
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:DEBUG: MEK 0xc4ab30 processed 0 events across 16 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_rmbhst.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0x6a5340 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_rmbhst.txt:DEBUG: MEK 0x11ac900 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0xd1c010 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x6fc940 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_common.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x6df940 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_common.txt:DEBUG: MEK 0x67fb00 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0xb882a0 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0x783ec0 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x6df940 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x67fb00 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/#log_ggtt_mad_f_inl0_hrd0.txt#:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_ggtt_mad/#log_ggtt_mad_f_inl0_hrd0.txt#:DEBUG: MEK 0x6df940 processed 0 events across 3 channels { }
tput/logs_ggtt_mad/#log_ggtt_mad_f_inl0_hrd0.txt#:DEBUG: MEK 0x67fb00 processed 0 events across 3 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:DEBUG: MEK 0xb83cf0 processed 0 events across 5 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x7896a0 processed 0 events across 5 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0xd1fcc0 processed 0 events across 5 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:DEBUG: MEK 0xd1b3b0 processed 0 events across 5 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x6e4740 processed 0 events across 5 channels { }
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x7298f0 processed 0 events across 5 channels { }
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x11a9de0 processed 0 events across 4 channels { }
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x11975c0 processed 0 events across 4 channels { }
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x74d7b0 processed 0 events across 4 channels { }
tput/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x729a10 processed 0 events across 4 channels { }
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x72f1d0 processed 0 events across 72 channels { }
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x871370 processed 0 events across 72 channels { }
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x7ea630 processed 0 events across 72 channels { }
tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x6dbd10 processed 0 events across 72 channels { }
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x6f2f60 processed 0 events across 6 channels { }
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt:DEBUG: MEK 0x6ee280 processed 0 events across 6 channels { }
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd1.txt:DEBUG: MEK 0xc36d80 processed 0 events across 6 channels { }
tput/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x788210 processed 0 events across 6 channels { }
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0xd71c40 processed 0 events across 3 channels { }
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt:DEBUG: MEK 0xd6e8e0 processed 0 events across 3 channels { }
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd1.txt:Floating Point Exception (GPU): 'vxxxxx' ievt=17
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x6f6ff0 processed 0 events across 3 channels { }
tput/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd1.txt:DEBUG: MEK 0x117d970 processed 0 events across 3 channels { }
eemumu MEK (channelid array) processed 512 events across 2 channels { 1 : 256, 2 : 256 }
eemumu MEK (no multichannel) processed 512 events across 2 channels { no-multichannel : 512 }
ggttggg MEK (channelid array) processed 512 events across 1240 channels { 1 : 32, 2 : 32, 4 : 32, 5 : 32, 7 : 32, 8 : 32, 14 : 32, 15 : 32, 16 : 32, 18 : 32, 19 : 32, 20 : 32, 22 : 32, 23 : 32, 24 : 32, 26 : 32 }
ggttggg MEK (no multichannel) processed 512 events across 1240 channels { no-multichannel : 512 }
ggttgg MEK (channelid array) processed 512 events across 123 channels { 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32, 17 : 32 }
ggttgg MEK (no multichannel) processed 512 events across 123 channels { no-multichannel : 512 }
ggttg MEK (channelid array) processed 512 events across 16 channels { 1 : 64, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32 }
ggttg MEK (no multichannel) processed 512 events across 16 channels { no-multichannel : 512 }
ggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
ggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
gqttq MEK (channelid array) processed 512 events across 5 channels { 1 : 128, 2 : 96, 3 : 96, 4 : 96, 5 : 96 }
gqttq MEK (no multichannel) processed 512 events across 5 channels { no-multichannel : 512 }
heftggbb MEK (channelid array) processed 512 events across 4 channels { 1 : 128, 2 : 128, 3 : 128, 4 : 128 }
heftggbb MEK (no multichannel) processed 512 events across 4 channels { no-multichannel : 512 }
smeftggtttt MEK (channelid array) processed 512 events across 72 channels { 1 : 32, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32 }
smeftggtttt MEK (no multichannel) processed 512 events across 72 channels { no-multichannel : 512 }
susyggt1t1 MEK (channelid array) processed 512 events across 6 channels { 2 : 128, 3 : 96, 4 : 96, 5 : 96, 6 : 96 }
susyggt1t1 MEK (no multichannel) processed 512 events across 6 channels { no-multichannel : 512 }
susyggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
susyggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
… '{ }' (identify madgraph5#1011 FPE in vxxxxx for HIP on LUMI)
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
…rd90
Revert "[install] rerun 30 tmad tests on LUMI worker node (small-g 72h) for release v1.00.00 - all as expected (heft fails madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit a6c94d0.
Revert "[install] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) for release v1.00.00 - one new issue madgraph5#1011 (FPEs in vxxxxx for LUMI)"
This reverts commit 217368c.
While I have a LUMI environment up and running, a few observations: I added -g to CXXFLAGS and then
Very strange, but this workaround seems to solve it.
Will target a later release v1.00.01.
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
…P: replace "pvec0 / ( vmass * pp )" by "pvec0 / vmass / pp"
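To see why this reassociation can matter numerically, here is a minimal host-side sketch; it is purely illustrative (the input values, the report helper and the assumption that an IEEE status flag is involved are not taken from the repository, and it does not touch the HIP device code where the FPE was actually reported). It only shows that the two orderings of the same expression can raise different floating-point flags in single precision.

```cpp
// Minimal, hypothetical host-side comparison of the two formulations.
// The values below are artificial and chosen only so that the two orderings
// raise different flags; they are NOT the actual ievt=17 kinematics.
#include <cfenv>
#include <cstdio>

static void report( const char* label, float value )
{
  std::printf( "%-26s = %g%s%s%s\n", label, (double)value,
               std::fetestexcept( FE_OVERFLOW ) ? " [FE_OVERFLOW]" : "",
               std::fetestexcept( FE_UNDERFLOW ) ? " [FE_UNDERFLOW]" : "",
               std::fetestexcept( FE_DIVBYZERO ) ? " [FE_DIVBYZERO]" : "" );
}

int main()
{
  volatile float pvec0 = 1.5f, vmass = 3.0e20f, pp = 3.0e20f; // artificial inputs
  std::feclearexcept( FE_ALL_EXCEPT );
  const float a = pvec0 / ( vmass * pp ); // original form: vmass*pp overflows first
  report( "pvec0 / ( vmass * pp )", a );
  std::feclearexcept( FE_ALL_EXCEPT );
  const float b = pvec0 / vmass / pp; // workaround form: no overflow, result is subnormal
  report( "pvec0 / vmass / pp", b );
  return 0;
}
```

With inputs like these the first ordering raises FE_OVERFLOW while the second only raises FE_UNDERFLOW, so if only a subset of exceptions is trapped the two forms behave differently; whether this is the actual mechanism on LUMI is an assumption, not something established in this issue.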
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
…P: replace "pvec0 / ( vmass * pp )" by "pvec0 / vmass / pp"
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 3, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 4, 2024
…vxxxxx on HIP: replace "pvec0 / ( vmass * pp )" by "pvec0 / vmass / pp"
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 4, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 4, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 4, 2024
…) with the workaround for HIP FPEs madgraph5#1011 - now all tests succeed
./tput/allTees.sh -hip
STARTED AT Fri 04 Oct 2024 09:31:32 AM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Fri 04 Oct 2024 10:33:14 AM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(2) AT Fri 04 Oct 2024 11:09:17 AM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -nocuda
ENDED(3) AT Fri 04 Oct 2024 11:17:27 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -nocuda
ENDED(4) AT Fri 04 Oct 2024 11:19:15 AM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda'
ENDED(5) AT Fri 04 Oct 2024 11:19:15 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda
ENDED(6) AT Fri 04 Oct 2024 11:21:02 AM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(7) AT Fri 04 Oct 2024 11:53:25 AM EEST [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
eemumu MEK (channelid array) processed 512 events across 2 channels { 1 : 256, 2 : 256 }
eemumu MEK (no multichannel) processed 512 events across 2 channels { no-multichannel : 512 }
ggttggg MEK (channelid array) processed 512 events across 1240 channels { 1 : 32, 2 : 32, 4 : 32, 5 : 32, 7 : 32, 8 : 32, 14 : 32, 15 : 32, 16 : 32, 18 : 32, 19 : 32, 20 : 32, 22 : 32, 23 : 32, 24 : 32, 26 : 32 }
ggttggg MEK (no multichannel) processed 512 events across 1240 channels { no-multichannel : 512 }
ggttgg MEK (channelid array) processed 512 events across 123 channels { 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32, 17 : 32 }
ggttgg MEK (no multichannel) processed 512 events across 123 channels { no-multichannel : 512 }
ggttg MEK (channelid array) processed 512 events across 16 channels { 1 : 64, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32 }
ggttg MEK (no multichannel) processed 512 events across 16 channels { no-multichannel : 512 }
ggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
ggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
gqttq MEK (channelid array) processed 512 events across 5 channels { 1 : 128, 2 : 96, 3 : 96, 4 : 96, 5 : 96 }
gqttq MEK (no multichannel) processed 512 events across 5 channels { no-multichannel : 512 }
heftggbb MEK (channelid array) processed 512 events across 4 channels { 1 : 128, 2 : 128, 3 : 128, 4 : 128 }
heftggbb MEK (no multichannel) processed 512 events across 4 channels { no-multichannel : 512 }
smeftggtttt MEK (channelid array) processed 512 events across 72 channels { 1 : 32, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32 }
smeftggtttt MEK (no multichannel) processed 512 events across 72 channels { no-multichannel : 512 }
susyggt1t1 MEK (channelid array) processed 512 events across 6 channels { 2 : 128, 3 : 96, 4 : 96, 5 : 96, 6 : 96 }
susyggt1t1 MEK (no multichannel) processed 512 events across 6 channels { no-multichannel : 512 }
susyggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
susyggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
valassi added a commit to valassi/madgraph4gpu that referenced this issue on Oct 4, 2024
Revert "[amd] rerun 30 tmad tests on LUMI worker node (small-g 72h) - no change (heft fails madgraph5#833, skip ggttggg madgraph5#933)" This reverts commit 07c2a53. Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) with the workaround for HIP FPEs madgraph5#1011 - now all tests succeed" This reverts commit 0ec8c1c.
I have rerun one final batch of large-scale tests for the v1.00.00 release, including LUMI.
I now systematically get FPEs in vxxxxx in runTest.exe on LUMI.
This is most likely related to #806. In that issue I bypassed a segfault by using -O2 instead of -O3. The segfault was difficult to identify precisely, but there were indications that it was in vxxxxx. Initially my tests all seemed to succeed. Now, after a few updates (I am not sure which ones; I thought the code was almost identical?), I systematically get many more tests failing, all in vxxxxx.
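As a side note on reproducing this kind of failure on the CPU side, the sketch below is a generic Linux/glibc recipe for turning FPEs into fatal SIGFPE signals; it is not code from this repository and says nothing about how the GPU-side check works. Trapping makes the first offending operation abort right at the faulting instruction, which is convenient under a debugger together with a -g build.

```cpp
// Generic Linux/glibc illustration (not repository code): trap FP exceptions so
// that the first invalid operation, division by zero or overflow raises SIGFPE
// and the program stops at the faulting instruction.
#include <fenv.h> // feenableexcept is a GNU extension (may need -D_GNU_SOURCE)
#include <cstdio>

int main()
{
  feenableexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW ); // enable trapping
  volatile float zero = 0.f;
  std::printf( "about to divide by zero...\n" );
  volatile float bad = 1.f / zero; // SIGFPE is raised here instead of silently producing inf
  std::printf( "never reached: %g\n", (double)bad );
  return 0;
}
```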
I am assigning this to myself as I have some ideas of what to look for, but anyone should feel free to investigate as well (please let me know if you do). I will release v1.00.00 with the issue anyway and mark it as pending.
NB: this should be nicely encapsulated and easy to debug, because the error is probably in the testxxx tests. These are executed for all physics processes, but they are completely independent of the physics process.
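As an illustration of what such an encapsulated, process-independent check could look like, here is a hypothetical Google Test sketch (the suite name, the vectorNorm stand-in and the input values are all invented here; the real checks live in testxxx and drive the actual vxxxxx implementation):

```cpp
// Hypothetical, process-independent test sketch (link against gtest_main).
// It does not call the real vxxxxx; it only mimics the idea of checking the
// problematic expression on fixed kinematics, with no physics process involved.
#include <gtest/gtest.h>
#include <cmath>

// Stand-in for the expression inside vxxxxx that triggered the FPE.
static float vectorNorm( float pvec0, float vmass, float pp )
{
  return pvec0 / vmass / pp; // workaround form; the original form was pvec0 / ( vmass * pp )
}

TEST( testxxxStandalone, vxxxxxLikeExpressionIsFinite )
{
  const float pvec0 = 750.f, vmass = 80.4f, pp = 650.f; // invented inputs, not ievt=17
  EXPECT_TRUE( std::isfinite( vectorNorm( pvec0, vmass, pp ) ) );
}
```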