
workaround for FPE in vxxxxx on HIP (and fixes for v1.00.01 tags) #1012

Merged
merged 11 commits on Oct 4, 2024

Conversation


@valassi valassi commented Oct 3, 2024

This is a WIP PR with a workaround for the FPE in vxxxxx on HIP #1011.

It is WIP because of the open items quoted in the follow-up comment below.
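
For context, vxxxxx is the HELAS-style routine that builds a vector-boson wavefunction from its momentum, mass and helicity, and the HIP FPE of #1011 appears inside it. Below is a minimal, hypothetical sketch of the kind of guard that can avoid such an FPE when the transverse momentum vanishes; the function and variable names are illustrative assumptions, and this is not the actual diff of this PR.

```cpp
// Hypothetical sketch only (NOT the actual PR change): a common way to avoid
// an FPE in a vxxxxx-like polarization computation is to branch explicitly
// when the transverse momentum vanishes, rather than dividing by it.
#include <cmath>

inline void transversePolarization( double px, double py, double& ex, double& ey )
{
  const double pt = std::sqrt( px * px + py * py );
  if( pt > 0. )
  {
    ex = px / pt; // safe: pt is strictly positive on this branch
    ey = py / pt;
  }
  else
  {
    ex = 1.; // momentum along the z axis: pick a fixed transverse basis
    ey = 0.;
  }
}
```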

@valassi valassi self-assigned this Oct 3, 2024
@valassi valassi requested a review from a team as a code owner October 3, 2024 13:34
@valassi valassi linked an issue Oct 3, 2024 that may be closed by this pull request
@valassi valassi marked this pull request as draft October 3, 2024 13:34

valassi commented Oct 4, 2024

  • must first release 1.00.00

This is done; I have resynced this branch to the latest master.

I also fixed #1013 and included it here.

  • must backport to codegen, regenerate code, run all tests

The codegen changes have been backported and I have regenerated all processes.

I will run all tests later on

…) with the workaround for HIP FPEs madgraph5#1011 - now all tests succeed

./tput/allTees.sh -hip

STARTED  AT Fri 04 Oct 2024 09:31:32 AM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Fri 04 Oct 2024 10:33:14 AM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Fri 04 Oct 2024 11:09:17 AM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean  -nocuda
ENDED(3) AT Fri 04 Oct 2024 11:17:27 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst  -nocuda
ENDED(4) AT Fri 04 Oct 2024 11:19:15 AM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda'
ENDED(5) AT Fri 04 Oct 2024 11:19:15 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda
ENDED(6) AT Fri 04 Oct 2024 11:21:02 AM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Fri 04 Oct 2024 11:53:25 AM EEST [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

eemumu MEK (channelid array) processed 512 events across 2 channels { 1 : 256, 2 : 256 }
eemumu MEK (no multichannel) processed 512 events across 2 channels { no-multichannel : 512 }
ggttggg MEK (channelid array) processed 512 events across 1240 channels { 1 : 32, 2 : 32, 4 : 32, 5 : 32, 7 : 32, 8 : 32, 14 : 32, 15 : 32, 16 : 32, 18 : 32, 19 : 32, 20 : 32, 22 : 32, 23 : 32, 24 : 32, 26 : 32 }
ggttggg MEK (no multichannel) processed 512 events across 1240 channels { no-multichannel : 512 }
ggttgg MEK (channelid array) processed 512 events across 123 channels { 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32, 17 : 32 }
ggttgg MEK (no multichannel) processed 512 events across 123 channels { no-multichannel : 512 }
ggttg MEK (channelid array) processed 512 events across 16 channels { 1 : 64, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32 }
ggttg MEK (no multichannel) processed 512 events across 16 channels { no-multichannel : 512 }
ggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
ggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
gqttq MEK (channelid array) processed 512 events across 5 channels { 1 : 128, 2 : 96, 3 : 96, 4 : 96, 5 : 96 }
gqttq MEK (no multichannel) processed 512 events across 5 channels { no-multichannel : 512 }
heftggbb MEK (channelid array) processed 512 events across 4 channels { 1 : 128, 2 : 128, 3 : 128, 4 : 128 }
heftggbb MEK (no multichannel) processed 512 events across 4 channels { no-multichannel : 512 }
smeftggtttt MEK (channelid array) processed 512 events across 72 channels { 1 : 32, 2 : 32, 3 : 32, 4 : 32, 5 : 32, 6 : 32, 7 : 32, 8 : 32, 9 : 32, 10 : 32, 11 : 32, 12 : 32, 13 : 32, 14 : 32, 15 : 32, 16 : 32 }
smeftggtttt MEK (no multichannel) processed 512 events across 72 channels { no-multichannel : 512 }
susyggt1t1 MEK (channelid array) processed 512 events across 6 channels { 2 : 128, 3 : 96, 4 : 96, 5 : 96, 6 : 96 }
susyggt1t1 MEK (no multichannel) processed 512 events across 6 channels { no-multichannel : 512 }
susyggtt MEK (channelid array) processed 512 events across 3 channels { 1 : 192, 2 : 160, 3 : 160 }
susyggtt MEK (no multichannel) processed 512 events across 3 channels { no-multichannel : 512 }
…ge (heft fails madgraph5#833, skip ggttggg madgraph5#933)

./tmad/allTees.sh -hip

STARTED  AT Fri 04 Oct 2024 11:53:26 AM EEST
(SM tests)
ENDED(1) AT Fri 04 Oct 2024 02:12:45 PM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 04 Oct 2024 02:22:24 PM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

eemumu MEK processed 81920 events across 2 channels { 1 : 81920 }
eemumu MEK processed 8192 events across 2 channels { 1 : 8192 }
ggttggg MEK processed 81920 events across 1240 channels { 1 : 81920 }
ggttggg MEK processed 8192 events across 1240 channels { 1 : 8192 }
ggttgg MEK processed 81920 events across 123 channels { 112 : 81920 }
ggttgg MEK processed 8192 events across 123 channels { 112 : 8192 }
ggttg MEK processed 81920 events across 16 channels { 1 : 81920 }
ggttg MEK processed 8192 events across 16 channels { 1 : 8192 }
ggtt MEK processed 81920 events across 3 channels { 1 : 81920 }
ggtt MEK processed 8192 events across 3 channels { 1 : 8192 }
gqttq MEK processed 81920 events across 5 channels { 1 : 81920 }
gqttq MEK processed 8192 events across 5 channels { 1 : 8192 }
heftggbb MEK processed 81920 events across 4 channels { 1 : 81920 }
heftggbb MEK processed 8192 events across 4 channels { 1 : 8192 }
smeftggtttt MEK processed 81920 events across 72 channels { 1 : 81920 }
smeftggtttt MEK processed 8192 events across 72 channels { 1 : 8192 }
susyggt1t1 MEK processed 81920 events across 6 channels { 3 : 81920 }
susyggt1t1 MEK processed 8192 events across 6 channels { 3 : 8192 }
susyggtt MEK processed 81920 events across 3 channels { 1 : 81920 }
susyggtt MEK processed 8192 events across 3 channels { 1 : 8192 }
Revert "[amd] rerun 30 tmad tests on LUMI worker node (small-g 72h) - no change (heft fails madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 07c2a53.

Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) with the workaround for HIP FPEs madgraph5#1011 - now all tests succeed"
This reverts commit 0ec8c1c.
@valassi valassi changed the title WIP workaround for FPE in vxxxxx on HIP workaround for FPE in vxxxxx on HIP Oct 4, 2024
@valassi valassi marked this pull request as ready for review October 4, 2024 15:15

valassi commented Oct 4, 2024

Hi @oliviermattelaer, this is also ready for merging: it fixes some FPEs on HIP GPUs.

(And it includes #1014)

Can you approve it, please? Thanks, Andrea
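
(Background note, as a hedged aside: the reason an FPE makes these tests fail outright, rather than just producing NaNs, is that the test executables trap floating-point exceptions. The sketch below shows the generic trapping mechanism via the GNU feenableexcept extension; the exact flags and helper name used in the repo may differ.)

```cpp
// Minimal sketch of FPE trapping (assumption: this resembles, but is not
// quoted from, the repo's own setup). With trapping enabled, a division by
// zero or an invalid operation raises SIGFPE and aborts the test instead of
// silently propagating NaNs or infinities.
#include <fenv.h> // feenableexcept is a GNU (glibc) extension

void enableFpeTrapping()
{
#ifdef __GLIBC__
  feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW ); // trap these FPEs
#endif
}
```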


@oliviermattelaer oliviermattelaer left a comment


Perfect, thanks!

@valassi valassi changed the title workaround for FPE in vxxxxx on HIP workaround for FPE in vxxxxx on HIP (and fixes for v1.00.01 tags) Oct 4, 2024

valassi commented Oct 4, 2024

Very good, @oliviermattelaer, thanks!
Merging now.
Andrea
