add support for ROCRAND (via HIPRAND) #809
Conversation
… just do EXACTLY as for Curand, but replace Curand by Rocrand (add RocrandRandomNumberKernel.cc, modify check_sa.cc, RandomNumberKernels.h, mgOnGpuConfig.h)
…d HASROCRAND (for TAG), and remove CXXFLAGS based on that as there is no #ifdef HASMGONGPU_HAS_NO_CURAND in any source code file in src)
NB: this is based on PR #801; it is just a few additions on top of that, but I would keep them separate.
…CURAND and HASROCRAND variables, and try to rationalize the logic, making space for rocrand in parallel to curand
…dev and --rorhst options)
…er and CommonRandomNumberKernel objects to avoid an error
…be fixed in RandomNumberKernels.h instead
Revert "[rocrand] in gg_tt.mad cudacpp.mk, add RNDCXXFLAGS to runTest, fsampler and CommonRandomNumberKernel objects to avoid an error"
This reverts commit 3968e2e.
…n to bypass undefined rocrandGenerator_st error in runTest, fsampler and CommonRandomNumberKernel, and general cleanup to remove all unnecessary ifdefs (the class definition needs no typedef protections)

```
In file included from fsampler.cc:12:
RandomNumberKernels.h:189:5: error: 'rocrandGenerator_st' does not name a type; did you mean 'curandGenerator_st'?
  189 |     rocrandGenerator_st* m_rnGen;
      |     ^~~~~~~~~~~~~~~~~~~
      |     curandGenerator_st
In file included from CommonRandomNumberKernel.cc:9:
RandomNumberKernels.h:189:5: error: 'rocrandGenerator_st' does not name a type; did you mean 'curandGenerator_st'?
  189 |     rocrandGenerator_st* m_rnGen;
      |     ^~~~~~~~~~~~~~~~~~~
      |     curandGenerator_st
In file included from runTest.cc:18:
RandomNumberKernels.h:189:5: error: 'rocrandGenerator_st' does not name a type; did you mean 'curandGenerator_st'?
  189 |     rocrandGenerator_st* m_rnGen;
      |     ^~~~~~~~~~~~~~~~~~~
      |     curandGenerator_st
```
… in RocrandRandomNumberKernel.cc
…l.o to the relevant executables
…o gRocrandRandomNumberKernel.cu
…andomNumberKernel
…ator ordering (I could not find the syntax yet)
…cc, ensure __HIP_PLATFORM_NVIDIA__ is not defined and __HIP_PLATFORM_AMD__ is defined
… GPUs

The status at this point is that
- the code builds
- gcheck.exe runs in rocranddevice, but produces nan MEs (1)
- gcheck.exe and check.exe fail in rocrandhost (2)

(1) nan MEs from rocranddevice

```
Workflow summary            = HIP:DBL+CXS:RORDEV+RMBDEV+MESDEV/none+NAVBRK
FP precision                = DOUBLE (NaN/abnormal=8, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Random number generation    = ROCRND DEVICE (HIP code)
-------------------------------------------------------------------------------
HelicityComb Good/Tot       = 16/16
-------------------------------------------------------------------------------
NumberOfEntries             = 1
TotalTime[Rnd+Rmb+ME] (123) = ( 1.329026e-02 ) sec
TotalTime[Rambo+ME]    (23) = ( 5.200556e-03 ) sec
TotalTime[RndNumGen]    (1) = ( 8.089702e-03 ) sec
TotalTime[Rambo]        (2) = ( 4.494608e-03 ) sec
TotalTime[MatrixElems]  (3) = ( 7.059480e-04 ) sec
MeanTimeInMatrixElems       = ( 7.059480e-04 ) sec
[Min,Max]TimeInMatrixElems  = [ 7.059480e-04 , 7.059480e-04 ] sec
TotalTime[MECalcOnly]  (3a) = ( 6.957380e-04 ) sec
MeanTimeInMECalcOnly        = ( 6.957380e-04 ) sec
[Min,Max]TimeInMECalcOnly   = [ 6.957380e-04 , 6.957380e-04 ] sec
-------------------------------------------------------------------------------
TotalEventsComputed         = 8
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.019447e+02 ) sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.538297e+03 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.133228e+04 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.149858e+04 ) sec^-1
*******************************************************************************
NumMatrixElems(notAbnormal) = 0
MeanMatrixElemValue         = ( -nan +- -nan ) GeV^0
[Min,Max]MatrixElemValue    = [ 1.797693e+308 , -1.797693e+308 ] GeV^0
StdDevMatrixElemValue       = ( -nan ) GeV^0
MeanWeight                  = ( -nan +- -nan )
[Min,Max]Weight             = [ 1.797693e+308 , -1.797693e+308 ]
StdDevWeight                = ( -nan )
```

(2) failures from rocrandhost

```
HiprandAssert: gRocrandRandomNumberKernel.cu:95 code=1000
gcheck.exe: gRocrandRandomNumberKernel.cu:25: void assertHiprand(hiprandStatus_t, const char *, int, bool): Assertion `code == HIPRAND_STATUS_SUCCESS' failed.
Aborted
```
The status at this point is that
This is fixed
This I have now understood as being simply not supported yet. I checked and I always get a status 1000, not implemented. This is not surprising, see https://github.com/ROCm/hipRAND/blob/be4f5a97d55b6a38733a7a66970d13aef6c8944f/library/src/amd_detail/hiprand.cpp#L103
Also see issue 76 in https://github.com/ROCm/hipRAND/issues. This is actively being worked on in Dec 2023. I would do the following:
In any case none of this is relevant yet to Madgraph production (we are still at Fortran ranmar). We can wait for the hiprand host api...
…so on cuda gcheck.exe when using curand

On LUMI HIP, this also fixes the nans in gcheck.exe using --rordev. However, the error in gcheck.exe using --rorhst does remain. The same error for RocrandHost happens in check.exe using the default --rorhst.
…rand host generators are not supported yet
…rand ordering is not implemented yet
…annd in all variables inside all files
…stead of rocrand host)
…ack to itscrd90 test logs
`git checkout 495d4cf tput/logs_* tmad/`
…and test logs from master to avoid too many conflicts
`git checkout upstream/master $(git ls-tree --name-only HEAD *.mad *.sa)`
`git checkout upstream/master tput/logs_* tmad`
…5#368 adding _cu.o and removing gXXX.cu links) into rocrand
Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/cudacpp.mk
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/model_handling.py
The conflicts were in the interplay of adding Hiprand and adding _cu.o (so for instance gHipRandRandomNumberKernel.o becomes HipRandRandomNumberKernel_cu.o)
…ng gXXX.cu removal PR madgraph5#368
….cu symlinks disappear)
Hi @oliviermattelaer, this is a PR to add support for Hiprand on AMD GPUs, as suggested by @roiser (thanks for the suggestion). This is now ready for review: can you please review and approve, @oliviermattelaer? I am still running a full battery of tests, but in principle on some small-scale tests everything looks good. Note, as mentioned above:
I would go ahead and merge this rather than waiting for the full support, which can come later. The biggest part of the work is done (and involves quite a few code changes to make curand-specific stuff a bit more general). Enabling hiprand host and adding hiprand ordering when available will be a matter of 10 lines of code, I think. Thanks
```
STARTED AT Mon Feb 5 05:44:10 PM CET 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Feb 5 09:22:41 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Feb 5 09:51:14 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Mon Feb 5 10:01:13 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Mon Feb 5 10:04:31 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Mon Feb 5 10:07:46 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Mon Feb 5 10:11:07 PM CET 2024 [Status=0]
```
```
STARTED AT Mon Feb 5 10:11:07 PM CET 2024
ENDED AT Tue Feb 6 02:29:23 AM CET 2024
Status=0
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
```
Since I cannot test on HIP, I have just read the code change and do not spot anything relevant to comment on (apart from things not related to this PR).
In my opinion this can go through, @roiser do you agree? (I would wait for his confirmation)
…ual issue in gqttq)

(1) Rebuild 78 tput tests for HIP on LUMI login nodes (21 hours)

```
STARTED AT Mon 05 Feb 2024 06:44:18 PM EET
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -makeonly
ENDED(1) AT Tue 06 Feb 2024 02:39:24 PM EET [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -makeonly
ENDED(2) AT Tue 06 Feb 2024 02:57:17 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -makeonly
ENDED(3) AT Tue 06 Feb 2024 03:10:00 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -makeonly
ENDED(4) AT Tue 06 Feb 2024 03:12:13 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst -makeonly
ENDED(5) AT Tue 06 Feb 2024 03:14:27 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -makeonly
ENDED(6) AT Tue 06 Feb 2024 03:16:42 PM EET [Status=0]
```

(2) Run 72 tput tests on LUMI worker node

```
STARTED AT Thu 08 Feb 2024 05:40:47 PM EET
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu 08 Feb 2024 06:38:25 PM EET [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu 08 Feb 2024 06:55:45 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu 08 Feb 2024 07:14:46 PM EET [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu 08 Feb 2024 07:18:31 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common '
ENDED(5) AT Thu 08 Feb 2024 07:18:31 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Thu 08 Feb 2024 07:22:13 PM EET [Status=0]
```

```
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:ERROR! Fortran calculation (F77/CUDA) crashed
```
(1) all tmad tests but ggttggg

```
STARTED AT Thu 08 Feb 2024 07:26:30 PM EET
ENDED AT Thu 08 Feb 2024 07:51:12 PM EET
```

(2) ggttggg tmad tests

```
STARTED AT Thu 08 Feb 2024 07:26:33 PM EET
ENDED AT Thu 08 Feb 2024 10:02:37 PM EET
Status=0
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpu2/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
```
…am/master' (including fbridgesequence_nomultichannel PR madgraph5#796) into rocrand
Checked that ggtt.mad can be regenerated correctly and that its tput/tmad tests succeed
I have merged the latest master (including merged PR #796) into this, and checked that all is ok. @oliviermattelaer has already approved (thanks!). @roiser, do you have any feedback and/or can this be merged? Thanks
@roiser just a ping so that you can approve this and we can move forward on it too
Quick code inspection: this also looks fine. Again, I didn't look at the makefile changes, but I am sure they are ok. A really minor comment (and I propose not to revert this): there is no need to bump up the copyright year numbers, see e.g. the blog post https://matija.suklje.name/how-and-why-to-properly-write-copyright-statements-in-your-code (I got the link from the CERN open source policy office, OSPO)
thanks @roiser for the approval, I will merge now
Interesting point... I was actually going to propose that we update all copyright statements to be 2020-2024 in the code; now I am a bit less sure after reading your blog post. I must say I still have a preference for having all copyright statements updated to 2020-2024: I mean, that post suggests you do not need to update the years, but it is not saying that you should absolutely not do that... I find it visually clearer on the one hand, and on the other hand it IS true that we are continually updating the code... and in any case, yes, 50-70 years are a long time, so it makes a little difference... I think I would especially like a self-consistent practice throughout the code: for instance, we put 2020-2024 everywhere, or we put 2020 (or the year of first publication if >2020) everywhere... Thoughts? Anyway, I am merging this
PS I moved the copyright discussion to issue #817
… previous ones) into patchmad_nofileeeeeeeeeeeeeeeeeeeee
Quick update. The following issues were updated in https://github.com/ROCm/hipRAND/issues
On LUMI the default is still ROCm 6.0. So this still needs to wait a bit before testing.
As suggested by @roiser in PR #801, this is a WIP PR for adding hiprand (I think in the end it will be rocrand actually) support.
Very much WIP: more than the implementation itself, there was a lot of logic in env variables and switches to sort out, so I focused on that first.