[Tracking] ROCm packages #197885

Open
25 of 34 tasks
Madouura opened this issue Oct 26, 2022 · 76 comments
Labels
5. scope: tracking Long-lived issue tracking long-term fixes or multiple sub-problems 6.topic: hardware 6.topic: rocm

Comments

@Madouura
Contributor

Madouura commented Oct 26, 2022

Tracking issue for ROCm derivations.

moar packages

Key

  • Package
    • Dependencies

WIP

Ready

TODO

Merged

ROCm-related

Notes

  • Update command: nix-shell maintainers/scripts/update.nix --argstr commit true --argstr keep-going true --arg predicate '(path: pkg: builtins.elem (pkg.pname or null) [ "rocm-llvm-llvm" "rocm-core" "rocm-cmake" "rocm-thunk" "rocm-smi" "rocm-device-libs" "rocm-runtime" "rocm-comgr" "rocminfo" "clang-ocl" "rdc" "rocm-docs-core" "hip-common" "hipcc" "clr" "hipify" "rocprofiler" "roctracer" "rocgdb" "rocdbgapi" "rocr-debug-agent" "rocprim" "rocsparse" "rocthrust" "rocrand" "rocfft" "rccl" "hipcub" "hipsparse" "hipfort" "hipfft" "tensile" "rocblas" "rocsolver" "rocwmma" "rocalution" "rocmlir" "hipsolver" "hipblas" "miopengemm" "composable_kernel" "half" "miopen" "migraphx" "rpp-hip" "mivisionx-hip" "hsa-amd-aqlprofile-bin" ])'

Won't implement

  • ROCmValidationSuite
    • Too many assumptions, not going to rewrite half the cmake files
  • rocm_bandwidth_test
    • Not really needed, will implement on request
  • atmi
    • Out-of-date
  • aomp
    • We basically already do this
  • Implement strictDeps for all derivations
    • Seems pointless now and I don't see many other derivations doing this
@Madouura
Contributor Author

Madouura commented Oct 30, 2022

Updating to 5.3.1, marking all WIP until pushed to their respective PRs and verified.

@Madouura
Contributor Author

Madouura commented Oct 30, 2022

If anyone is interested in helping me debug rocBLAS, here's the current derivation
Already fixed.

@Flakebi
Member

Flakebi commented Oct 31, 2022

Hi, thanks a lot for your work on ROCm packages!

So far, the updates were all aggregated in a single rocm: 5.a.b -> 5.x.y PR. I think that makes more sense than splitting the package updates into single PRs, for a couple of reasons:

  • Often, packages have backward- (and forward-) incompatible changes, e.g. the 5.3.0 version of rocm-runtime only works with 5.3.0 of rocm-comgr, but not with 5.2.0 or 5.4.0 (made-up example).
  • Nobody tests a mixture of versions, i.e. only all packages at the same version are known to work.
  • If I want to test hip, OpenCL and other things for an update, it’s easier to do it one time (and compile everything a single time), rather than 10 times.
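The lock-step expectation in the list above can be checked mechanically. A minimal sketch, assuming the versions have already been collected as `name version` pairs; `check_lockstep` is a hypothetical helper, not part of nixpkgs:

```shell
#!/bin/sh
# check_lockstep (hypothetical helper): reads "name version" pairs on stdin
# and succeeds only if every package reports the same version as the first.
check_lockstep() {
  awk '
    NR == 1    { want = $2 }
    $2 != want { printf "%s is at %s, expected %s\n", $1, $2, want; bad = 1 }
    END        { exit bad + 0 }
  '
}

# One possible way to feed it (assumption: each package exposes a `version`
# attribute that `nix eval --raw` can print):
#   for p in clr rocblas; do
#     printf '%s %s\n' "$p" "$(nix eval --raw "nixpkgs#rocmPackages.$p.version")"
#   done | check_lockstep
```

The function prints one line per package whose version differs from the first one and returns a non-zero status in that case.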

tl;dr, do you mind merging all your 5.3.1 updates into a single PR?

PS: Not sure how you did the update, I usually do it with for f in rocm-smi rocm-cmake rocm-thunk rocm-runtime rocm-opencl-runtime rocm-device-libs rocm-comgr rocclr rocminfo llvmPackages_rocm.llvm hip; nix-shell maintainers/scripts/update.nix --argstr commit true --argstr package $f; end.
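The loop in the PS is written in fish syntax (`for …; …; end`). A rough POSIX-shell equivalent, as a sketch: the package list is copied from the comment above, while the `DRY_RUN` switch and the `update_all` name are assumptions added here so the commands can be inspected before they are run from the root of a nixpkgs checkout.

```shell
#!/bin/sh
# POSIX-shell sketch of the fish loop above.
# DRY_RUN is a hypothetical switch added for illustration: while it is 1
# (the default here) the commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Package names copied from the comment above.
PACKAGES="rocm-smi rocm-cmake rocm-thunk rocm-runtime rocm-opencl-runtime
rocm-device-libs rocm-comgr rocclr rocminfo llvmPackages_rocm.llvm hip"

update_all() {
  for pkg in $PACKAGES; do
    if [ "$DRY_RUN" = 1 ]; then
      echo "nix-shell maintainers/scripts/update.nix --argstr commit true --argstr package $pkg"
    else
      # One update (and one commit) per package.
      nix-shell maintainers/scripts/update.nix \
        --argstr commit true --argstr package "$pkg"
    fi
  done
}

update_all
```

Running it with `DRY_RUN=0` executes the updates instead of printing them.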

@Madouura
Contributor Author

I was actually afraid of the opposite being true so I split them up.
Got it, I'll aggregate them.
Thanks for the tip on the update script, that would have saved me a lot of time.

@Madouura
Contributor Author

Madouura commented Oct 31, 2022

Hip I think should stay separate though, since there are other changes.
Actually never mind it's just an extra dependency so should be fine to split it.

@Madouura
Contributor Author

Madouura commented Dec 17, 2023

ROCm 6.0.0 has been released.
rocmPackages_5 is now in maintenance-mode.
I will eventually backport the changes I am making with rocmPackages_6 to rocmPackages_5, however it is not a high priority.

@kurnevsky
Member

kurnevsky commented Dec 17, 2023

By setting environment variable manually

Interesting - now pytorch works for me, but it doesn't seem to work correctly. I'm trying to generate an image from sdxl+lora with diffusers, and it generates an incorrect image...

I tried identical code and model with manually defined seeds in google colab with cuda - it works there. Also seems to work locally on cpu with f32 types.

(or it might be some problem in one of the libs, since locally I use all python libs from nix)

@sersorrel
Contributor

The export LD_LIBRARY_PATH=/nix/store/...-clr-5.7.1/lib solution fixed the same torchWithRocm problem for me, also with a 7900 XTX. I couldn't see how you got that path – it's returned by nix build --print-out-paths nixpkgs#rocmPackages.clr, right?
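A sketch of how that workaround could be scripted, assuming the store path really is the one printed by `nix build --print-out-paths` (the question above is not answered in this thread). `prepend_lib_dir` and `use_clr` are hypothetical helper names:

```shell
#!/bin/sh
# prepend_lib_dir (hypothetical helper): put <store path>/lib at the front of
# LD_LIBRARY_PATH, keeping any entries that are already there.
prepend_lib_dir() {
  LD_LIBRARY_PATH="$1/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  export LD_LIBRARY_PATH
}

# use_clr (hypothetical helper): resolve the clr output path with nix and
# prepend its lib directory. Assumption: the first printed path is the
# output that contains lib/.
use_clr() {
  clr_path="$(nix build --no-link --print-out-paths 'nixpkgs#rocmPackages.clr' | head -n 1)" || return 1
  [ -n "$clr_path" ] || return 1
  prepend_lib_dir "$clr_path"
}
```

Calling `use_clr` in the shell that later starts Python would then have the same effect as the manual `export`.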

@ScatteredRay
Contributor

Hey, giving this a try. Still very much WIP, but it's working so far for my current project.

@dwf
Contributor

dwf commented Mar 21, 2024

@Madouura First, thanks for all your work on this front.

You left a comment to the effect that rocBLASLt is "Very broken with Tensile at the moment, only supports GFX9". It looks like other platforms might be supported now, but I wondered if you might be able to elaborate with the "very broken with Tensile" part. I notice that they ship a vendored "Tensilelite", was that what you were trying to use?

Any pointers you have on how I might manage to build this would be useful. I'm currently eyeing the rocBLAS derivation as a potentially good starting point.

Edit: no longer a priority for me

@yshui
Contributor

yshui commented Apr 1, 2024

pytorch now fails to build after the 5 -> 6 transition, because it depends on miopengemm, which was removed.

@SomeoneSerge
Contributor

I edited the description to add an entry for rocblaslt. It's, apparently, a dependency for zluda

@errnoh errnoh mentioned this issue Apr 10, 2024
13 tasks
@samueldr samueldr added the 5. scope: tracking Long-lived issue tracking long-term fixes or multiple sub-problems label Apr 23, 2024
@jalil-salame
Contributor

Apparently pytorch now requires hipBLASLt:

python3.11-torch> CMake Error at cmake/public/LoadHIP.cmake:37 (find_package):
python3.11-torch>   By not providing "Findhipblaslt.cmake" in CMAKE_MODULE_PATH this project
python3.11-torch>   has asked CMake to find a package configuration file provided by
python3.11-torch>   "hipblaslt", but CMake did not find one.
python3.11-torch>   Could not find a package configuration file provided by "hipblaslt" with
python3.11-torch>   any of the following names:
python3.11-torch>     hipblasltConfig.cmake
python3.11-torch>     hipblaslt-config.cmake
python3.11-torch>   Add the installation prefix of "hipblaslt" to CMAKE_PREFIX_PATH or set
python3.11-torch>   "hipblaslt_DIR" to a directory containing one of the above files.  If
python3.11-torch>   "hipblaslt" provides a separate development package or SDK, be sure it has
python3.11-torch>   been installed.
python3.11-torch> Call Stack (most recent call first):
python3.11-torch>   cmake/public/LoadHIP.cmake:160 (find_package_and_print_version)
python3.11-torch>   cmake/Dependencies.cmake:1258 (include)
python3.11-torch>   CMakeLists.txt:754 (include)
python3.11-torch>
python3.11-torch> -- Configuring incomplete, errors occurred!

@ony
Contributor

ony commented Jun 16, 2024

As per pytorch/pytorch#119081 (comment) in 2.4.0+ (future release) it should be possible to use something like:

  pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
    (python-final: python-prev: {
      torch = python-prev.torch.overrideDerivation (oldAttrs: {
        TORCH_BLAS_PREFER_HIPBLASLT = 0;  # not yet in nixpkgs
      });
    })
  ];

@AngryLoki

@ony, TORCH_BLAS_PREFER_HIPBLASLT is a runtime environment variable; pytorch still links against and requires hipblaslt, even when it is unused. pytorch/pytorch#120551 should help, but I have no idea whether or when it could be accepted.

By the way, hipblaslt is not difficult to build. Just don't build the 6.0 release; skip directly to 6.1. When I tried, the bundled TensileLite in 6.0 generated a wall of unreadable errors, while 6.1 worked on the first attempt.
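Since the variable is read at run time, it would be set in the environment of the Python process rather than in the derivation. A minimal sketch; `run_without_hipblaslt` is a hypothetical wrapper name, and as noted above this does not remove the link-time dependency on hipblaslt:

```shell
#!/bin/sh
# run_without_hipblaslt (hypothetical wrapper): run any command with
# TORCH_BLAS_PREFER_HIPBLASLT=0 set only for that command's environment.
run_without_hipblaslt() {
  env TORCH_BLAS_PREFER_HIPBLASLT=0 "$@"
}

# Example (train.py is a placeholder script name):
#   run_without_hipblaslt python train.py
```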

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/testing-gpu-compute-on-amd-apu-nixos/47060/4

@SomeoneSerge SomeoneSerge mentioned this issue Oct 16, 2024
13 tasks
@DerDennisOP
Contributor

I'm not able to build rocmlir-rock-6.0.2 when trying to install zluda.

FAILED: mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o
/nix/store/16pvlpl13g06f1rqxp7z0il9i4l9mlww-rocm-llvm-clang-wrapper-6.0.2/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/build/source/build/mlir/lib/Dialect/Rock/Transforms -I/build/source/mlir/lib/Dialect/Rock/Transforms -I/build/source/external/llvm-project/llvm/include -I/build/source/build/external/llvm-project/llvm/include -I/build/source/external/llvm-project/mlir/include -I/build/source/build/external/llvm-project/llvm/tools/mlir/include -I/build/source/external/mlir-hal/mlir/include -I/build/source/build/external/mlir-hal/include -I/build/source/external/mlir-hal/include -I/build/source/mlir/include -I/build/source/build/mlir/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=gnu++17 -fPIC   -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -MF mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o.d -o mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -c /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp
In file included from /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp:14:
/build/source/mlir/include/mlir/Conversion/TosaToRock/TosaToRock.h:21:10: fatal error: 'mlir/Conversion/RocMLIRPasses.h.inc' file not found
#include "mlir/Conversion/RocMLIRPasses.h.inc"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Is there an easy fix for it?

@AngryLoki

@DerDennisOP, it was addressed in pull request ROCm/rocMLIR#1640 (issue ROCm/rocMLIR#1620); you may want to use it.

@ilylily

ilylily commented Oct 26, 2024

@DerDennisOP @AngryLoki i think you'll actually also need ROCm/rocMLIR#1542 (closes ROCm/rocMLIR#1500). similar patch in a nearby file

@arunoruto
Contributor

@DerDennisOP, it was addressed in pull request ROCm/rocMLIR#1640 (issue ROCm/rocMLIR#1620); you may want to use it.

@DerDennisOP @AngryLoki i think you'll actually also need ROCm/rocMLIR#1542 (closes ROCm/rocMLIR#1500). similar patch in a nearby file

Is there a plan to patch these things in upstream too? As far as I can see, the Hydra logs show the same error as @DerDennisOP reported.
Would a general update to a newer version resolve this problem? Maybe updating everything to 6.2.4 or even 6.3.0 would be feasible.

@mschwaig
Member

Right now I do not have the time to update ROCm, but I could help out as a reviewer.
I think updating to a newer version would be a great idea.

@arunoruto
Contributor

arunoruto commented Dec 11, 2024

Right now I do not have the time to update ROCm, but I could help out as a reviewer. I think updating to a newer version would be a great idea.

While I do not have that much experience with ROCm, I could try it.
I was also wondering: there are a lot of PRs from the autoupdater @r-ryantm, for example #358391. If nixpkgs-review succeeds on these PRs, would they be directly viable for a merge? Do we need to bring all the packages up to date at the same time? Are there more things to consider?

@bgamari
Contributor

bgamari commented Dec 12, 2024

Would a general update to a newer version resolve this problem? Maybe updating everything to 6.2.4 or even 6.3.0 would be feasible.

FWIW, I have a branch where I have tried updating things to 6.2.4. Unfortunately, I am seeing linking failures in the clang (specifically libcxx) bootstrap that I doubt I will have time to fix in the near-term.

I was also wondering, there are a lot of PRs from the autoupdater @r-ryantm,

I don't think any of those PRs are viable. Apart from the MRs themselves not building, the auto-updater generally doesn't seem to respect the fact that ROCm components expect to be upgraded in lock-step.

@bgamari
Contributor

bgamari commented Dec 12, 2024

FWIW, I have opened a draft MR to record the state of my attempt: #364423

@LunNova
Member

LunNova commented Dec 12, 2024

I have a mix of 6.3 and 6.2 working here with pytorch nightly but in no state to upstream. Might be helpful for someone with more time trying to fix it in nixpkgs.

https://github.com/LunNova/ml.nix
