Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3Packages.torchWithRocm: fix build #298609

Merged
merged 1 commit into from
Mar 30, 2024

Conversation

mschwaig
Copy link
Member

@mschwaig mschwaig commented Mar 24, 2024

Description of changes

torchWithRocm fails to build, since we merged ROCm 6.0 (#287846).
This PR together with #298451 should fix that.

I am still in the process of building this fix locally, to see if it works.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Mar 24, 2024
@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak March 24, 2024 12:10
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 11-100 labels Mar 24, 2024
@mschwaig mschwaig removed the 8.has: package (new) This PR adds a new package label Mar 24, 2024
@mschwaig
Copy link
Member Author

torchWithRocmstill fails to build with this and the PR linked above:

FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/detail/HIPHooks.cpp.o 
/nix/store/kvlhk0gpm2iz1asbw1xjac2ch0r8kyw9-gcc-wrapper-13.2.0/bin/g++ -DAT_PER_OPERATOR_HEADERS -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMAGMA_V2 ->
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
In file included from /build/source/aten/src/ATen/core/IListRef.h:631,
                 from /build/source/aten/src/ATen/DeviceGuard.h:3,
                 from /build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:6:
/build/source/aten/src/ATen/core/IListRef_inl.h: In static member function ‘static c10::detail::IListRefConstRef<at::OptionalTensorRef> c10::detail::IListRefTagImpl<c10::IListRefTag::Boxed,>
/build/source/aten/src/ATen/core/IListRef_inl.h:171:17: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  171 |     const auto& ivalue = (*it).get();
      |                 ^~~~~~
/build/source/aten/src/ATen/core/IListRef_inl.h:171:35: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu>
  171 |     const auto& ivalue = (*it).get();
      |                          ~~~~~~~~~^~
/build/source/aten/src/ATen/hip/detail/HIPHooks.cpp: In member function ‘virtual bool at::cuda::detail::CUDAHooks::isPinnedPtr(const void*) const’:
/build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:158:15: error: ‘hipPointerAttribute_t’ {aka ‘struct hipPointerAttribute_t’} has no member named ‘memoryType’
  158 |   return attr.memoryType == hipMemoryTypeHost;
      |               ^~~~~~~~~~

Maybe the correct fix for now is to switch to rocmPackages_5.

@mschwaig
Copy link
Member Author

mschwaig commented Mar 24, 2024

I tried switching back to rcomPackages_5, and that succeeds at building torch, but my runtime test of running nanoGPT with nix develop fails at runtime:

 (Triggered internally at /build/source/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
  return node.target(*args, **kwargs)
/tmp/nix-shell.XPsinF/tmpqculhp7v/main.c:2:10: fatal error: hip/hip_runtime.h: No such file or directory
    2 | #include <hip/hip_runtime.h>
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
/tmp/nix-shell.XPsinF/tmpdyfzmrqg/main.c:2:10: fatal error: hip/hip_runtime.h: No such file or directory
    2 | #include <hip/hip_runtime.h>
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.

I can fix this like so (not sure if it should be in torchWithRocm or downstream:

propagatedBuildInputs = [ pkgs.rocmPackages_5.hip-common ];

But, then it's complaining about not finding hip/hip_version.h, and that file is there on the master branch of https://github.com/rocm/hip, but not in the version tagged as 5.7.1 tag.

At the same time it does not seem like upstream torch supports ROCm 6.0 already.

@mschwaig
Copy link
Member Author

mschwaig commented Mar 24, 2024

The hip/hip_version.h file would normally be generated by a CMmakeLists.txt file, but that's not checked in in the release branches either.
I manually put together such a file to see how far that would go. The build for that is still running.

The branch for it is here: master...mschwaig:nixpkgs:fix-torch-with-rocm-2

@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 8400be6 to 45437c9 Compare March 25, 2024 22:26
@mschwaig
Copy link
Member Author

Actually I found out that those runtime errors involving hip-common were expected, since I only had nanoGPT working with the --compile=False flag.

So the right solution for now should be to go back to rocmPackages_5 and leave HIP alone.
This PR also has some commits to fix missing or erroneous deprecation warnings for ROCm 6.0.

I'm running nixpkgs-review for this now.

@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Mar 25, 2024
@mschwaig
Copy link
Member Author

mschwaig commented Mar 26, 2024

Result of nixpkgs-review pr 298609 run on x86_64-linux 1

3 packages marked as broken and skipped:
  • rocmPackages.mivisionx
  • rocmPackages.mivisionx-cpu
  • rocmPackages.mivisionx-hip
3 packages failed to build:
  • rocmPackages_5.mivisionx (rocmPackages_5.mivisionx-hip)
  • rocmPackages_5.mivisionx-cpu
  • rocmPackages_5.mivisionx-opencl
10 packages built:
  • python311Packages.torchWithRocm
  • python311Packages.torchWithRocm.cxxdev
  • python311Packages.torchWithRocm.dev
  • python311Packages.torchWithRocm.dist
  • python311Packages.torchWithRocm.lib
  • python312Packages.torchWithRocm
  • python312Packages.torchWithRocm.cxxdev
  • python312Packages.torchWithRocm.dev
  • python312Packages.torchWithRocm.dist
  • python312Packages.torchWithRocm.lib

rocmPackages_5.mivisionxneeds a fix for its version of libjpeg_turbo, which I plan to open another PR for.

@mschwaig mschwaig marked this pull request as ready for review March 26, 2024 17:37
@mschwaig mschwaig marked this pull request as draft March 26, 2024 17:47
@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 347c3be to 45437c9 Compare March 28, 2024 11:08
@mschwaig
Copy link
Member Author

Result of nixpkgs-review pr 298609 run on x86_64-linux 1

6 packages marked as broken and skipped:
  • rocmPackages.mivisionx
  • rocmPackages.mivisionx-cpu
  • rocmPackages.mivisionx-hip
  • rocmPackages_6.mivisionx
  • rocmPackages_6.mivisionx-cpu
  • rocmPackages_6.mivisionx-hip
3 packages failed to build:
  • rocmPackages_5.mivisionx (rocmPackages_5.mivisionx-hip)
  • rocmPackages_5.mivisionx-cpu
  • rocmPackages_5.mivisionx-opencl
10 packages built:
  • python311Packages.torchWithRocm
  • python311Packages.torchWithRocm.cxxdev
  • python311Packages.torchWithRocm.dev
  • python311Packages.torchWithRocm.dist
  • python311Packages.torchWithRocm.lib
  • python312Packages.torchWithRocm
  • python312Packages.torchWithRocm.cxxdev
  • python312Packages.torchWithRocm.dev
  • python312Packages.torchWithRocm.dist
  • python312Packages.torchWithRocm.lib

@mschwaig mschwaig marked this pull request as ready for review March 28, 2024 11:10
@mschwaig mschwaig requested a review from ulrikstrid March 28, 2024 13:52
@ulrikstrid
Copy link
Member

Can we split this into 2 PRs? One for pytorch and one for deprecated packages. I don't see why they should be in the same one.

If you do that we can merge both quickly 😊 or if you can give a compelling reason

@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 45437c9 to 1c92ae9 Compare March 30, 2024 11:51
Copy link
Member

@ulrikstrid ulrikstrid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@ulrikstrid ulrikstrid merged commit ae32f55 into NixOS:master Mar 30, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: python 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 11-100
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants