
Migrate for CUDA 12.x (x to be determined) #6630

Closed
h-vetinari opened this issue Oct 31, 2024 · 21 comments · Fixed by #6720

Comments

@h-vetinari
Member

h-vetinari commented Oct 31, 2024

This has been discussed several times in various places already; for example, @jakirkham proposed adding 12.4 in #5613 a while ago. More and more feedstocks are starting to depend on features (or bug fixes) from CUDA >12.0, so IMO we should pick a version x>=4 and migrate. Examples I'm aware of (+ some searching):

@hmaarrfk
Contributor

I've seen that some packages are named somewhat differently from 12.0 to 12.6 to help the transition.

As I understand it, building with CUDA 12.6 would allow running with 12.0 if all the package names match, correct?

@h-vetinari
Member Author

As I understand it, building with CUDA 12.6 would allow running with 12.0 if all the package names match, correct?

If you mean the CUDA version of the end user, then IIUC we need {runtime}>={buildtime}, so packages built with 12.6 would only be available for CUDA >=12.6. I hope that someone from @conda-forge/cuda will correct me if I'm saying something wrong here.

I've seen that some packages are named somewhat differently from 12.0 to 12.6 to help the transition.

Yeah, there was some reshuffling of various infra bits throughout the rollout, which is why I wanted to make this a migration and not just add that version to the pinning (i.e. why I was against #5613).

@hmaarrfk
Contributor

so packages built with 12.6 would only be available for CUDA >=12.6

is cupy special for some reason?
https://github.com/conda-forge/cupy-feedstock/blob/main/recipe/meta.yaml#L66

@h-vetinari
Member Author

is cupy special for some reason?

If you look at the linked issue and its PR, this was added at a time when there was only 12.0 being built on the feedstock. I believe it is a bug in the metadata. AFAIU, using a build against 12.5 on a system with CUDA 12.0 is likely (though not guaranteed) to fail.

CC @conda-forge/cupy

@bdice
Contributor

bdice commented Oct 31, 2024

There is “CUDA Enhanced Compatibility”, which can allow building with 12.y and running with 12.x (x < y), if the library and conda packaging support it.

We wrote up a guide on this in the CUDA conda docs, with lots of links and sample code snippets: https://github.com/conda-forge/cuda-feedstock/blob/main/recipe/doc/recipe_guide.md#cuda-enhanced-compatibility

By default, we do not know if a library is designed to work with earlier CUDA versions (based on what symbols / APIs it uses), so cuda-nvcc_{{ target_platform }} issues a strong run-export. https://github.com/conda-forge/cuda-nvcc-impl-feedstock/blob/661e76382d5529999619e8ff3cf7668714defb48/recipe/meta.yaml#L79

That can then be ignored if the maintainer knows their library’s compatibility is wider. This is how RAPIDS is packaged: we build with CUDA 12.5 (and CUDA 11.8) and allow runtimes of CUDA 12.0-12.5 (and CUDA 11.4-11.8). CuPy is the same (it ignores the compiler’s run exports). https://github.com/conda-forge/cupy-feedstock/blob/564d47cb293a8584d86294b9285ecbec2bffabdb/recipe/meta.yaml#L120

tl;dr migrating to building with CUDA 12.x would not necessarily break compatibility for all packages with earlier CUDA runtimes. By default, it would, for packages that do not have these run-exports ignored and replaced by their true set of compatible versions. But we can help make package maintainers aware of how to opt out of the stricter run exports and leverage CUDA Enhanced Compatibility where needed.
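For concreteness, here is a minimal sketch of that opt-out pattern (the version bounds are illustrative, not a definitive recipe; the real-world instances are the CuPy and RAPIDS recipes linked above):

```yaml
# meta.yaml (sketch): build with CUDA 12.6 but declare compatibility back to 12.0
build:
  ignore_run_exports_from:
    - {{ compiler("cuda") }}    # drop the strong cuda-version >=12.6 run-export

requirements:
  build:
    - {{ compiler("cuda") }}
  run:
    # the package's *true* compatible range, asserted by the maintainer
    # after checking that no post-12.0 symbols/APIs are used
    - cuda-version >=12.0,<13.0a0
```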

@vyasr

vyasr commented Oct 31, 2024

Concretely, @h-vetinari is correct that the CUDA runtime provides sufficient ABI stability guarantees that if you build with 12.x and run on 12.y for x < y, that will work. The converse is conditionally true: if you build with 12.y and run on 12.x (again x < y), that will work as long as you do not use any newer symbols. cudart symbols are stable across minor versions, but newer minor versions may introduce new symbols, so as @bdice mentions it is incumbent on the library to ensure that no newer symbols are used. CUDA libraries (e.g. cuBLAS) should provide similar guarantees to the runtime. Driver compatibility is separate: the driver promises binary compatibility across major releases as well as minor releases, again with the caveat that callers are responsible for putting in suitable guards to avoid using features only available in newer drivers when running on an older one.

@h-vetinari
Member Author

h-vetinari commented Oct 31, 2024

The converse is conditionally true: if you build with 12.y and run on 12.x (again x < y), that will work as long as you do not use any newer symbols.

Thanks for confirming; this is what I had in mind (we're in the same situation with other ABI-stable libraries, which may still add new symbols in otherwise fully backward-compatible releases).

Since I'm not aware whether or how the use of newer symbols is actually controlled during CUDA builds, my assumption was that the use of newer symbols may happen (relatedly, I believe it comes into play when compiling for new GPU arches that older CUDA versions don't support), and therefore that allowing "older versions than at build time" is fundamentally a risky proposition.

I'd much rather we have separate builds for 11.8, 12.0 and 12.x, with respective constraints on CUDA like >=11.8,<12, >=12.0,<13, >=12.x,<13. That way we eliminate the above risks, while still offering all users a choice to install (well, except those <11.8, but we dropped support for those in conda-forge a while ago already).
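For illustration, the build-matrix side of this could look roughly like the following in the global pinning (a sketch only, assuming 12.6 as the eventual 12.x and the existing cuda_compiler_version key):

```yaml
# conda_build_config.yaml (global pinning, sketch): three parallel CUDA lanes
cuda_compiler_version:
  - 11.8   # run constraint roughly >=11.8,<12
  - 12.0   # run constraint roughly >=12.0,<13
  - 12.6   # run constraint roughly >=12.6,<13 (via the compiler's run-export)
```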

@vyasr

vyasr commented Oct 31, 2024

I think that typically makes sense. The migration doesn't prevent packages from choosing to opt in to building a library that is compatible across CUDA minor versions, nor does it prevent packages from setting up their runtime constraints so as to allow users to install those libraries on all compatible CUDA versions, right? Assuming both of those are true, I would be fine with your plan. I would still want to preserve each library's ability to choose to build for greater compatibility, but I assume that can be done by overriding the CBC and .ci_support files on a per-feedstock basis, regardless of what is pinned here. AFAICT the only piece of CUDA that is actually pinned here is the compiler.
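For reference, such a per-feedstock override would be roughly this (a sketch; 12.5 is illustrative, matching what RAPIDS currently builds with):

```yaml
# recipe/conda_build_config.yaml in a single feedstock (sketch):
# override the global CUDA pinning for this package only
cuda_compiler_version:
  - 12.5
```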

@h-vetinari
Member Author

Yeah, pretty much everything can be overridden. The default should be safe, but if you know what you're doing, you can turn it off (e.g. ignore certain run-exports and set up your own constraints).

@hmaarrfk
Contributor

Ubuntu 22.04 only reports __cuda=12.4=0 with nvidia-driver-550 installed. As such, I think 12.6 might be "underutilized" unless we build 12.4 as well.

I think Ubuntu 20.04 is stuck on nvidia driver 535.

With the new info (thanks for the thoroughness), it might be good to have:

  • 12.0
  • 12.4
  • 12.latest????

@isuruf
Member

isuruf commented Oct 31, 2024

With the new info (thanks for the thoroughness), it might be good to have:

That's not needed.

If you build with cuda-nvcc=12.6, you get a run_export of cuda-version>=12.6 and all the conda-packaged libraries. However, the driver version you need is __cuda>=12.0. It's weird that __cuda does not report the driver version, but that's what we have.
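To illustrate the resulting metadata (a sketch; the exact pins live in the cuda-nvcc and cuda-version recipes):

```yaml
# Effective metadata of a package built with cuda-nvcc=12.6 (sketch)
requirements:
  run:
    - cuda-version >=12.6    # strong run_export from cuda-nvcc

# cuda-version itself only constrains the *driver* at the major level:
# run_constrained:
#   - __cuda >=12            # any CUDA 12 driver satisfies this
```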

@hmaarrfk
Contributor

So given that all the CUDA stuff can be installed via conda, we can simply have 12.0 and 12.max? And users on Ubuntu 22.04 will be fine?

@isuruf
Member

isuruf commented Nov 1, 2024

then we can simply have 12.0 and 12.max?

I don't see a reason to have both 12.0 and 12.max. Just 12.max will be fine.

and users on Ubuntu 22.04 will be fine?

yes

@kmittman

kmittman commented Nov 1, 2024

I think Ubuntu 20.04 is stuck on nvidia driver 535.

For Ubuntu 20.04, newer NVIDIA driver versions (currently up to 565 branch) are available in the CUDA repository.

@h-vetinari
Member Author

If you build with cuda-nvcc=12.6, you get a run_export of cuda-version>=12.6 and all the conda-packaged libraries. However, the driver version you need is __cuda>=12.0. It's weird that __cuda does not report the driver version, but that's what we have.

That's an aspect of the whole thing that does make things confusing for sure. We do now have packages installable for newer CUDA pretty much everywhere, but there is a relationship to the driver, and I don't think we're capturing that information explicitly anywhere (it's probably implicit in the CUDA version?).

It's actually more complicated still, because NVIDIA also offers a forward-looking cuda-compat where you can run applications requiring newer CUDA on older drivers. AFAIU, we're packaging this on https://github.com/conda-forge/cuda-compat-feedstock? It doesn't seem used in many places though, and in particular cuda-version doesn't depend on it.

I certainly don't pretend to understand the full set of constraints and relationships here...

I don't see a reason to have both 12.0 and 12.max. Just 12.max will be fine.

In any case, if it works to only build for 12.max, all the better!

@bdice
Contributor

bdice commented Nov 1, 2024

I can summarize the "relationship to the driver" portion. It's not a significant constraint here. For CUDA 12, driver 525 was the first CUDA 12 driver released. So essentially all this means is "to run any CUDA 12.x, you need a driver that was released with 12.0 or later."

cuda-compat is a bit trickier. It is probably not relevant for this discussion. It is Linux-only and has some hardware requirements (FAQ). It's better for us to just say that CUDA 12.x requires a CUDA 12(.0 or newer)-capable driver.

I agree it will probably suffice to build only 12.max. Some packages may want to relax their constraints to take advantage of CUDA Enhanced Compatibility, so it could be helpful to link to the docs for that (https://github.com/conda-forge/cuda-feedstock/blob/main/recipe/doc/recipe_guide.md#cuda-enhanced-compatibility) in the migration PR.

@carterbox
Member

I don't think a migration is necessary.

Some of these feedstocks are just pinning the cuda-version at runtime unnecessarily, when they should just be trusting the strong run_exports from the cuda compiler package. For example,

https://github.com/conda-forge/nnpops-feedstock/blob/8b28c88525f6a45e3e5f040f39307b06c5dadded/recipe/meta.yaml#L65-L66

This pinning in nnpops is incorrect. If they want the cuda-version at test time to match the build environment, they should add the cuda-version package to the test/requires section instead of constraining all of their end users to use CTK 12.0 libraries.
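Roughly what that fix would look like (a sketch against standard conda-build sections, not the exact nnpops recipe):

```yaml
# meta.yaml (sketch): keep cuda-version out of run requirements
test:
  requires:
    - cuda-version 12.0.*   # match the build environment at test time only

# requirements/run: no explicit cuda-version pin needed here;
# the strong run_exports from {{ compiler("cuda") }} already cover it
```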

https://github.com/conda-forge/petsc-feedstock/blob/cb83722cf53833e553d4bc372ef5686372875561/recipe/meta.yaml#L122

petsc also probably has redundant pinnings. These pinnings are very similar to, and maybe the same as, what is already exported by the compiler package.

I think the only reason a package should need to build with cuda-version >12.0 is that there is a feature in the CUDA compiler that the library needs, but that's a library-specific issue, not a channel-wide one.

If there is a bug fix in a CTK component library, then that library can be pinned separately from the cuda-version.

@h-vetinari
Member Author

@carterbox: I think the only reason a package should need to build with cuda-version >12.0 is that there is a feature in the CUDA compiler that the library needs, but that's a library-specific issue, not a channel-wide one.

It does become a wider issue if more and more packages require the newer compiler (whether due to features or bugs, e.g. with C++20 as mentioned in the pytorch issue), in the sense that we should have an easy solution for feedstocks to use a CUDA >12.0, rather than playing around with trying to override things correctly in a local CBC.

@bdice: I agree it will probably suffice to build only 12.max.

That's great! Thanks for the confirmation!

@carterbox: I don't think a migration is necessary.

If only due to the build changes between 12.0, 12.1, 12.2 on the side of the conda-forge infrastructure (some reshuffles happened in the beginning, e.g. in which package certain binaries live), I think it would be a good idea.

The migration could add 12.6 now, and once we're ready to close it, we could drop 12.0 and just keep 11.8 & 12.6.

@hmaarrfk
Contributor

hmaarrfk commented Nov 3, 2024

The migration could add 12.6 now, and once we're ready to close it, we could drop 12.0 and just keep 11.8 & 12.6.

The thing is that for many packages, CUDA builds are just flaky, long, and hard to get right (see the pytorch work), and having 11.8 + 12.0 + 12.6 might just be more churn than having the migration do a full swap of 12.0 for 12.6.

If the "migration" could "drop 12.0 and add 12.6" that would be great since we don't need to troubleshoot a "little value 12.0 build".

@carterbox
Member

I've changed my opinion. As we discussed in the meeting, there are bug fixes or changes in newer compilers that the channel could benefit from as a whole. I think it would be a good idea to migrate the default CUDA pinning from 12.0 to 12.6, then continue to migrate it whenever there's a new minor release.

isuruf mentioned this issue Nov 19, 2024
@isuruf
Member

isuruf commented Nov 19, 2024

Done in #6720

weiji14 added a commit to regro-cf-autotick-bot/deepspeed-feedstock that referenced this issue Nov 26, 2024