cudaPackages_{10*,11*}: warn about upcoming removal #342112
Conversation
Force-pushed from f765876 to 390a8c1 (compare)
pkgs/top-level/cuda-packages.nix (Outdated)

@@ -123,4 +123,6 @@ let
   fixedPoints.extends composedExtension passthruFunction
 );
 in
-cudaPackages
+lib.warnIf (lib.versionOlder cudaVersion "12.0" && config.allowAliases)
Suggested change:

-lib.warnIf (lib.versionOlder cudaVersion "12.0" && config.allowAliases)
+lib.warnIf (lib.versionOlder cudaVersion "11.4" && config.allowAliases)

… is safe
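For context, `lib.warnIf` takes a condition, a message, and a value to return, emitting the warning only when the condition holds. A hedged reconstruction of the complete expression the diff adds (the message text is an assumption, not quoted from the commit):

```nix
# Sketch only: the condition and guard are from the diff above; the
# warning message here is assumed, not the PR's actual wording.
lib.warnIf (lib.versionOlder cudaVersion "12.0" && config.allowAliases)
  "cudaPackages below 12.0 are deprecated and will be removed in Nixpkgs 25.05"
  cudaPackages
```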
In the PR/commit message I gave reasons for choosing 12.0 as a baseline rather than 11.4. By the time of 25.05’s release, 11.x will all depend on GCC versions that are unsupported upstream, which could result in old GCC versions sticking around solely for obsolete versions of CUDA.
Picking the policy listed in the release notes currently in this PR also helps us establish expectations for whether we will be obligated in future to keep an indefinitely‐growing list of versions of CUDA that haven’t been updated in years and only work with compilers that haven’t been updated in years and don’t have upstream support.
I think it’s better to set those expectations now, unless there’s a strong reason not to do it like this. My survey of the affected packages suggested that there’s only one package in Nixpkgs that is currently pinned to CUDA 11.x and that may not support 12.x (but I don’t know whether it does or not).
Of course we should probably get TensorFlow off the unnecessary 11.x pin before actually merging this to avoid annoying users. I’m not sure I’m the best person for that job, though.
I see. I must say I couldn't find any explicit information about gcc11's projected end of life, but I agree we don't need to support it any longer than upstream does.
Regarding updates for cuda11, maybe it doesn't need any (EDIT: well, except for supporting modern toolchains 🤡): the stack is somewhat rigid and stable, slow to evolve, the newer releases tend to just add support for new devices/features, and most applications (HPC) aren't security-critical...
Regarding users of cuda11, I'm afraid we (I) were rather hasty about discarding e.g. 1<=torch<2, which is used by a substantial amount of research code, and I don't think one could easily build it against cuda12.
Regarding tensorflow... nobody wants to deal with bazel
In either case, thanks for picking this up
I’d hope that it’s as simple as just giving it a higher version
With a compatible cuDNN version in place (8.9?), it's probably worth a try.
Okay, I’ll try giving it a shot soon. One thing confuses me: we list cuDNN 8.9 as supporting CUDA 12.2 at most, but TensorFlow pairs it with CUDA 12.3. What’s going on there? (And are those versions minimum requirements or precise pinned versions? TensorFlow only pins the major version right now.)
Consider using our handy `cudaPackages.cudaOlder` instead of the calls to `lib` :)
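A minimal sketch of what that suggestion might look like, assuming `cudaOlder` (a partial application of `lib.versionOlder` to `cudaVersion`) is reachable from the package set at this point, and with an assumed message string:

```nix
# Hypothetical rewrite: use the package set's own version helper
# instead of calling lib.versionOlder with cudaVersion explicitly.
lib.warnIf (cudaPackages.cudaOlder "12.0" && config.allowAliases)
  "cudaPackages below 12.0 are deprecated and will be removed in Nixpkgs 25.05"
  cudaPackages
```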
Let's proofread, I guess.
> we list cuDNN 8.9 as supporting CUDA 12.2 at most
Seems to check out:
https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-897/support-matrix/index.html
Looks like the guard to make sure this doesn’t break eval doesn’t work for ofborg. Looking into it.
(I still don’t know why ofborg-eval is unhappy, even after digging into the ofborg source code and running …)
Now that #334996 is merged, I’ll take a look at the TensorFlow source build and maybe a couple of the other stragglers over the next few days. @SomeoneSerge Do you consider the evaluation warning here vital, or is the release note enough in your view? Unfortunately, trying to dig into why this is breaking CI led to ofborg code nobody really understands, and it remains unclear to me why it’s happening even with the …
I'm inclined to push for the warnings. I'm confident consumers notice warnings in a timely manner, less so about release notes. Maybe it would be good to transition to a protocol where release notes are an expected notice, and the only expected notice, but we'd need to be more verbal about that (maybe mention this in the warnings when deprecating the next few features, jaja)
Oh actually, isn't it that we have to remove all the in-tree consumers of the aliased packages? I think it's OK to have package-set attributes wrapped in …
Please don't burn out
So, the reason for the weird …: the problem is that apparently the assumption isn’t true, or something else really weird is happening? But it’s going to take some diving to figure out. The alternative would be to deal with all the reverse dependencies first, but that’s not a great idea, since e.g. we still need DCGM to carry the dependencies for the next release cycle to maintain support.
Oh don’t worry, you’re stuck with me, and I’ll never stop removing things 😆 Though I’ll probably just look at TensorFlow (and maybe TensorRT since it seems important?) and leave the rest to others once the deadline is set, since I’m really not qualified to poke at this stuff and don’t use it myself.
Force-pushed from 390a8c1 to 58eadd1 (compare)
New, grosser warning hack just dropped; let’s see if ofborg is happy now.

I didn’t realize the TensorFlow source build is out of date, so I’ve started poking at it but am indeed in hell right now. I think I can probably get it working, but am I right to assume that all the serious CUDA users will be using the binary TensorFlow package anyway, and so it doesn’t hugely matter?

I also learned that Caffe is currently stuck on CUDA 10. BVLC/caffe#7082 implies that it might work to just bump the CUDA version. If not, though, I’m not sure there’s anything we can really do, since it’s been unmaintained for so many years. The package is in a pretty bad state already.
ofborg also evaluates unfree packages, so this is probably back to the drawing board.

The idea of only shipping one GCC version, like many other distros, was floated in the Staging room on Matrix, and I had to shut it down on the grounds of CUDA support. I’m curious, though – given that Nvidia don’t seem to be too behind on GCC support in the latest minor updates, what’s the motivation for using a specific minor version of CUDA that leads us to decide to ship all of them? Is it just a matter of which versions have been verified to work with software? But TensorFlow supports only a specific minor version, and yet #334996 just pinned it to the major.

If we only shipped one minor version of a given CUDA major release, then we could feasibly ship only one to three GCC versions (default GCC, latest GCC, GCC required by the latest CUDA – some of which may be the same version), which would reduce the burden on GCC maintainers further, but I don’t know enough about CUDA to tell whether this is a completely insane proposal or not :)
Conditioning on … It looks like the pkg-config tests re-import nixpkgs, but without preserving critical options (see pkgs/top-level/pkg-config/tests.nix, lines 6 to 17, at eb28b94).
Use …
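The elided suggestion presumably amounts to forwarding the outer options when the tests re-import Nixpkgs. A minimal sketch, assuming the relative path from the snippet above and that the test file receives the outer `pkgs` (the argument names are assumptions):

```nix
# Hedged sketch: re-import Nixpkgs for the tests while preserving the
# caller's settings, so options such as config.allowAliases keep the
# values ofborg (or any other caller) supplied.
{ pkgs }:
import ../../.. {
  inherit (pkgs) system config overlays; # previously dropped, per the report
}
```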
From my understanding, the range of CUDA packages made available has been driven by what people want to package and the hardware they want to use. From CUDA 11.4, NVIDIA has provided JSON manifests and made packages available as individual tarballs, which allowed me to automate a great deal of packaging.

I think it's unrealistic to offer only a single minor version of CUDA in Nixpkgs, given popular consumers (e.g., OpenCV, PyTorch, TensorFlow, JAX, etc.) each specify their own range of supported versions for a release. Additionally, packages can break in spectacular or devious ways when using versions outside their support matrix (numerical accuracy regressions making me realize I should be glad to get runtime failures).

I feel it's also important to discuss extensibility; if we don't maintain a range of CUDA versions in-tree, how easy is it for downstream users to add it themselves? Currently, I imagine that would look like re-implementing most of the CUDA packaging we have in-tree. One of the reasons I moved to using the module system to create the CUDA package sets was to expose an interface to allow downstream users to supply their own manifests, effectively extending whatever we currently support. This is still something that is very much in progress, and I'd feel much more comfortable deprecating the majority of the CUDA package sets we make available if it were ready right now.
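To make the “each consumer specifies its own range” point concrete, consumer-side version selection today looks roughly like this (the chosen package and minor version are illustrative, not a recommendation):

```nix
# A consumer pinning the CUDA minor its support matrix calls for,
# rather than whatever the default cudaPackages alias points at.
let
  pkgs = import <nixpkgs> {
    config.allowUnfree = true;
    config.cudaSupport = true;
  };
in
pkgs.python3Packages.torch.override {
  cudaPackages = pkgs.cudaPackages_12_1; # hypothetical pin
}
```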
Will rebase this to fix the eval issues soon; thanks @lilyinstarlight!
Thanks, that makes a lot of sense. It seems like the best compromise here is to stick with the proposed policy of “we’ll ship compiler versions that upstream supports and that are required for something important”, then, so that we can continue to ship a range of CUDA 12.x minor versions.
The problem is that these old CUDA versions don’t work with supported compilers, right? So if and when those compilers are removed from Nixpkgs, anyone packaging these unmaintained CUDA versions out‐of‐tree would also have to repackage old versions of GCC, which is pretty gnarly – that’s exactly why it’s an unreasonable maintenance burden on GCC maintainers currently. But for anyone who is up for handling that, I think the maintenance effort of having to vendor some of the CUDA packaging code isn’t too bad compared to that. For most users who, for whatever reason, can’t use a CUDA version that Nvidia and the wider ecosystem is officially supporting and updating, and who therefore don’t need updates, I think pinning an older Nixpkgs release is likely to be the better approach.

That said, this would only introduce a deprecation warning, so users would still be able to use those versions on 24.11 while work is done on the CUDA side to make it more modular and extensible in the unstable branch. It’s just that right now, these old, unmaintained versions of CUDA are acting as a significant drag on maintainers of core packages like GCC. Old CUDA versions are the only significant thing depending on unsupported versions of GCC with no viable plan to migrate them to supported ones. There are currently five GCC versions that it would be possible and desirable to remove support for if not for that.

You can see in my pull request to remove GCC 6 how much additional code, complexity, and vendored patches these old versions require, including shipping versions of multiple old libraries just to keep those GCCs alive; indeed, a significant portion of the total complexity of the current GCC packaging relates to support of these old versions. The need to support these old GCCs has also led to additional work and complexity for the upcoming rework of Darwin support. We had a recent flurry of removing these old and almost entirely unused compiler versions, but I had to stop the train at #342242 when it reached GCC 7 because of these old CUDA versions.

It might not be realistic to only carry one version of GCC and one version of CUDA, but I don’t think we can let a long tail of old CUDA versions be a burden on the rest of Nixpkgs for an indefinite period. I appreciate that CUDA is a really important ecosystem with a lot of users, but given my extensive survey of what versions the in‐tree users of CUDA support, I think deprecating support for versions older than 12.0 over the course of a release cycle is a reasonable trade‐off, given how unreasonable it would be to expect Nixpkgs maintainers to keep unsupported compilers going indefinitely just for CUDA versions that even NVIDIA no longer cares to maintain.
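To make the “pin an older Nixpkgs release” escape hatch concrete: a user who genuinely needs CUDA 11.x after its removal could keep pulling it from a release that still ships it. A sketch (the input name and chosen attribute are illustrative):

```nix
{
  # Pin a Nixpkgs release that still carries cudaPackages_11.
  inputs.nixpkgs-cuda11.url = "github:NixOS/nixpkgs/nixos-24.11";

  outputs = { self, nixpkgs-cuda11 }: {
    packages.x86_64-linux.cudatoolkit11 =
      (import nixpkgs-cuda11 {
        system = "x86_64-linux";
        config.allowUnfree = true;
      }).cudaPackages_11.cudatoolkit;
  };
}
```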
This was enabling aliases on ofborg.
Force-pushed from 58eadd1 to 77eb5df (compare)
Hopefully this should work. If it does I think this should be ready to merge, although my TensorFlow question in #342112 (comment) is still relevant.
The binary package is alright, but there are still a few problems that make it less-than-ideal: …
ofborg is happy 🎉 Thanks for the response @hacker1024. I agree that having a working source build of TensorFlow is a good thing and that we should have it. Would it be reasonable to say, though, that since the TensorFlow source build is multiple versions behind, if nobody manages to wrangle it into shape by 25.05 it’d be okay to set …
This sounds reasonable to me. Sadly, I think the only real way to find out how much it matters to Nixpkgs users as a whole is to do this and see who complains. Note that some Keras packages need patching with this change due to some sort of package name discrepancy. I'll hopefully be able to share them later this week, but IIRC they're just some simple substitutions in …
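In case it helps whoever picks this up, the kind of substitution described might look like the following; the file name and strings are pure guesses, since the actual discrepancy wasn’t spelled out:

```nix
# Hypothetical postPatch for a Keras-adjacent Python package whose
# requirements refer to TensorFlow under a different distribution name.
postPatch = ''
  substituteInPlace setup.py \
    --replace-fail "tensorflow-cpu" "tensorflow"
'';
```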
Sounds reasonable. I think it should be okay to ship this as‐is, then, since we are only warning and not removing right now; I’ll try to send a follow‐up PR for the TensorFlow source build, and I’ll try to handle TensorRT if someone else doesn’t beat me to it, but I can’t guarantee I’ll be able to wrangle TensorFlow successfully, and anyway it would be best if someone with an investment in it maintained it, since I’m ending up as the maintainer of last resort for too many things anyway 😅
@SomeoneSerge @ConnorBaker gentle ping :) |
Description of changes
We currently package all CUDA versions from 10.0 onwards. In some cases, CUDA is the only thing preventing us from removing old versions of GCC. Since we currently don’t deprecate or remove CUDA versions, this will be an increasing drag on compiler maintenance in Nixpkgs going forward unless we establish a sensible policy. After discussing this with @SomeoneSerge in the context of old versions of GCC, I learned that there was already a desire to remove at least versions prior to 11.3, as those versions were only packaged in the old “runfile” format, but that it was blocked on someone doing the work to warn about the upcoming deprecation for a release cycle.
This change adds a release note and warnings indicating that CUDA 10.x and 11.x will be removed in Nixpkgs 25.05, about 8 months from now.
I chose this version cut‐off because these versions of CUDA require GCC < 12. GCC releases a major version every year, and seems to support about four releases at a time, releasing the last update to the oldest version and marking it as unsupported on their site around the time of the release of the next major version. Therefore, by the time of the 25.05 release, we should expect GCC 15 to be released and GCC 11 to become unsupported. Adding a warning and communicating the policy of only shipping CUDA versions that work with supported compilers in the release notes means that we should be able to clean up old versions as required without any issue or extensive deprecation period in future, without obligating us to do so if there is a strongly compelling reason to be more lenient. That should help solve both shipping an indefinitely‐growing list of CUDA versions and an indefinitely‐growing list of GCC and LLVM versions.
As I’m not a user of CUDA myself, I can’t be sure of how sensible this version support policy is, but I think it’s fair to say that it’s reasonable for Nixpkgs to choose not to maintain compiler versions that are unsupported upstream just for the sake of versions of CUDA that are also unmaintained. CUDA 11.x has not received an update for two years already, and would only become unsupported in Nixpkgs in over half a year’s time.
CUDA 10.x is currently unused in‐tree except for the unmaintained Caffe and NVIDIA DCGM, which depends on multiple CUDA versions solely so that it can provide plugins for those versions. The latest DCGM version has already removed support for CUDA 10.x and is just awaiting an update in Nixpkgs. They maintain a list of supported versions to build plugins for in their CMake build system, so it should be simple enough for us to only build support for the versions of CUDA that we support in Nixpkgs.
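As a sketch of what trimming that list might look like from the Nix side (the attribute and CMake flag names here are entirely hypothetical; DCGM’s actual build variables may differ):

```nix
# Hypothetical override: restrict DCGM's plugin builds to the CUDA
# versions Nixpkgs still supports, via its CMake version list.
dcgm.overrideAttrs (old: {
  cmakeFlags = (old.cmakeFlags or [ ]) ++ [
    "-DSUPPORTED_CUDA_VERSIONS=12" # assumed flag name; check DCGM's CMakeLists
  ];
})
```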
From what I can tell, CUDA 11.x is currently used by the following packages other than DCGM:
- `catboost`, because of catboost/catboost#2540 (Respect CMAKE_CUDA_ARCHITECTURES). It looks like upstream has since redesigned this part of their build system, so perhaps the problem is no longer present, or would be easier to fix.
- `magma_2_6_2`, an old version from before upstream added CUDA 12 support. This seems okay to break to me; that version is not maintained and will never be updated for new CUDA versions, and the CUDA support is optional.
- `paddlepaddle`, which, uh, also requires OpenSSL 1.1 of all things. PaddlePaddle/Paddle#67571 (Request for support for CUDA 12.4) states that PaddlePaddle supports up to 12.3.
- `python3Packages.cupy`, which is listed as “possibly incompatible with cutensor 2.0 that comes with `cudaPackages_12`”. I’m not sure what the “possibly” means here, but according to https://github.com/cupy/cupy/tree/v13.3.0?tab=readme-ov-file#installation they ship binary wheels using CUDA 12.x, so I think this should be fine.
- `python3Packages.tensorrt`, which supports CUDA 12.x going by https://github.com/NVIDIA/TensorRT/blob/release/10.4/CMakeLists.txt#L111.
- TensorFlow, which has a link to https://www.tensorflow.org/install/source#gpu above the `python3Packages.tensorflow-bin` definition, but that page lists the versions we package as supporting CUDA 12.x.

Given the years since CUDA 11.x received any update upstream, and the seemingly very limited set of packages that truly require it, I think the policy of being able to drop versions that require unsupported compilers starting from the next Nixpkgs release is a reasonable one, but of course I’m open to feedback from the CUDA maintainers about this.
Result of `nixpkgs-review` run on x86_64-linux:

- 11 packages blacklisted
- 1 package built
Things done

- Tested using sandboxing (`sandbox = relaxed` or `sandbox = true` set in `nix.conf`? See Nix manual)
- Tested, as applicable, with `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed; also see nixpkgs-review usage.
- Tested basic functionality of all binary files (usually in `./result/bin/`)

Add a 👍 reaction to pull requests you find important.