-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3 #20358
{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3 #20358
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
Test report by @yqshao |
Thank you for the effort. Can you add the patch files to your pr? TensorFlow-2.15.1_remove-duplicate-gpu-tests.patch and TensorFlow-2.15.1_fix-cuda_build_defs.patch? |
Sorry, missed that, |
#20191 is now merged. From your previous comment here, I think you wanted to make some more changes in this PR? Let me know once those are done, then we can also start reviewing/testing this one again :) |
2887c06
to
f9d74a0
Compare
866e5cc
to
dae815e
Compare
dae815e
to
0bcde97
Compare
Test report by @yqshao |
Hi, I checked again the PR and there is not much addition from the CPU version; however I have to admit that I have let some patches slip through which should still be relevant (sorry for the hindsight) I added back the following back to both the CPU and CUDA configs, but those are not tested on our system, so I would appreciate cross-checks. @casparvl @Flamefire
|
Makes sense
I can check if this is still required on skylake and cascade-lake but I'm pretty sure it is |
Hm, good point, I should have probably also tested that PR for the CPU version on our GPU nodes, they have AVX512 capabilities... For now at least, I'll upload test reports for this full pr from our GPU nodes and another one for the CPU version from our CPU nodes. Build is going right now, so test reports should appear later this afternoon... |
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2157944314 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
Test report by @casparvl |
Oh, silly, I forgot to instruct boegelbot to use the new easyblock. So... we can ignore that failure. Regarding my own tests: the GPU build on my GPU node failed. It has failing tests:
The output is all looking similar to this:
In other words: the numbers are close, but not close enough to meet the tolerance. My bet is this is another example of tolerances that are exceeded as a result of the TF32 datatype. Do these ring a bell @Flamefire ? Didn't you at some point have a patch to increase those tolerances (or was that for PyTorch...)? |
@boegelbot please test @ generoso |
Test report by @casparvl |
@casparvl Is this ready to go now you think? |
Yeah, I've been hesitant to pull the trigger on this one, but @Flamefire 's failing build was in one of the dependencies. I've asked @laraPPr to upload some test report from her system, I'll also trigger boegelbot again for a final set of tests. If succesfull, I say we merge, since it probably works for the majority of people (and tackle any remaining issues in follow-up PRs). |
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2286064510 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
@boegelbot please test @ jsc-zen3 |
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2286423224 processed Message to humans: this is just bookkeeping information for me, |
Ah...
So that explains the |
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2286482354 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2288508553 processed Message to humans: this is just bookkeeping information for me, |
Test report by @laraPPr |
the third one failed because of lock will clean it up and retrigger the one that failed later |
Test report by @boegel |
Test report by @boegelbot |
Test report by @boegel |
@boegelbot please test @ generoso |
@boegel: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2291702091 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Going in, thanks @yqshao! |
(created using
eb --new-pr
)Waiting for #20191depends on: