{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3 #20358

Merged

Conversation

yqshao
Contributor

@yqshao yqshao commented Apr 13, 2024

@yqshao

This comment was marked as resolved.

@yqshao
Contributor Author

yqshao commented Apr 15, 2024

Test report by @yqshao
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-10 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 2 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/yqshao/fad950f321c4e87bccd8f3a4369e8bf9 for a full test report.

@migueldiascosta migueldiascosta added this to the 4.x milestone Apr 16, 2024
@tiwoe

tiwoe commented Apr 16, 2024

Thank you for the effort. Can you add the patch files TensorFlow-2.15.1_remove-duplicate-gpu-tests.patch and TensorFlow-2.15.1_fix-cuda_build_defs.patch to your PR?

@yqshao
Contributor Author

yqshao commented Apr 16, 2024

Sorry, missed that. There's also a rebased-on-dependencies version at yqshao/[email protected], but I'll wait a bit (until the deps are merged) before force-pushing here...

@casparvl
Contributor

casparvl commented Jun 5, 2024

#20191 is now merged. From your previous comment here, I think you wanted to make some more changes in this PR? Let me know once those are done, then we can also start reviewing/testing this one again :)

@yqshao yqshao force-pushed the 20240413152217_new_pr_TensorFlow2151 branch from 2887c06 to f9d74a0 Compare June 7, 2024 10:05
@yqshao yqshao force-pushed the 20240413152217_new_pr_TensorFlow2151 branch 3 times, most recently from 866e5cc to dae815e Compare June 7, 2024 13:28
@yqshao yqshao force-pushed the 20240413152217_new_pr_TensorFlow2151 branch from dae815e to 0bcde97 Compare June 7, 2024 13:51
@yqshao
Contributor Author

yqshao commented Jun 7, 2024

Test report by @yqshao
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis1-04 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 2 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/yqshao/2a41b12be92022e0171b420282dd028b for a full test report.

@yqshao
Contributor Author

yqshao commented Jun 7, 2024

Hi, I checked the PR again and there is not much added relative to the CPU version. However, I have to admit that I let some patches slip through that should still be relevant (sorry for only noticing in hindsight). I added the following back to both the CPU and CUDA configs, but they are not tested on our system, so I would appreciate cross-checks. @casparvl @Flamefire

@Flamefire
Contributor

Makes sense

disable-avx512-extensions: though I did not seem to reproduce the issue with our build without the patch on Skylake CPUs;

I can check if this is still required on skylake and cascade-lake but I'm pretty sure it is

@casparvl
Contributor

Hm, good point, I should have probably also tested that PR for the CPU version on our GPU nodes, they have AVX512 capabilities... For now at least, I'll upload test reports for this full pr from our GPU nodes and another one for the CPU version from our CPU nodes. Build is going right now, so test reports should appear later this afternoon...
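
For anyone wanting to confirm whether a given build node actually exposes AVX512 before testing the disable-avx512-extensions patch, a minimal sketch follows. It assumes a Linux node where /proc/cpuinfo has a "flags" line; the helper names avx512_flags and read_cpuinfo are made up for illustration and are not part of EasyBuild or TensorFlow.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX512-related CPU flags found in cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux, each logical CPU has a line such as "flags : fpu sse ..."
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(flags)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; return an empty string if it is not available."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    found = avx512_flags(read_cpuinfo())
    print("AVX512 extensions:", ", ".join(found) if found else "none detected")
```

An empty result only means no avx512* flag was listed on that node; it says nothing about whether the patch is needed elsewhere.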

@casparvl
Contributor

@boegelbot please test @ generoso

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13690

Test results coming soon (I hope)...

- notification for comment with ID 2157944314 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/5e92fce2e6c07a0eacfafce52d978280 for a full test report.

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/52894745607e7457540cdc9bd83633a7 for a full test report.

@casparvl
Contributor

Oh, silly, I forgot to instruct boegelbot to use the new easyblock. So... we can ignore that failure.

Regarding my own tests: the GPU build on my GPU node failed. It has failing tests:

[  FAILED  ] 4 tests, listed below:
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x128x64, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul1x256x256, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x256x1, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x128x64WithActivation, where TypeParam = float

 4 FAILED TESTS

The output is all looking similar to this:

tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.6044831275939941 not close to 2.6047005653381348)
Expected: true
i = 0 Tx[i] = 2.6044831275939941 Ty[i] = 2.6047005653381348
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.9819025993347168 not close to 2.9816701412200928)
Expected: true
i = 1 Tx[i] = 2.9819025993347168 Ty[i] = 2.9816701412200928
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.4911799430847168 not close to 2.491544246673584)
Expected: true
i = 2 Tx[i] = 2.4911799430847168 Ty[i] = 2.491544246673584
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.9187320470809937 not close to 1.9185069799423218)
Expected: true
i = 4 Tx[i] = 1.9187320470809937 Ty[i] = 1.9185069799423218
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.6246750354766846 not close to 1.6246215105056763)
Expected: true
i = 7 Tx[i] = 1.6246750354766846 Ty[i] = 1.6246215105056763
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.22215569019317627 not close to 0.22175788879394531)
Expected: true
i = 8 Tx[i] = 0.22215569019317627 Ty[i] = 0.22175788879394531
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.4460333585739136 not close to 1.4467771053314209)
Expected: true
i = 11 Tx[i] = 1.4460333585739136 Ty[i] = 1.4467771053314209
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.91000175476074219 not close to 0.90951979160308838)
Expected: true
i = 13 Tx[i] = 0.91000175476074219 Ty[i] = 0.90951979160308838
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (3.9072332382202148 not close to 3.907407283782959)
Expected: true
i = 14 Tx[i] = 3.9072332382202148 Ty[i] = 3.907407283782959
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.46104967594146729 not close to 0.46095246076583862)
Expected: true
i = 15 Tx[i] = 0.46104967594146729 Ty[i] = 0.46095246076583862
tensorflow/core/framework/tensor_testutil.cc:187: Failure
Expected: (num_failures) < (max_failures), actual: 10 vs 10
Too many mismatches (atol = 1.0000000000000001e-05 rtol = -1), giving up.

In other words: the numbers are close, but not close enough to meet the tolerance. My bet is this is another example of tolerances that are exceeded as a result of the TF32 datatype. Do these ring a bell @Flamefire ? Didn't you at some point have a patch to increase those tolerances (or was that for PyTorch...)?
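
To make the size of the mismatch concrete, here is a small sketch that re-checks a few of the value pairs from the log against the reported atol of 1e-05. The is_close formula is an assumed numpy-style check (with rtol taken as 0 for illustration), not TensorFlow's actual IsClose implementation, and the function names are hypothetical.

```python
# (Tx[i], Ty[i]) pairs copied from the failure output above.
PAIRS = [
    (2.6044831275939941, 2.6047005653381348),
    (2.9819025993347168, 2.9816701412200928),
    (2.4911799430847168, 2.491544246673584),
    (0.22215569019317627, 0.22175788879394531),
]

ATOL = 1e-5  # absolute tolerance reported in the log

# TF32 keeps 10 explicit mantissa bits, so a single rounding step can
# introduce a relative error of up to roughly 2**-11.
TF32_UNIT_ROUNDOFF = 2.0 ** -11


def is_close(x, y, atol=ATOL, rtol=0.0):
    # Assumed formula for illustration; TensorFlow's own check may differ.
    return abs(x - y) <= atol + rtol * abs(y)


def summarize(pairs):
    """Return (absolute difference, relative difference, passes) per pair."""
    rows = []
    for x, y in pairs:
        diff = abs(x - y)
        rows.append((diff, diff / abs(y), is_close(x, y)))
    return rows


for diff, rel, ok in summarize(PAIRS):
    print(f"abs diff = {diff:.3e}  rel diff = {rel:.3e}  within atol: {ok}")
print(f"TF32 unit roundoff is about {TF32_UNIT_ROUNDOFF:.3e}")
```

For these four pairs every absolute difference lies between 1e-4 and 1e-3, so one to two orders of magnitude above the 1e-05 tolerance, which fits the reduced-precision hypothesis but does not prove it.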

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"
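
For reproducing this kind of test run outside boegelbot, the sketch below assembles a comparable eb command line. It relies on the EasyBuild options --from-pr, --robot, --include-easyblocks-from-pr and --upload-test-report; the eb_command helper is hypothetical, and the exact command that the boegelbot script runs is not shown in this thread, so this is an approximation only.

```python
import shlex


def eb_command(easyconfigs_pr, easyblocks_pr=None, upload_report=False):
    """Build (but do not run) an eb command line for testing an easyconfigs PR."""
    argv = ["eb", "--from-pr", str(easyconfigs_pr), "--robot"]
    if easyblocks_pr is not None:
        # Pick up the not-yet-merged easyblock changes from the easyblocks PR.
        argv += ["--include-easyblocks-from-pr", str(easyblocks_pr)]
    if upload_report:
        # Uploading needs GitHub credentials configured for EasyBuild.
        argv.append("--upload-test-report")
    return shlex.join(argv)


print(eb_command(20358, easyblocks_pr=3303))
```

Leaving out easyblocks_pr corresponds to the first boegelbot request above, where EB_ARGS was empty.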

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
gcn80.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 9334 32-Core Processor, 4 x NVIDIA NVIDIA H100, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/5deb92ce7fa72e33cfc44d7994aae9c2 for a full test report.

@boegel
Member

boegel commented Aug 2, 2024

@casparvl Do you think this is ready to go now?

@casparvl
Contributor

Yeah, I've been hesitant to pull the trigger on this one, but @Flamefire's failing build was in one of the dependencies. I've asked @laraPPr to upload a test report from her system, and I'll also trigger boegelbot again for a final set of tests. If successful, I say we merge, since it probably works for the majority of people (and we can tackle any remaining issues in follow-up PRs).

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14057

Test results coming soon (I hope)...

- notification for comment with ID 2286064510 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 1 out of 3 (3 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/ddfe8bcfef27486b4af9bb587be27070 for a full test report.

@casparvl
Contributor

@boegelbot please test @ jsc-zen3
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4668

Test results coming soon (I hope)...

- notification for comment with ID 2286423224 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

Ah...

Kenneth Hoste (boegel)
  4:02 PM
FYI: I’m doing a forced rebuild of Python/3.11.3-GCCcore-12.3.0 on jsc-zen3, it got messed up by a pip install command that is run via setup.py, see https://github.com/Juniper/py-junos-eznc/issues/1318 + https://github.com/easybuilders/easybuild-easyconfigs/pull/21166
edit: same problem on generoso (edited) 

So that explains the generoso failure...

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14059

Test results coming soon (I hope)...

- notification for comment with ID 2286482354 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/edf15f9663f507ee4ef7379411c8e0a5 for a full test report.

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14074

Test results coming soon (I hope)...

- notification for comment with ID 2288508553 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@laraPPr
Contributor

laraPPr commented Aug 14, 2024

Test report by @laraPPr
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
node4009.donphan.os - Linux RHEL 8.8 (Ootpa), x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 1 x NVIDIA NVIDIA A2, 545.23.08, Python 3.11.3
See https://gist.github.com/laraPPr/f0a871731a5f0138e53fd6e6454d1000 for a full test report.

@laraPPr
Contributor

laraPPr commented Aug 14, 2024

The third one failed because of a lock. I will clean it up and retrigger the failed build later.

@boegel
Member

boegel commented Aug 14, 2024

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/242c392c61a7c8765863dc76fd7b4eb3 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/92f838f6a845e5eb7d22abf5cafc7b9d for a full test report.

@boegel
Member

boegel commented Aug 15, 2024

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3302.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/43f6d9edffd144a580247e8a04247699 for a full test report.

@boegel
Member

boegel commented Aug 15, 2024

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303 TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303 TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14082

Test results coming soon (I hope)...

- notification for comment with ID 2291702091 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/b0da64e1ba8b233116b68db4cbf9ba3c for a full test report.

@boegel boegel changed the title {lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 {lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3 Aug 20, 2024
Member

@boegel boegel left a comment

lgtm

@boegel boegel modified the milestones: 4.x, release after 4.9.2 Aug 20, 2024
@boegel
Member

boegel commented Aug 20, 2024

Going in, thanks @yqshao!

@boegel boegel merged commit b688717 into easybuilders:develop Aug 20, 2024
9 checks passed
@yqshao yqshao deleted the 20240413152217_new_pr_TensorFlow2151 branch August 20, 2024 13:50