feat(ansible): upgrade for CUDA, TensorRT and CUDNN #5608

Open
wants to merge 7 commits into
base: main

amadeuszsz
Contributor

@amadeuszsz amadeuszsz commented Dec 24, 2024

Description

Updates versions for CUDA, CUDNN and TensorRT.

For more information see discussion and issue.

How was this PR tested?

  • Build docker locally.
./docker/build.sh --devel-only

Notes for reviewers

  • JetPack 6.1 and 6.2 use a prebuilt TensorRT with cuDNN 9. Starting with cuDNN 9, NVIDIA changed the package naming convention, so the Ansible script has to handle this exception (already done).
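
To illustrate the kind of exception the note describes, here is a minimal sketch in Python rather than Ansible. It is not the code from this PR. The package names are an assumption about NVIDIA's Debian naming (cuDNN 8 as `libcudnn8` / `libcudnn8-dev`, cuDNN 9 with the CUDA major version in the name), `cudnn_packages` is a hypothetical helper, and the version strings are illustrative inputs only.

```python
def cudnn_packages(cudnn_version: str, cuda_major: int) -> list[str]:
    """Return assumed apt package names for a given cuDNN version.

    Hypothetical helper: the naming scheme is an assumption, not taken from the PR.
    """
    major = int(cudnn_version.split(".")[0])
    if major >= 9:
        # Assumed cuDNN 9+ scheme: the CUDA major version is part of the name.
        return [
            f"libcudnn{major}-cuda-{cuda_major}",
            f"libcudnn{major}-dev-cuda-{cuda_major}",
        ]
    # Assumed cuDNN 8 and earlier scheme: no CUDA suffix in the package name.
    return [f"libcudnn{major}", f"libcudnn{major}-dev"]


# Illustrative version strings only; they are not the versions pinned by this PR.
print(cudnn_packages("8.9.5", 12))
print(cudnn_packages("9.5.1", 12))
```

Branching on the major version in one place keeps the rest of the install logic identical for both naming schemes.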

Effects on system behavior

None.

github-actions bot commented Dec 24, 2024

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@amadeuszsz amadeuszsz mentioned this pull request Dec 24, 2024
19 tasks
@amadeuszsz amadeuszsz changed the title feat(ansible): CUDA, TensorRT, CUDNN upgrade feat(ansible): upgrade for CUDA, TensorRT and CUDNN Dec 24, 2024
@amadeuszsz amadeuszsz self-assigned this Jan 10, 2025
@amadeuszsz amadeuszsz added the type:containers Docker containers, containerization of components, or container orchestration. label Jan 15, 2025
@amadeuszsz amadeuszsz marked this pull request as ready for review January 21, 2025 12:47
@amadeuszsz amadeuszsz added tag:run-health-check Run health-check and removed tag:run-health-check Run health-check labels Jan 21, 2025
Contributor Author

amadeuszsz commented Jan 22, 2025

@youtalk

  1. setup-universe fails due to space issues. I isolated the layer size differences between the current image and this PR:
  • [autoware.dev_env.cuda : Install CUDA devel libraries except for cuda-drivers]: 2804 MB vs. 2899 MB
  • [autoware.dev_env.tensorrt : Install cuDNN and TensorRT]: 2379 MB vs. 4239 MB
  • [autoware.dev_env.tensorrt : Install cuDNN and TensorRT Dev]: 2778 MB vs. 4672 MB

If I'm not mistaken, the accumulated layer size increases from 11.51 GB to 15.27 GB. Does that make a difference for us?
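
As a cross-check, the accumulated increase can be recomputed from the three per-layer figures. This is a minimal sketch that assumes all six numbers are in MB and that 1 GB = 1024 MB; neither unit convention is stated explicitly in the comment.

```python
# Per-layer sizes quoted above as (current image, this PR), assumed to be in MB.
LAYER_SIZES_MB = {
    "cuda: Install CUDA devel libraries except for cuda-drivers": (2804, 2899),
    "tensorrt: Install cuDNN and TensorRT": (2379, 4239),
    "tensorrt: Install cuDNN and TensorRT Dev": (2778, 4672),
}


def total_growth_mb(sizes):
    """Sum the per-layer size increase across all listed layers."""
    return sum(after - before for before, after in sizes.values())


growth_mb = total_growth_mb(LAYER_SIZES_MB)
growth_gb = growth_mb / 1024  # assumption: binary units (1 GB = 1024 MB)

print(f"growth: {growth_mb} MB (~{growth_gb:.2f} GB)")
# prints: growth: 3849 MB (~3.76 GB)
```

The result matches the reported totals, since 15.27 GB - 11.51 GB = 3.76 GB, so these three layers account for essentially the whole increase.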

  2. Also, while analyzing the setup-universe build log, I noticed that in [autoware.dev_env.cuda : Install cuda-drivers] we always install the latest available NVIDIA driver (565). That's not a bad thing (for now), but maybe we could stick with the same driver as described in the docs, if possible.

  3. Health-check should fail, but it succeeds. The CUDA upgrade requires this PR, but health-check uses the latest release from autoware.repos, which is autoware.universe 0.40.0 (this tag does not include the necessary changes yet). The build succeeds because CI pulls the cached autoware-base:cuda-latest image instead of building it from scratch. The cached autoware-base:cuda-latest contains the old CUDA dependencies, so the image can still be built with the autoware.universe 0.40.0 tag.
    I'm aware that base dependency upgrades are infrequent, but could we account for this in our CI pipeline? For the local Docker build I had an issue where autoware-base:cuda-latest and the other images were built in parallel, so before autoware-base:cuda-latest finished, the other images had already pulled the cached autoware-base:cuda-latest. We fixed it by building the autoware-base images first. Here, I don't even see autoware-base:cuda-latest start building.
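
The "build the base images first" ordering can be sketched as a dependency-ordered build plan. This is a minimal illustration using Python's standard `graphlib`, not the project's actual build script. Apart from `autoware-base:cuda-latest`, every image name and the dependency map itself are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each image to the set of images it is built FROM.
# Only autoware-base:cuda-latest is named in the discussion; the rest are hypothetical.
DEPENDS_ON = {
    "autoware-base:latest": set(),                           # assumed base image
    "autoware-base:cuda-latest": {"autoware-base:latest"},
    "example-devel-cuda": {"autoware-base:cuda-latest"},     # hypothetical
    "example-runtime-cuda": {"autoware-base:cuda-latest"},   # hypothetical
}


def build_waves(depends_on):
    """Group images into waves; a wave starts only after all earlier waves finish."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # images whose parents are all built
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = build_waves(DEPENDS_ON)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {', '.join(wave)}")
```

Images within one wave can be built in parallel, while an image that depends on a base image never starts before that base image is finished, so it cannot pick up a stale cached copy.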

Nevertheless, after fixing the space issue in setup-universe, there are a few ways to merge this PR:

  • Fix the issue in health-check, merge this PR just before releasing 0.41.0, and then make the 0.41.0 release.
  • Fix the issue in health-check, release autoware & autoware.universe 0.41.0, and then merge this PR.
  • Do not fix the issue in health-check, merge this PR, and deploy autoware-base:cuda-latest just before the 0.41.0 release.

Labels
tag:run-health-check Run health-check type:containers Docker containers, containerization of components, or container orchestration.