Amazon Linux AMI 2023: docker run yum install very slow, while docker build is fine #5712

Open
atalman opened this issue Sep 25, 2024 · 0 comments

atalman commented Sep 25, 2024

The new Amazon Linux 2023 AMI has an issue that shows up during Docker builds. Please refer to this PR: pytorch/pytorch#136544

The problem is that the Docker package ships a systemd service with LimitNOFILE=infinity, which sets the default open-file limit (--ulimit nofile) that containers inherit.
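
To confirm that a runner is affected, the limit configured on the unit can be compared with what a container actually sees. This is a minimal sketch, not part of the PR: the helper name `nofile_is_unbounded` is hypothetical, the service is assumed to be named `docker`, and `amazonlinux:2023` is only an example image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a LimitNOFILE value is
# "infinity"/"unlimited" or numerically larger than 1048576.
nofile_is_unbounded() {
  local value="$1"
  case "$value" in
    infinity|unlimited) return 0 ;;
    ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
  esac
  # More than 7 digits is always larger than 1048576; checking the length
  # first avoids integer overflow on very large values.
  if [ "${#value}" -gt 7 ]; then
    return 0
  fi
  [ "$value" -gt 1048576 ]
}

# Usage on a runner (not executed here):
#   limit="$(systemctl show docker --property=LimitNOFILE --value)"
#   nofile_is_unbounded "$limit" && echo "docker.service nofile limit is oversized"
#   docker run --rm amazonlinux:2023 sh -c 'ulimit -n'
```

If the helper reports an oversized limit and `ulimit -n` inside the container prints a very large number, the host likely matches the behavior described in this issue.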

The patch needs to be applied in the runner user data script:
https://github.com/pytorch/test-infra/blob/main/terraform-aws-github-runner/modules/runners-instances/templates/user-data.sh#L89

This is the patch:

sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
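
The same substitution can be wrapped so that it is safe to run repeatedly from user data and can be tried against a copy of the unit file first. This is a sketch under assumptions: the function name `patch_docker_nofile` and its arguments are hypothetical, GNU `sed -i` is assumed (as on Amazon Linux), and the daemon-reload and restart steps from the patch above are left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed patch: rewrite LimitNOFILE=infinity
# to a fixed value in the given unit file. Running it a second time
# changes nothing.
patch_docker_nofile() {
  local unit_file="${1:-/usr/lib/systemd/system/docker.service}"
  local limit="${2:-1048576}"
  [ -f "$unit_file" ] || { echo "unit file not found: $unit_file" >&2; return 1; }
  if grep -q '^LimitNOFILE=infinity' "$unit_file"; then
    sed -i "s/^LimitNOFILE=infinity/LimitNOFILE=${limit}/" "$unit_file"
  fi
  # Succeed only if the unit now carries the expected limit.
  grep -q "^LimitNOFILE=${limit}\$" "$unit_file"
}
```

Run with sufficient privileges, `patch_docker_nofile` with no arguments targets the same path as the patch above. A non-zero exit status means the unit file is missing or does not contain the expected `LimitNOFILE` line.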

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Sep 25, 2024
Migrate these builds to Amazon Linux 2023. We want to build and test the Docker images in CD.

It looks like we are hitting this issue: docker/buildx#379 when trying to build Docker images on Amazon Linux 2023.

The Conda Docker build is timing out, while the Manywheel build runs but fails because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

The proposed solution is to fix it in user_data. Please see: pytorch/test-infra#5712

I see that the Docker builds execute successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Work around the timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564) by setting the open-file limit per container to 1048576.
Pull Request resolved: #136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <[email protected]>
BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this issue Sep 25, 2024