Use amazon linux 2023 runners for Docker builds (pytorch#136544)
Migrate these builds to Amazon Linux 2023. We want to build and test the Docker images in CD.

We appear to be hitting docker/buildx#379 when building Docker images on Amazon Linux 2023.

The Conda Docker build is timing out, while the Manywheel build runs but fails because BuildKit is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

The proposed solution is to fix it in user_data. Please see: pytorch/test-infra#5712

I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Work around the timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564) by setting the number of open files per container to 1048576.
Pull Request resolved: pytorch#136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <[email protected]>
2 people authored and BoyuanFeng committed Sep 25, 2024
1 parent 6fb1204 commit b6d12d8
Showing 5 changed files with 19 additions and 5 deletions.
6 changes: 6 additions & 0 deletions .ci/docker/conda/build.sh
@@ -37,6 +37,12 @@ esac

(
set -x
+ # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
+ # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
+ sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
+ sudo systemctl daemon-reload
+ sudo systemctl restart docker
+
docker build \
--target final \
--progress plain \
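The patch above rewrites the unit file and restarts Docker on every run. A minimal sketch of a guarded variant follows, assuming the unit file contains the line `LimitNOFILE=infinity` exactly. The helper name `patch_nofile_limit` is hypothetical and not part of this commit, and the demonstration runs against a throwaway file rather than the real `docker.service`.

```shell
# Hypothetical helper (not in the commit): replace LimitNOFILE=infinity with a
# finite limit in a systemd unit file. Returns 0 if the file was changed and
# 1 if there was nothing to patch, so callers can skip a needless restart.
patch_nofile_limit() {
  unit_file=$1
  limit=${2:-1048576}
  if grep -q '^LimitNOFILE=infinity$' "$unit_file"; then
    # Write to a temporary file and move it into place instead of `sed -i`,
    # so the sketch also works with sed implementations that lack GNU's -i.
    sed "s/^LimitNOFILE=infinity\$/LimitNOFILE=${limit}/" "$unit_file" > "$unit_file.tmp" &&
      mv "$unit_file.tmp" "$unit_file"
    return 0
  fi
  return 1
}

# Demonstration on a scratch file that imitates part of a unit file.
demo_unit=$(mktemp)
printf '[Service]\nLimitNOFILE=infinity\nLimitNPROC=infinity\n' > "$demo_unit"

if patch_nofile_limit "$demo_unit"; then first_run=patched; else first_run=unchanged; fi
if patch_nofile_limit "$demo_unit"; then second_run=patched; else second_run=unchanged; fi
patched_line=$(grep '^LimitNOFILE=' "$demo_unit")

echo "first run: $first_run, second run: $second_run, result: $patched_line"
rm -f "$demo_unit"
```

On a real runner the helper would need root privileges to edit `/usr/lib/systemd/system/docker.service`, and `systemctl daemon-reload` followed by `systemctl restart docker` would only be needed when it returns 0.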
1 change: 1 addition & 0 deletions .ci/docker/manywheel/Dockerfile
@@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8

ARG DEVTOOLSET_VERSION=9

# Note: This patch is required since CentOS has reached EOL
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
9 changes: 8 additions & 1 deletion .ci/docker/manywheel/build.sh
@@ -124,7 +124,14 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
fi
(
set -x
- DOCKER_BUILDKIT=1 docker build \
+
+ # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
+ # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
+ sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
+ sudo systemctl daemon-reload
+ sudo systemctl restart docker
+
+ DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
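The commit message attributes the earlier Manywheel failure to BuildKit being turned off, and this script enables it with a `DOCKER_BUILDKIT=1` prefix on the `docker build` command. The sketch below illustrates how that prefix form behaves in a POSIX shell: the variable is visible to that one command only. It uses `sh -c` as a stand-in for `docker build`, so it shows the scoping rule and nothing about Docker itself.

```shell
# Start from a clean state so the demonstration does not depend on the caller.
unset DOCKER_BUILDKIT

# With the prefix, the child process sees DOCKER_BUILDKIT=1.
with_prefix=$(DOCKER_BUILDKIT=1 sh -c 'echo "${DOCKER_BUILDKIT:-unset}"')

# Without the prefix, a later child process does not see it: the assignment
# applied to the single command above and was never exported.
without_prefix=$(sh -c 'echo "${DOCKER_BUILDKIT:-unset}"')

echo "with prefix: $with_prefix"
echo "without prefix: $without_prefix"
```

Scoping the variable to one command keeps the setting next to the build it affects, at the cost of having to repeat it on any other `docker build` call in the script.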
2 changes: 1 addition & 1 deletion .github/workflows/build-conda-images.yml
@@ -32,7 +32,7 @@ concurrency:
jobs:
build-docker:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
- runs-on: am2.linux.9xlarge.ephemeral
+ runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]
6 changes: 3 additions & 3 deletions .github/workflows/build-manywheel-images.yml
@@ -45,7 +45,7 @@ jobs:
build-docker-cuda:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
- runs-on: "${{ needs.get-label-type.outputs.label-type }}am2.linux.9xlarge.ephemeral"
+ runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
@@ -156,7 +156,7 @@ jobs:
build-docker-rocm:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
- runs-on: "${{ needs.get-label-type.outputs.label-type }}am2.linux.9xlarge.ephemeral"
+ runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
rocm_version: ["6.1", "6.2"]
@@ -192,7 +192,7 @@ jobs:
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
- runs-on: "${{ needs.get-label-type.outputs.label-type }}am2.linux.9xlarge.ephemeral"
+ runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
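Each job above picks its `environment` with the expression `(main branch || tag starting with v) && 'docker-build' || ''`. As a rough illustration of that selection logic, here is a shell sketch. The function name `select_environment` is hypothetical, and the sketch simplifies by taking a single ref argument, whereas the workflow reads `github.ref` for the branch check and `github.event.ref` for the tag check.

```shell
# Hypothetical helper: return "docker-build" for the main branch or a tag that
# starts with "v", and an empty string for any other ref.
select_environment() {
  case $1 in
    refs/heads/main|refs/tags/v*) echo "docker-build" ;;
    *) echo "" ;;
  esac
}

main_env=$(select_environment "refs/heads/main")
tag_env=$(select_environment "refs/tags/v2.5.0")        # example tag name
branch_env=$(select_environment "refs/heads/my-feature") # example branch name

echo "main: '$main_env', tag: '$tag_env', branch: '$branch_env'"
```

Under this reading, runs on other branches resolve to an empty environment name.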
