Bump base CUDA image to ubuntu24.04 (#1166)
Ubuntu 24.04 uses `python-3.12` as its main interpreter. Unfortunately, not
all of the Python packages we use here have py-3.12 wheels for amd64/arm64,
so the following packages need to be built from source:
1. TF-Text
2. Lingvo
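
Whether a given wheel targets py-3.12 can be read from its file name, since wheel names end in `-{python tag}-{abi tag}-{platform tag}.whl`. A minimal sketch of that check, assuming the standard wheel naming scheme; the helper names and the file name in the usage comment are hypothetical, and the check deliberately ignores `abi3` and pure-Python (`py3`) wheels:

```shell
# Print the python tag of a wheel, i.e. the third dash-separated field from
# the end of the file name once the directory and ".whl" suffix are stripped.
wheel_python_tag() {
  basename "$1" .whl | awk -F- '{ print $(NF-2) }'
}

# Succeed only if the wheel's python tag mentions CPython 3.12.
wheel_targets_py312() {
  case "$(wheel_python_tag "$1")" in
    *cp312*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage:
# wheel_targets_py312 /opt/tensorflow_text-2.18.0-cp312-cp312-linux_x86_64.whl \
#   && echo "usable on python-3.12"
```

If no published wheel passes such a check for both architectures, building from source (as done here for TF-Text and Lingvo) is the fallback.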

Ubuntu 24.04 also marks its system `python-3.12` as an externally managed
environment ([PEP 668](https://peps.python.org/pep-0668/)), so installing
packages system-wide with `pip` fails:
```
error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.
```
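
Per PEP 668, the error is triggered by a marker file named `EXTERNALLY-MANAGED` in the interpreter's stdlib directory. A minimal sketch of checking for it; the function name is hypothetical, and the commented usage assumes `python3` is on `PATH`:

```shell
# Succeed if the given stdlib directory carries the PEP 668 marker file.
is_externally_managed() {
  [ -f "$1/EXTERNALLY-MANAGED" ]
}

# Hypothetical usage: ask the interpreter for its stdlib path, then check it.
# stdlib_dir="$(python3 -c 'import sysconfig; print(sysconfig.get_path("stdlib"))')"
# is_externally_managed "${stdlib_dir}" && echo "pip will refuse system-wide installs"
```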
There are at least 2 possible solutions:
1. Install everything into a `venv` (the initial solution, proposed by
@olupton).
2. Install system-wide by forcing pip installation with the env flag
`PIP_BREAK_SYSTEM_PACKAGES=1`.

This branch contains both solutions, but the collective experience of the
PyTorch team suggests settling on the second solution (system-wide
installation).
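
The two options can be sketched as shell helpers. This is an illustration only, not the code used in the Dockerfiles; the function names and the venv path in the usage comment are hypothetical:

```shell
# Option 1: install into a virtual environment.
# The first argument is the venv path, the rest are passed to pip.
install_in_venv() {
  venv_dir="$1"
  shift
  python3 -m venv "${venv_dir}"
  "${venv_dir}/bin/pip" install "$@"
}

# Option 2: install system-wide, overriding the PEP 668 check for this call.
install_system_wide() {
  PIP_BREAK_SYSTEM_PACKAGES=1 python3 -m pip install "$@"
}

# Hypothetical usage:
# install_in_venv /opt/venv pip-tools
# install_system_wide pip-tools
```

In this commit the second option is applied image-wide through `ENV PIP_BREAK_SYSTEM_PACKAGES=1` in `Dockerfile.base`, so every later `pip install` inherits it.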

---------

Co-authored-by: Yu-Hang 'Maxin' Tang <[email protected]>
DwarKapex and yhtang authored Dec 4, 2024
1 parent 5c4b687 commit 2a74610
Showing 13 changed files with 202 additions and 199 deletions.
31 changes: 26 additions & 5 deletions .github/container/Dockerfile.base
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:1-labs
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu22.04
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu24.04
ARG GIT_USER_NAME="JAX Toolbox"
ARG [email protected]
ARG CLANG_VERSION=18
@@ -60,7 +60,8 @@ apt_packages=(
wget
jq
# llvm.sh
lsb-release software-properties-common
lsb-release
software-properties-common
# GCP autoconfig
pciutils hwloc bind9-host
)
@@ -74,8 +75,6 @@ apt-get install -y ${apt_packages[@]}

# Install LLVM/Clang
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" -- ${CLANG_VERSION}
apt-get remove -y software-properties-common lsb-release
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX

# Make sure that clang and clang++ point to the new version. This list is based
# on the symlinks installed by the `clang` (as opposed to `clang-14`) and `lld`
@@ -106,6 +105,21 @@ EOL

apt-get clean
rm -rf /var/lib/apt/lists/*

# Several Python packages (listed below) are installed by the OS package
# manager (the `apt-get install` run above) and cannot be uninstalled with pip
# (in the pip-finalize.sh script) during JAX installation. Remove them in
# advance to avoid JAX installation issues.
remove_packages=(
python3-gi
software-properties-common
lsb-release
python3-yaml
python3-pygments
)

apt-get remove -y ${remove_packages[@]}
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX
EOF

RUN <<"EOF" bash -ex
@@ -129,7 +143,14 @@ git apply </opt/pip/pip-vcs-equivalency.patch
git add -u
git commit -m 'Adds JAX_TOOLBOX_VCS_EQUIVALENCY as a trigger to treat all github VCS installs for a package as equivalent. The spec of the last encountered version will be used'
EOF
RUN pip install --upgrade --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*

# Install all Python packages system-wide.
ENV PIP_BREAK_SYSTEM_PACKAGES=1
# The extra flag `--ignore-installed` is added below for the following reason:
# after upgrading to ver 23.3.1 (from /opt/pip), `pip` tries to uninstall itself (default pip-24.0)
# and fails because pip-24.0 was installed by the system tool `apt` rather than by `python`. So we keep
# both pip-24.0 and pip-23.3.1 in the system, but use 23.3.1 with the equivalency patch (see above).
RUN pip install --upgrade --ignore-installed --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*

###############################################################################
## Install TCPx
1 change: 0 additions & 1 deletion .github/container/Dockerfile.jax
@@ -36,7 +36,6 @@ RUN --mount=type=ssh \
--mount=type=secret,id=SSH_KNOWN_HOSTS,target=/root/.ssh/known_hosts \
<<"EOF" bash -ex
git-clone.sh ${URLREF_JAX} ${SRC_PATH_JAX}
sed 's/^numpy.*/numpy<2.0.0/' ${SRC_PATH_JAX}/build/requirements.in
git-clone.sh ${URLREF_XLA} ${SRC_PATH_XLA}
EOF

Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@

ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_MAXTEXT=https://github.com/google/maxtext.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT=/opt/maxtext
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

@@ -17,18 +17,20 @@ FROM ${BASE_IMAGE} as wheel-builder
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT

RUN pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
RUN git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
RUN <<"EOF" bash -exu -o pipefail
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0
git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow/commit_sha=1cb1a030a62b169d90d34c747ab9b09f332bf905/" oss_scripts/prepare_tf_dep.sh
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's --undefined-version check of lld is strict
# by default (https://reviews.llvm.org/D135402), but the tftext build seems to
@@ -38,14 +40,13 @@ echo "write_to_bazelrc \"build --linkopt='-Wl,--undefined-version'\"" >> oss_scr
./oss_scripts/run_build.sh
EOF


###############################################################################
## Download source and add auxiliary scripts
###############################################################################

FROM ${BASE_IMAGE} as mealkit
ARG URLREF_MAXTEXT
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

@@ -56,6 +57,17 @@ RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-
RUN <<"EOF" bash -ex
git-clone.sh ${URLREF_MAXTEXT} ${SRC_PATH_MAXTEXT}
echo "-r ${SRC_PATH_MAXTEXT}/requirements.txt" >> /opt/pip-tools.d/requirements-maxtext.in

# Specify some version constraints to speed up the build and
# keep pip from downloading and checking all available versions of packages
for pattern in \
"s|absl-py|absl-py>=2.1.0|g" \
"s|protobuf==3.20.3|protobuf>=3.19.0|g" \
"s|tensorflow-datasets|tensorflow-datasets>=4.8.0|g" \
; do
sed -i "${pattern}" ${SRC_PATH_MAXTEXT}/requirements.txt;
done
echo "tensorflow-metadata>=1.15.0" >> ${SRC_PATH_MAXTEXT}/requirements.txt
EOF

###############################################################################
@@ -73,3 +85,6 @@ FROM mealkit as final
RUN pip-finalize.sh

WORKDIR ${SRC_PATH_MAXTEXT}

# When tftext and lingvo wheels are published on pypi.org, revert this
# Dockerfile to 5c4b687b918e6569bca43758c346ad8e67460154
34 changes: 0 additions & 34 deletions .github/container/Dockerfile.maxtext.amd64

This file was deleted.

Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_PAXML=https://github.com/google/paxml.git#main
ARG URLREF_PRAXIS=https://github.com/google/praxis.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG URLREF_LINGVO=https://github.com/tensorflow/lingvo.git#master
ARG SRC_PATH_PAXML=/opt/paxml
ARG SRC_PATH_PRAXIS=/opt/praxis
@@ -21,18 +21,19 @@ FROM ${BASE_IMAGE} as wheel-builder
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT
RUN <<"EOF" bash -exu -o pipefail
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow/commit_sha=1cb1a030a62b169d90d34c747ab9b09f332bf905/" oss_scripts/prepare_tf_dep.sh
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's --undefined-version check of lld is strict
# by default (https://reviews.llvm.org/D135402), but the tftext build seems to
@@ -46,6 +47,7 @@ EOF
# build lingvo
#------------------------------------------------------------------------------

# Remove Lingvo build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as lingvo-builder
ARG URLREF_LINGVO
ARG SRC_PATH_TFTEXT
@@ -55,15 +57,16 @@ ARG SRC_PATH_LINGVO
COPY --from=tftext-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/

RUN <<"EOF" bash -exu -o pipefail
git-clone.sh ${URLREF_LINGVO} ${SRC_PATH_LINGVO}
EOF

ENV USE_BAZEL_VERSION=7.1.2

# build lingvo
RUN <<"EOF" bash -exu -o pipefail
git-clone.sh ${URLREF_LINGVO} ${SRC_PATH_LINGVO}
pushd ${SRC_PATH_LINGVO}

CPU_ARCH="$(dpkg --print-architecture)"
if [[ "${CPU_ARCH}" == "arm64" ]]; then

# Use aarch distribution of protobufs
patch -p1 <<"EOFINNER"
diff --git a/lingvo/repo.bzl b/lingvo/repo.bzl
@@ -84,13 +87,34 @@ index ce65822d2..d9c0277aa 100644
def icu():
EOFINNER

pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0 /opt/tensorflow_text*.whl
sed -i 's/tensorflow=/#tensorflow=/' docker/dev.requirements.txt
sed -i 's/tensorflow-text=/#tensorflow-text=/' docker/dev.requirements.txt
sed -i 's/dataclasses=/#dataclasses=/' docker/dev.requirements.txt
fi

pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0 /opt/tensorflow_text*.whl
for pattern in \
"s|tensorflow=|#tensorflow=|g" \
"s|tensorflow-text=|#tensorflow-text=|g" \
"s|dataclasses=|#dataclasses=|g" \
"s|==.*||g" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/docker/dev.requirements.txt
done
# Lingvo support only python < 3.12, so we hack it and update dependencies
# to be able to build for py-3.12
for pattern in \
"s|tensorflow-text~=2.13.0|tensorflow-text~=2.18.0|g" \
"s|tensorflow~=2.13.0|tensorflow~=2.18.0|g" \
"s|python_requires='>=3.8,<3.11'|python_requires='>=3.8,<3.13'|" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/pip_package/setup.py;
done
pip install -r docker/dev.requirements.txt

# Some tests are flaky right now, so we skip running the tests.
BUILD_ARCH="x86_64"
if [[ "$CPU_ARCH" == "arm64" ]]; then
BUILD_ARCH="aarch64";
fi
sed -i 's/manylinux2014_x86_64/manylinux_2_38_'"${BUILD_ARCH}"'/' pip_package/build.sh
SKIP_TESTS=1 PYTHON_MINOR_VERSION=$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 2) pip_package/build.sh
EOF

@@ -108,15 +132,14 @@ ARG SRC_PATH_TFTEXT

# Preserve version information of tensorflow-text and lingvo
COPY --from=lingvo-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=lingvo-builder /tmp/lingvo/dist/lingvo*linux_aarch64.whl /opt/
COPY --from=lingvo-builder /tmp/lingvo/dist/lingvo*-linux*.whl /opt/
RUN echo "lingvo @ file://$(ls /opt/lingvo*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/
RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

# paxml + praxis
RUN <<"EOF" bash -ex
echo "tensorflow==2.13.0" >> /opt/pip-tools.d/requirements-paxml.in
echo "tensorflow_datasets==4.9.2" >> /opt/pip-tools.d/requirements-paxml.in
echo "auditwheel" >> /opt/pip-tools.d/requirements-paxml.in

@@ -131,11 +154,14 @@ for src in ${SRC_PATH_PAXML} ${SRC_PATH_PRAXIS}; do
for pattern in \
"s| @ git+https://github.com/google/flax||g" \
"s| @ git+https://github.com/google/jax||g" \
"s| @ git+https://github.com/google/fiddle||g" \
"s|^tensorflow|#tensorflow|" \
"s|^lingvo|#lingvo|" \
"s|^scikit-learn|#scikit-learn|" \
"s|^protobuf|#protobuf|" \
"s|^numpy|#numpy|" \
"s|^orbax-checkpoint|#orbax-checkpoint|" \
"s| @ git+https://github.com/google/CommonLoopUtils||g" \
; do
sed -i "${pattern}" */pip_package/requirements.txt requirements.in
done
@@ -148,6 +174,7 @@ for src in ${SRC_PATH_PAXML} ${SRC_PATH_PRAXIS}; do
fi
popd
done
sed -i 's/pysimdjson==[0-9.]*/pysimdjson/' ${SRC_PATH_PAXML}/setup.py
EOF

ADD test-pax.sh /usr/local/bin
@@ -159,3 +186,6 @@ ADD test-pax.sh /usr/local/bin
FROM mealkit as final

RUN pip-finalize.sh

# When tftext and lingvo wheels are published on pypi.org, revert this
# Dockerfile to 5c4b687b918e6569bca43758c346ad8e67460154
53 changes: 0 additions & 53 deletions .github/container/Dockerfile.pax.amd64

This file was deleted.
