Bump based CUDA image to ubuntu24.04 #1166

Merged
merged 23 commits into from
Dec 4, 2024
Changes from 20 commits
Commits
d602ff3
Test docker hub ubuntu24.04
DwarKapex Nov 21, 2024
7a93390
Adobt build for ubuntu-24.04
DwarKapex Nov 22, 2024
3f4efa5
Fix build for pax, t5x, gemma
DwarKapex Nov 22, 2024
b2eab65
Use master branch of TF-text
DwarKapex Nov 22, 2024
71ad68b
Fix gemma TF-text urls
DwarKapex Nov 22, 2024
0b452c4
Fix T5x build
DwarKapex Nov 25, 2024
62e7ed7
Address comments
DwarKapex Nov 26, 2024
beb4f82
Fix gemma build
DwarKapex Nov 27, 2024
3c2ec97
Clone airio
DwarKapex Nov 27, 2024
d279373
Merge remote-tracking branch 'origin/main' into vkozlov/move-to-ubunt…
DwarKapex Nov 27, 2024
173ddc5
Update maxtext docker
DwarKapex Nov 27, 2024
92996e3
Uninstall several packages and add PIP_BREAK_SYSTEM_PACKAGES=1 env var
DwarKapex Dec 2, 2024
8993deb
Uninstall several packages and add PIP_BREAK_SYSTEM_PACKAGES=1 env var
DwarKapex Dec 2, 2024
8c10287
Edit remove packages list
DwarKapex Dec 2, 2024
c75c825
Edit remove packages list
DwarKapex Dec 3, 2024
8468c9f
Edit remove packages list
DwarKapex Dec 3, 2024
008b3fc
[skip ci] Resurect amd64/arm64 dockerfiles
DwarKapex Dec 3, 2024
d633578
[skip ci] Resurect amd64/arm64 dockerfiles: fix whitespace error
DwarKapex Dec 3, 2024
81b50cc
[skip ci] Resurect amd64/arm64 dockerfiles: fix whitespace error
DwarKapex Dec 3, 2024
14c52be
Merge branch 'main' into vkozlov/move-to-ubuntu24.04
DwarKapex Dec 3, 2024
96c16a9
Add comment for pip install pip-23.3.1
DwarKapex Dec 3, 2024
8461c7a
Merge branch 'vkozlov/move-to-ubuntu24.04' of github.com:NVIDIA/JAX-T…
DwarKapex Dec 3, 2024
2c1ee0d
remove arch-specific Dockerfiles and add pointer to utopian versions
yhtang Dec 4, 2024
28 changes: 23 additions & 5 deletions .github/container/Dockerfile.base
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:1-labs
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu22.04
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu24.04
ARG GIT_USER_NAME="JAX Toolbox"
ARG [email protected]
ARG CLANG_VERSION=18
@@ -60,7 +60,8 @@ apt_packages=(
wget
jq
# llvm.sh
lsb-release software-properties-common
lsb-release
software-properties-common
# GCP autoconfig
pciutils hwloc bind9-host
)
@@ -74,8 +75,6 @@ apt-get install -y ${apt_packages[@]}

# Install LLVM/Clang
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" -- ${CLANG_VERSION}
apt-get remove -y software-properties-common lsb-release
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX

# Make sure that clang and clang++ point to the new version. This list is based
# on the symlinks installed by the `clang` (as opposed to `clang-14`) and `lld`
@@ -106,6 +105,21 @@ EOL

apt-get clean
rm -rf /var/lib/apt/lists/*

# Several python packages (listed below) are installed by the OS package
# manager (the `apt-get install` run above) and cannot be uninstalled with
# pip (in the pip-finalize.sh script) during JAX installation. Remove them in
# advance to avoid JAX installation issues.
remove_packages=(
python3-gi
software-properties-common
lsb-release
python3-yaml
python3-pygments
)

apt-get remove -y ${remove_packages[@]}
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX
EOF

RUN <<"EOF" bash -ex
@@ -129,7 +143,11 @@ git apply </opt/pip/pip-vcs-equivalency.patch
git add -u
git commit -m 'Adds JAX_TOOLBOX_VCS_EQUIVALENCY as a trigger to treat all github VCS installs for a package as equivalent. The spec of the last encountered version will be used'
EOF
RUN pip install --upgrade --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*

# install all python packages system-wide.
ENV PIP_BREAK_SYSTEM_PACKAGES=1
RUN pip install --upgrade --ignore-installed --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*


###############################################################################
## Install TCPx
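
Reviewer note: the package lists in this file are plain bash arrays. Comments are allowed between elements, and the array expands to one argument per element. A minimal sketch of that idiom follows; `show_args` is a hypothetical stand-in for `apt-get remove -y` so the snippet runs anywhere, and the shortened list is illustrative only.

```shell
# Hypothetical stand-in for `apt-get remove -y`: prints one argument per line.
show_args() {
  printf '%s\n' "$@"
}

# Same shape as remove_packages in Dockerfile.base (shortened here).
# bash ignores comment lines between array elements.
remove_packages=(
  python3-gi
  # installed earlier for llvm.sh
  software-properties-common
  lsb-release
)

# Quoting the expansion keeps each element a single argument.
show_args "${remove_packages[@]}"
```

The Dockerfile expands the array unquoted, which behaves the same for package names without spaces or glob characters. Quoting is simply the more defensive form.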
1 change: 0 additions & 1 deletion .github/container/Dockerfile.jax
@@ -36,7 +36,6 @@ RUN --mount=type=ssh \
--mount=type=secret,id=SSH_KNOWN_HOSTS,target=/root/.ssh/known_hosts \
<<"EOF" bash -ex
git-clone.sh ${URLREF_JAX} ${SRC_PATH_JAX}
sed 's/^numpy.*/numpy<2.0.0/' ${SRC_PATH_JAX}/build/requirements.in
git-clone.sh ${URLREF_XLA} ${SRC_PATH_XLA}
EOF
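
Reviewer note: the line this hunk appears to delete is the `sed` call (the scrape dropped the +/- markers, so this is an inference from the "1 deletion" count). That call had no `-i` flag, so it only printed the edited text to the build log and never modified `requirements.in`. Dropping it should therefore not change the build. A small sketch on a throwaway file shows the difference; the file contents are made up, and `sed -i` without a suffix assumes GNU sed as shipped in Ubuntu.

```shell
# Throwaway stand-in for build/requirements.in (contents are illustrative).
req="$(mktemp)"
printf 'numpy>=1.26\nscipy\n' > "$req"

# Without -i, sed writes the result to stdout and leaves the file alone.
preview="$(sed 's/^numpy.*/numpy<2.0.0/' "$req")"
untouched="$(cat "$req")"

# With -i (GNU sed), the file itself is rewritten.
sed -i 's/^numpy.*/numpy<2.0.0/' "$req"
rewritten="$(cat "$req")"

printf 'preview:\n%s\nfile before -i:\n%s\nfile after -i:\n%s\n' \
  "$preview" "$untouched" "$rewritten"
rm -f "$req"
```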

87 changes: 87 additions & 0 deletions .github/container/Dockerfile.maxtext
@@ -0,0 +1,87 @@
# syntax=docker/dockerfile:1-labs

ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_MAXTEXT=https://github.com/google/maxtext.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT=/opt/maxtext
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

###############################################################################
## build tensorflow-text and lingvo, which do not have working arm64 pip wheels
###############################################################################

ARG BASE_IMAGE
FROM ${BASE_IMAGE} as wheel-builder

#------------------------------------------------------------------------------
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT

RUN pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
RUN git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
RUN <<"EOF" bash -exu -o pipefail
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's undefined-version check strict by default
# (https://reviews.llvm.org/D135402), but the tftext build seems to rely on
# the old, lenient behavior.
echo "write_to_bazelrc \"build --linkopt='-Wl,--undefined-version'\"" >> oss_scripts/configure.sh

./oss_scripts/run_build.sh
EOF
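
Reviewer note: `sed -i` exits successfully even when its pattern matches nothing. If upstream ever rewords the anchor comment in `prepare_tf_dep.sh`, the commit pin above would silently stop applying. One possible guard is sketched below on a throwaway file; the helper name `pin_or_fail`, the shortened commit hash, and the file contents are hypothetical and not part of this PR.

```shell
# Hypothetical helper: apply a sed substitution, then fail if the expected
# replacement text is not present in the file afterwards.
# Assumes neither the anchor nor the replacement contains a '/'.
pin_or_fail() {
  file="$1"; anchor="$2"; replacement="$3"
  sed -i "s/${anchor}/${replacement}/" "$file"
  grep -qF -- "$replacement" "$file"
}

# Throwaway stand-in for oss_scripts/prepare_tf_dep.sh.
script="$(mktemp)"
printf '# Update TF dependency to installed tensorflow.\necho build\n' > "$script"

if pin_or_fail "$script" '# Update TF dependency to installed tensorflow.' 'commit_slug=6550e4bd'; then
  echo "pin applied"
fi

# An absent anchor leaves the file unchanged, and the helper returns non-zero.
if ! pin_or_fail "$script" '# no such anchor' 'commit_slug=deadbeef'; then
  echo "anchor missing, pin not applied"
fi
```

Inside a `bash -e` heredoc like the one above, a non-zero return from such a helper would stop the build at the point where the pin failed.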

###############################################################################
## Download source and add auxiliary scripts
###############################################################################

FROM ${BASE_IMAGE} as mealkit
ARG URLREF_MAXTEXT
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

# Preserve version information of tensorflow-text
COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/
RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-tools.d/requirements-maxtext.in

RUN <<"EOF" bash -ex
git-clone.sh ${URLREF_MAXTEXT} ${SRC_PATH_MAXTEXT}
echo "-r ${SRC_PATH_MAXTEXT}/requirements.txt" >> /opt/pip-tools.d/requirements-maxtext.in

# Specify some version constraints to speed up the build and keep pip from
# downloading and checking every available version of each package.
for pattern in \
"s|absl-py|absl-py>=2.1.0|g" \
"s|protobuf==3.20.3|protobuf>=3.19.0|g" \
"s|tensorflow-datasets|tensorflow-datasets>=4.8.0|g" \
; do
sed -i "${pattern}" ${SRC_PATH_MAXTEXT}/requirements.txt;
done
echo "tensorflow-metadata>=1.15.0" >> ${SRC_PATH_MAXTEXT}/requirements.txt
EOF
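
Reviewer note: to make the effect of the substitution loop concrete, the sketch below runs the same three patterns over a made-up `requirements.txt`. The sample contents are an assumption, and the real MaxText file differs. The `absl-py` and `tensorflow-datasets` patterns are unanchored, so they assume the upstream entries carry no version specifier of their own.

```shell
# Made-up stand-in for ${SRC_PATH_MAXTEXT}/requirements.txt.
reqs="$(mktemp)"
printf 'absl-py\nprotobuf==3.20.3\ntensorflow-datasets\njax\n' > "$reqs"

# Same patterns as in the Dockerfile: add lower bounds so the resolver
# has fewer candidate versions to consider.
for pattern in \
  "s|absl-py|absl-py>=2.1.0|g" \
  "s|protobuf==3.20.3|protobuf>=3.19.0|g" \
  "s|tensorflow-datasets|tensorflow-datasets>=4.8.0|g" \
; do
  sed -i "${pattern}" "$reqs"
done
echo "tensorflow-metadata>=1.15.0" >> "$reqs"

cat "$reqs"
```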

###############################################################################
## Add test script to the path
###############################################################################

ADD test-maxtext.sh /usr/local/bin

###############################################################################
## Install accumulated packages from the base image and the previous stage
###############################################################################

FROM mealkit as final

RUN pip-finalize.sh

WORKDIR ${SRC_PATH_MAXTEXT}
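
Reviewer note: the `echo "tensorflow-text @ file://$(ls ...)"` line in the mealkit stage relies on the glob matching exactly one wheel. With zero or several matches, the requirement line would be malformed. A hedged sketch of a stricter variant follows. It uses a hypothetical helper `wheel_requirement`, a scratch directory instead of `/opt`, and a made-up wheel file name.

```shell
# Hypothetical helper: print "<name> @ file://<path>" for the single wheel
# matching <dir>/<prefix>*.whl, or fail if there is not exactly one match.
wheel_requirement() {
  name="$1"; dir="$2"; prefix="$3"
  # Re-use the function's positional parameters to hold the glob matches.
  set -- "$dir"/"$prefix"*.whl
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one ${prefix}*.whl in ${dir}" >&2
    return 1
  fi
  echo "${name} @ file://$1"
}

# Scratch directory with one made-up wheel file.
scratch="$(mktemp -d)"
touch "$scratch/tensorflow_text-2.18.0-cp312-cp312-linux_x86_64.whl"

wheel_requirement tensorflow-text "$scratch" tensorflow_text
```

The printed line has the same `name @ file://path` shape that the Dockerfile appends to `/opt/pip-tools.d/requirements-maxtext.in`.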
2 changes: 1 addition & 1 deletion .github/container/Dockerfile.maxtext.amd64
@@ -31,4 +31,4 @@ FROM mealkit as final

RUN pip-finalize.sh

WORKDIR ${SRC_PATH_MAXTEXT}
WORKDIR ${SRC_PATH_MAXTEXT}
188 changes: 188 additions & 0 deletions .github/container/Dockerfile.pax
@@ -0,0 +1,188 @@
# syntax=docker/dockerfile:1-labs

ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_PAXML=https://github.com/google/paxml.git#main
ARG URLREF_PRAXIS=https://github.com/google/praxis.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG URLREF_LINGVO=https://github.com/tensorflow/lingvo.git#master
ARG SRC_PATH_PAXML=/opt/paxml
ARG SRC_PATH_PRAXIS=/opt/praxis
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text
ARG SRC_PATH_LINGVO=/opt/lingvo

###############################################################################
## build tensorflow-text and lingvo, which do not have working arm64 pip wheels
###############################################################################

ARG BASE_IMAGE
FROM ${BASE_IMAGE} as wheel-builder

#------------------------------------------------------------------------------
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT
RUN <<"EOF" bash -exu -o pipefail
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's undefined-version check strict by default
# (https://reviews.llvm.org/D135402), but the tftext build seems to rely on
# the old, lenient behavior.
echo "write_to_bazelrc \"build --linkopt='-Wl,--undefined-version'\"" >> oss_scripts/configure.sh

./oss_scripts/run_build.sh
EOF

#------------------------------------------------------------------------------
# build lingvo
#------------------------------------------------------------------------------

# Remove Lingvo build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as lingvo-builder
ARG URLREF_LINGVO
ARG SRC_PATH_TFTEXT
ARG SRC_PATH_LINGVO

# Preserve the version of tensorflow-text
COPY --from=tftext-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/

ENV USE_BAZEL_VERSION=7.1.2

# build lingvo
RUN <<"EOF" bash -exu -o pipefail
git-clone.sh ${URLREF_LINGVO} ${SRC_PATH_LINGVO}
pushd ${SRC_PATH_LINGVO}

CPU_ARCH="$(dpkg --print-architecture)"
if [[ "${CPU_ARCH}" == "arm64" ]]; then

# Use aarch distribution of protobufs
patch -p1 <<"EOFINNER"
diff --git a/lingvo/repo.bzl b/lingvo/repo.bzl
index ce65822d2..d9c0277aa 100644
--- a/lingvo/repo.bzl
+++ b/lingvo/repo.bzl
@@ -232,9 +232,9 @@ filegroup(
)
""",
urls = [
- "https://github.com/protocolbuffers/protobuf/releases/download/v21.9/protoc-21.9-linux-x86_64.zip",
+ "https://github.com/protocolbuffers/protobuf/releases/download/v21.9/protoc-21.9-linux-aarch_64.zip",
],
- sha256 = "3cd951aff8ce713b94cde55e12378f505f2b89d47bf080508cf77e3934f680b6",
+ sha256 = "a584286dfa8ebb17032ece206ed74d5e9931e2edb9016e427be2a0dab3b21071",
)

def icu():
EOFINNER

fi

pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0 /opt/tensorflow_text*.whl
for pattern in \
"s|tensorflow=|#tensorflow=|g" \
"s|tensorflow-text=|#tensorflow-text=|g" \
"s|dataclasses=|#dataclasses=|g" \
"s|==.*||g" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/docker/dev.requirements.txt
done
# Lingvo does not officially support python 3.12, so we patch its pinned
# dependencies and python_requires to be able to build for py-3.12.
for pattern in \
"s|tensorflow-text~=2.13.0|tensorflow-text~=2.18.0|g" \
"s|tensorflow~=2.13.0|tensorflow~=2.18.0|g" \
"s|python_requires='>=3.8,<3.11'|python_requires='>=3.8,<3.13'|" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/pip_package/setup.py;
done
pip install -r docker/dev.requirements.txt

# Some tests are flaky right now, so we skip running the tests.
BUILD_ARCH="x86_64"
if [[ "$CPU_ARCH" == "arm64" ]]; then
BUILD_ARCH="aarch64";
fi
sed -i 's/manylinux2014_x86_64/manylinux_2_38_'"${BUILD_ARCH}"'/' pip_package/build.sh
SKIP_TESTS=1 PYTHON_MINOR_VERSION=$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 2) pip_package/build.sh
EOF
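
Reviewer note: two small pieces of this build step are easy to check in isolation. One maps the Debian architecture name to the suffix used in the wheel platform tag, and the other extracts the Python minor version. The sketch below wraps both in hypothetical helper functions (`wheel_arch`, `python_minor`) and feeds them fixed strings instead of calling `dpkg` or `python`. The `PLAT=` sample line is made up.

```shell
# Hypothetical helper: map `dpkg --print-architecture` output to the
# architecture suffix used in the manylinux platform tag.
wheel_arch() {
  case "$1" in
    arm64) echo "aarch64" ;;
    *)     echo "x86_64" ;;
  esac
}

# Hypothetical helper: extract the minor version from `python --version`
# style output, using the same cut pipeline as the Dockerfile.
python_minor() {
  echo "$1" | cut -d ' ' -f 2 | cut -d '.' -f 2
}

wheel_arch arm64
wheel_arch amd64
python_minor "Python 3.12.3"

# The same tag rewrite that is applied to pip_package/build.sh, on a sample line.
echo 'PLAT=manylinux2014_x86_64' \
  | sed 's/manylinux2014_x86_64/manylinux_2_38_'"$(wheel_arch arm64)"'/'
```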

###############################################################################
## Pax for AArch64
###############################################################################

ARG BASE_IMAGE
FROM ${BASE_IMAGE} as mealkit
ARG URLREF_PAXML
ARG URLREF_PRAXIS
ARG SRC_PATH_PAXML
ARG SRC_PATH_PRAXIS
ARG SRC_PATH_TFTEXT

# Preserve version information of tensorflow-text and lingvo
COPY --from=lingvo-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=lingvo-builder /tmp/lingvo/dist/lingvo*-linux*.whl /opt/
RUN echo "lingvo @ file://$(ls /opt/lingvo*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/
RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

# paxml + praxis
RUN <<"EOF" bash -ex
echo "tensorflow_datasets==4.9.2" >> /opt/pip-tools.d/requirements-paxml.in
echo "auditwheel" >> /opt/pip-tools.d/requirements-paxml.in

git-clone.sh ${URLREF_PAXML} ${SRC_PATH_PAXML}
git-clone.sh ${URLREF_PRAXIS} ${SRC_PATH_PRAXIS}
echo "-e file://${SRC_PATH_PAXML}[gpu]" >> /opt/pip-tools.d/requirements-paxml.in
echo "-e file://${SRC_PATH_PRAXIS}" >> /opt/pip-tools.d/requirements-paxml.in

for src in ${SRC_PATH_PAXML} ${SRC_PATH_PRAXIS}; do
pushd ${src}

for pattern in \
"s| @ git+https://github.com/google/flax||g" \
"s| @ git+https://github.com/google/jax||g" \
"s| @ git+https://github.com/google/fiddle||g" \
"s|^tensorflow|#tensorflow|" \
"s|^lingvo|#lingvo|" \
"s|^scikit-learn|#scikit-learn|" \
"s|^protobuf|#protobuf|" \
"s|^numpy|#numpy|" \
"s|^orbax-checkpoint|#orbax-checkpoint|" \
"s| @ git+https://github.com/google/CommonLoopUtils||g" \
; do
sed -i "${pattern}" */pip_package/requirements.txt requirements.in
done

if git diff --quiet; then
echo "broken dependencies no longer present in ${src}"
exit 1
else
git commit -a -m "remove broken dependencies from ${src}"
fi
popd
done
sed -i 's/pysimdjson==[0-9.]*/pysimdjson/' ${SRC_PATH_PAXML}/setup.py
EOF

ADD test-pax.sh /usr/local/bin

###############################################################################
## Install accumulated packages from the base image and the previous stage
###############################################################################

FROM mealkit as final

RUN pip-finalize.sh
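
Reviewer note: the `git diff --quiet` check in the paxml/praxis loop acts as a canary. If none of the `sed` patterns changed anything, the upstream requirements have moved on, and the build stops so the stale patterns get cleaned up. A minimal sketch of that exit-status behaviour in a scratch repository follows. The file name and contents are made up, and it assumes `git` is on the PATH and GNU sed.

```shell
# Scratch repository with one tracked (staged) file.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
printf 'numpy\njax\n' > requirements.in
git add requirements.in

# Nothing differs between the working tree and the index: exit status 0.
if git diff --quiet; then
  before="clean"
else
  before="dirty"
fi

# Apply one of the substitutions, then ask again: exit status 1.
sed -i 's|^numpy|#numpy|' requirements.in
if git diff --quiet; then
  after="clean"
else
  after="dirty"
fi

echo "before sed: ${before}, after sed: ${after}"
```

In the Dockerfile the "clean" branch is the failure case (`exit 1`), which is the reverse of how such checks are usually read.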