
[CI/Build] Dockerfile build for ARM64 / GH200 #10499

Closed · wants to merge 4 commits

Conversation

drikster80 (Contributor) commented Nov 20, 2024

Updates the Dockerfile with $TARGETPLATFORM conditionals that will compile the necessary modules and extensions for aarch64 / ARM64 systems. This has been tested on the Nvidia GH200 platform.

Docker builds should use --platform "linux/arm64" to trigger the arm64 build process.

FIX #2021

Changes Overview:

  • Added a new requirements-cuda-arm64.txt that pulls the PyTorch nightly wheels compatible with ARM64+CUDA. This is temporary until those wheels reach a stable release (at which point the file can be removed).
  • Updated the existing requirements files so torch/torchvision are only installed when platform_machine != 'aarch64', i.e. they are skipped on aarch64 so the nightly builds can take their place (see the sketch after this list).
  • Used conditionals to decide whether to build specific modules that are not currently shipped as aarch64 wheels.
  • Updated the docs with notes and an example command.
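For illustration, the environment-marker change in the stock CUDA requirements file looks roughly like the lines below (exact package pins are omitted here and may differ from the actual diff):

    # requirements-cuda.txt (sketch): skip the x86_64 torch wheels on aarch64 so the
    # nightly ARM64+CUDA builds from requirements-cuda-arm64.txt can be installed instead
    torch; platform_machine != 'aarch64'
    torchvision; platform_machine != 'aarch64'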

The following command was used to build and was confirmed working on an Nvidia GH200:

# Build time: ~40 min
# Max memory usage: 180GB
sudo docker build . --target vllm-openai --platform "linux/arm64" \
    -t drikster80/vllm-gh200-openai:v0.6.4.post1 \
    --build-arg max_jobs=66 \
    --build-arg nvcc_threads=2 \
    --build-arg torch_cuda_arch_list="9.0+PTX" \
    --build-arg vllm_fa_cmake_gpu_arches="90-real" \
    --build-arg RUN_WHEEL_CHECK='false'

NOTE: The order in which requirements-cuda-arm64.txt is installed matters: it needs to stomp over the already-installed torch version that other modules depend on.

    if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        pip uninstall -y torch && \
        python3 -m pip install -r requirements-cuda-arm64.txt; \
    fi


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

mergify bot added the documentation and ci/build labels Nov 20, 2024
simon-mo self-assigned this Nov 20, 2024
drikster80 (Contributor Author)

Missed a sign-off on 1 commit, so rebased and force-pushed to pass the DCO check.

drikster80 (Contributor Author)

Noticed a bug where the flashinfer x86_64 wheel was not being installed by default. Since installing it was the previous default behavior on non-arm64 systems, I updated the conditional so that it always applies unless the target platform is 'linux/arm64'. A sketch of the inverted conditional is below.
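A minimal sketch of that inverted conditional (FLASHINFER_WHEEL stands in for the wheel spec already used in the Dockerfile, which is not reproduced here):

    # Sketch only: run the existing flashinfer install step on every platform except linux/arm64
    if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
        python3 -m pip install "${FLASHINFER_WHEEL}"; \
    fi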

if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
apt-get update && apt-get install zlib1g-dev && \
python3 -m pip install packaging pybind11 && \
git clone https://github.com/openai/triton && \
Member

can we directly use pytorch nightly as base image so that we don't need to build triton, etc?

Contributor Author

I'm confused. Triton doesn't provide aarch64 whl files, so we'll always need to compile it if we want to use the latest version: https://pypi.org/project/triton/#files

It's probably a good idea to pin to the latest release tag of triton instead of main, though. I'll update that.

My goal with this was to keep it as close as possible to the x86_64 implementation of vLLM, so I didn't want to use the nvidia pytorch container. That's what I was doing in the previous repo. Although it worked, it doubled the size of the final image (9.74 GB vs 4.89 GB).
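As a sketch, pinning triton to a release tag rather than main could look like this (TRITON_TAG is a placeholder, not an actual release name):

    # Sketch only: clone a fixed triton release tag instead of the default branch
    git clone --depth 1 --branch "${TRITON_TAG}" https://github.com/openai/triton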

Dockerfile Outdated
    RUN --mount=type=cache,target=/root/.cache/pip \
        --mount=type=bind,source=.git,target=.git \
        if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
            pip --verbose wheel --use-pep517 --no-deps -w /workspace/dist --no-build-isolation git+https://github.com/vllm-project/flash-attention.git ; \
Member

the vllm build already includes vllm-flash-attention

Contributor

I believe the torch version should be unpinned from the source in CMakeLists.txt, setup.py, and pyproject.toml.

Contributor Author

> the vllm build already includes vllm-flash-attention

Ah, good point. I'll remove that and test.

youkaichao (Member)

@drikster80 overall it makes sense to me, but we don't need to build so many things in the Docker image. Just using the default should be fine; it already comes with the flash-attention backend.

We don't need to build flashinfer / bitsandbytes / triton.

requirements-cuda-arm64.txt (new file):

    @@ -0,0 +1,3 @@
    --index-url https://download.pytorch.org/whl/nightly/cu124
    torchvision; platform_machine == 'aarch64'
    torch; platform_machine == 'aarch64'
Contributor

You can add xformers for aarch64 to the /vllm-project directory, similar to flash-attention, for the aarch64 build until an upstream pip package is available.
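A sketch of that suggestion, following the same wheel-build pattern used for flash-attention elsewhere in this Dockerfile (the upstream xformers repository is shown here; the comment actually proposes a vllm-project fork, whose URL is not given):

    # Sketch only (suggested, not part of this PR): build an aarch64 xformers wheel the same
    # way the vllm-project flash-attention wheel is built above
    if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        pip --verbose wheel --use-pep517 --no-deps -w /workspace/dist --no-build-isolation \
            git+https://github.com/facebookresearch/xformers.git ; \
    fi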

drikster80 (Contributor Author)

> @drikster80 overall it makes sense to me, but we don't need to build so many things in the Docker image. Just using the default should be fine; it already comes with the flash-attention backend.
>
> We don't need to build flashinfer / bitsandbytes / triton.

None of these ship aarch64 whls. When you say "use the default", are these all built into vllm as well? When I attempt to run the container without building these, it fails.

youkaichao (Member)

The goal here is to have a runnable image for vLLM on arm64 / GH200. We don't need to have full features here. Since the community is not fully ready for arm64, it would be a maintenance disaster if we built so many things here ourselves. If a library does not support arm64, people should reach out to that library and ask it to become compatible with arm64.

That's why I want to use the pytorch nightly docker image directly. Docker image size is not my concern.

> My goal with this was to keep it as close as possible to the x86_64 implementation of vLLM

This is not my goal. The first step is being able to run vllm serve meta-llama/Llama-3.1-8B on GH200; that's good enough.
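For reference, that smoke test against the image built earlier in this PR might look like the sketch below (the image tag comes from the build command above; the port, cache mount, and token variable are illustrative):

    # Sketch only: serve Llama-3.1-8B from the arm64 image on a GH200 node
    # HF_TOKEN is only needed because the Llama weights are gated on Hugging Face
    sudo docker run --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
        -p 8000:8000 \
        drikster80/vllm-gh200-openai:v0.6.4.post1 \
        --model meta-llama/Llama-3.1-8B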

drikster80 (Contributor Author)

> The goal here is to have a runnable image for vLLM on arm64 / GH200. We don't need to have full features here. Since the community is not fully ready for arm64, it would be a maintenance disaster if we built so many things here ourselves. If a library does not support arm64, people should reach out to that library and ask it to become compatible with arm64.
>
> That's why I want to use the pytorch nightly docker image directly. Docker image size is not my concern.
>
> > My goal with this was to keep it as close as possible to the x86_64 implementation of vLLM
>
> This is not my goal. The first step is being able to run vllm serve meta-llama/Llama-3.1-8B on GH200; that's good enough.

Okay, it sounds like our goals just weren't aligned. I agree it could become a maintainability issue this way. FWIW, it looks like the other libraries do support ARM64 but don't provide whls for it on PyPI (probably due to GitHub Actions limitations). I'll create tickets on the other repos requesting that aarch64 whls be built/provided.

I had originally moved away from using the nvidia-pytorch container because they were slower at updating torch than vLLM was. It looks like they just came out with a version compatible with torch v2.6, so I can try to use that version.

In the meantime, I'll continue maintaining the fork and hosting a full-featured version on my Docker Hub that matches the releases of vLLM.

youkaichao (Member)

> I had originally moved away from using the nvidia-pytorch container because they were slower at updating torch than vLLM was.

We don't need the nvidia-pytorch container. A basic nvidia container is good enough, and we can just install the nightly pytorch wheels (a minimal sketch follows this comment).

> In the meantime, I'll continue maintaining the fork and hosting a full-featured version on my Docker Hub that matches the releases of vLLM.

Thanks for your efforts! For this PR, let's get the basic support first 👍
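A minimal sketch of that approach, assuming a hypothetical CUDA base-image tag (the nightly index URL is the one already used by requirements-cuda-arm64.txt in this PR):

    # Sketch only, not from this PR: start from a plain NVIDIA CUDA base image and install
    # the nightly ARM64+CUDA torch wheels from the PyTorch nightly index
    FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
    RUN apt-get update && apt-get install -y python3 python3-pip
    RUN python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu124 \
        torch torchvision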

cennn added a commit to cennn/vllm that referenced this pull request Dec 14, 2024
[CI/Build] Dockerfile build for ARM64 / GH200 vllm-project#10499 by cenzhiyao
youkaichao (Member)

Closing, as #11212 has been merged. @drikster80 thanks for your efforts! Please continue to maintain your full-featured branch.

Labels: ci/build, documentation (Improvements or additions to documentation)
Successfully merging this pull request may close these issues:
  • ARM aarch-64 server build failed (host OS: Ubuntu 22.04.3)
4 participants