Enabling CI for AMD with new runner.. #2034

Closed · wants to merge 50 commits

Commits (changes shown from all 50 commits)
512ed5c
Enabling CI for AMD with new runner..
Narsil Jun 6, 2024
81704d2
Putting the fix for vllm for CI.
Narsil Jun 6, 2024
fa05db2
Fix integration-tests config for docker runt.
Narsil Jun 6, 2024
9376648
Checkout.
Narsil Jun 7, 2024
724fa6f
AMD CI.
Narsil Jun 7, 2024
97af55b
Inject slugs
Narsil Jun 7, 2024
c8128c7
Let's iterate a bit faster.
Narsil Jun 7, 2024
c73355b
Merge branch 'main' into ci_amd2
Narsil Jun 7, 2024
9101b2a
Fix.
Narsil Jun 7, 2024
3684439
Trying new split of tasks.
Narsil Jun 7, 2024
3ee92eb
?
Narsil Jun 7, 2024
f29371e
Naming.
Narsil Jun 7, 2024
3a8e9c2
Rename for everyone.
Narsil Jun 7, 2024
11c75f3
I hate this.
Narsil Jun 7, 2024
54e3340
gh..
Narsil Jun 7, 2024
6f31175
Give us sanitation tools already.
Narsil Jun 7, 2024
8712a36
Flying blind feels nice.
Narsil Jun 7, 2024
a759e2e
Not hitting myself against the wall.
Narsil Jun 7, 2024
e6a4dbe
I'm certainly not a monkey.
Narsil Jun 7, 2024
aea77a8
Banana.
Narsil Jun 7, 2024
81ddb9d
Please let me out!
Narsil Jun 7, 2024
043de74
**Feigns death**
Narsil Jun 7, 2024
8205962
Ahah, I see an exit.
Narsil Jun 7, 2024
078fb55
Abbé Faria?
Narsil Jun 7, 2024
1e759f9
Wat?
Narsil Jun 7, 2024
cc7c2fd
runs on.
Narsil Jun 7, 2024
1f42489
Come on GH, dash, underscore, who cares at this point.
Narsil Jun 7, 2024
b10ba92
...
Narsil Jun 7, 2024
2a314fa
Bash in bash.
Narsil Jun 7, 2024
19f6327
esac. Great idea, dev of the past.
Narsil Jun 7, 2024
87df3d5
?
Narsil Jun 7, 2024
5e769ce
?
Narsil Jun 7, 2024
a045ead
.
Narsil Jun 7, 2024
c6fa954
Test.
Narsil Jun 7, 2024
e79c83d
Attempt #727.
Narsil Jun 7, 2024
eda299b
.
Narsil Jun 7, 2024
65b2efc
.
Narsil Jun 7, 2024
fc4404d
.
Narsil Jun 7, 2024
741ab87
fromJSON
Narsil Jun 7, 2024
66e5983
.
Narsil Jun 7, 2024
98d3830
Extra spaces?
Narsil Jun 7, 2024
fa3e811
No fromJSON.
Narsil Jun 7, 2024
909e656
.
Narsil Jun 8, 2024
d9f704a
Are we done?
Narsil Jun 8, 2024
8be9c19
Is this it?
Narsil Jun 8, 2024
e62c51d
Here we go again.
Narsil Jun 8, 2024
452d442
We need Tailscale.
Narsil Jun 8, 2024
0ced5fa
Fix.
Narsil Jun 8, 2024
eec6c32
.
Narsil Jun 8, 2024
41699e9
.
Narsil Jun 8, 2024
File renamed without changes.
150 changes: 95 additions & 55 deletions .github/workflows/build.yaml
@@ -1,46 +1,29 @@
name: Build and push docker image to internal registry

on:
workflow_dispatch:
push:
branches:
- 'main'
tags:
- 'v*'
pull_request:
paths:
- ".github/workflows/build.yaml"
- "integration-tests/**"
- "server/**"
- "proto/**"
- "router/**"
- "launcher/**"
- "Cargo.lock"
- "rust-toolchain.toml"
- "Dockerfile"
- "Dockerfile_amd"
- "Dockerfile_intel"
branches:
- 'main'
workflow_call:
inputs:
hardware:
type: string
description: Hardware
# options:
# - cuda
# - rocm
# - intel
required: true

jobs:
build-and-push-image:
build-and-push:
outputs:
docker_image: ${{ steps.final.outputs.docker_image }}
docker_devices: ${{ steps.final.outputs.docker_devices }}
runs_on: ${{ steps.final.outputs.runs_on }}
label: ${{ steps.final.outputs.label }}
concurrency:
group: ${{ github.workflow }}-build-and-push-image-${{ matrix.name }}-${{ github.head_ref || github.run_id }}
group: ${{ github.workflow }}-build-and-push-image-${{ inputs.hardware }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
# TODO see with @Glegendre to get CPU runner here instead
runs-on: [self-hosted, nvidia-gpu , multi-gpu, 4-a10, ci]
strategy:
matrix:
include:
- name: "cuda"
label: ""
dockerfile: "Dockerfile"
- name: "amd"
label: "-rocm"
dockerfile: "Dockerfile_amd"
- name: "intel"
label: "-intel"
dockerfile: "Dockerfile_intel"
permissions:
contents: write
packages: write
@@ -50,33 +33,67 @@ jobs:
security-events: write
steps:
- name: Checkout repository
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Initialize Docker Buildx
uses: docker/setup-buildx-action@v2.0.0
uses: docker/setup-buildx-action@v3
with:
install: true
- name: Inject slug/short variables
uses: rlespinasse/[email protected]
- name: Construct hardware variables
shell: bash
run: |
case ${{ inputs.hardware }} in
cuda)
export dockerfile="Dockerfile"
export label_extension=""
export docker_devices=""
export runs_on="nvidia-gpu"
;;
rocm)
export dockerfile="Dockerfile_amd"
export label_extension="-rocm"
export docker_devices="/dev/kfd,/dev/dri"
# TODO Re-enable when they pass.
# export runs_on="amd-gpu-tgi"
export runs_on="ubuntu-latest"
;;
intel)
export dockerfile="Dockerfile_intel"
export label_extension="-intel"
export docker_devices=""
export runs_on="ubuntu-latest"
;;
esac
echo $dockerfile
echo "Dockerfile=${dockerfile}"
echo $label_extension
echo $docker_devices
echo $runs_on
echo "DOCKERFILE=${dockerfile}" >> $GITHUB_ENV
echo "LABEL=${label_extension}" >> $GITHUB_ENV
echo "DOCKER_DEVICES=${docker_devices}" >> $GITHUB_ENV
echo "RUNS_ON=${runs_on}" >> $GITHUB_ENV
- name: Tailscale
uses: huggingface/tailscale-action@main
with:
authkey: ${{ secrets.TAILSCALE_AUTHKEY }}
- name: Login to GitHub Container Registry
if: github.event_name != 'pull_request'
uses: docker/login-action@v2
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Login to internal Container Registry
uses: docker/login-action@v2.1.0
uses: docker/login-action@v3
with:
username: ${{ secrets.TAILSCALE_DOCKER_USERNAME }}
password: ${{ secrets.TAILSCALE_DOCKER_PASSWORD }}
registry: registry.internal.huggingface.tech
- name: Login to Azure Container Registry
if: github.event_name != 'pull_request'
uses: docker/login-action@v2.1.0
uses: docker/login-action@v3
with:
username: ${{ secrets.AZURE_DOCKER_USERNAME }}
password: ${{ secrets.AZURE_DOCKER_PASSWORD }}
@@ -85,12 +102,12 @@ jobs:
- name: Extract metadata (tags, labels) for Docker
if: ${{ github.event_name == 'pull_request' }}
id: meta-pr
uses: docker/metadata-action@v4.3.0
uses: docker/metadata-action@v5
with:
images: |
registry.internal.huggingface.tech/api-inference/community/text-generation-inference
tags: |
type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ matrix.label }}
type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
# If main, release or tag
- name: Extract metadata (tags, labels) for Docker
if: ${{ github.event_name != 'pull_request' }}
@@ -104,38 +121,61 @@ jobs:
ghcr.io/huggingface/text-generation-inference
db4c2190dd824d1f950f5d1555fbadf0.azurecr.io/text-generation-inference
tags: |
type=semver,pattern={{version}}${{ matrix.label }}
type=semver,pattern={{major}}.{{minor}}${{ matrix.label }}
type=raw,value=latest${{ matrix.label }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ matrix.label }}
type=semver,pattern={{version}}${{ env.LABEL }}
type=semver,pattern={{major}}.{{minor}}${{ env.LABEL }}
type=raw,value=latest${{ env.LABEL }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
- name: Build and push Docker image
id: build-and-push
uses: docker/build-push-action@v4
with:
context: .
file: ${{ matrix.dockerfile }}
file: ${{ env.DOCKERFILE }}
push: true
platforms: 'linux/amd64'
build-args: |
GIT_SHA=${{ env.GITHUB_SHA }}
DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ matrix.label }}
DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
tags: ${{ steps.meta.outputs.tags || steps.meta-pr.outputs.tags }}
labels: ${{ steps.meta.outputs.labels || steps.meta-pr.outputs.labels }}
cache-from: type=registry,ref=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:cache${{ matrix.label }},mode=min
cache-to: type=registry,ref=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:cache${{ matrix.label }},mode=min
cache-from: type=registry,ref=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:cache${{ env.LABEL }},mode=min
cache-to: type=registry,ref=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:cache${{ env.LABEL }},mode=min
- name: Final
id: final
run: |
echo "docker_image=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
echo "docker_devices=${{ env.DOCKER_DEVICES }}" >> "$GITHUB_OUTPUT"
echo "runs_on=${{ env.RUNS_ON }}" >> "$GITHUB_OUTPUT"
echo "label=${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
integration_tests:
concurrency:
group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
needs: build-and-push
runs-on: ["self-hosted", "${{ needs.build-and-push.outputs.runs_on }}", "multi-gpu"]
if: needs.build-and-push.outputs.runs_on != 'ubuntu-latest'
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Inject slug/short variables
uses: rlespinasse/[email protected]
- name: Set up Python
if: matrix.name == 'cuda'
uses: actions/setup-python@v4
with:
python-version: 3.9
python-version: "3.10"
- name: Install
if: matrix.name == 'cuda'
run: |
make install-integration-tests
- name: Tailscale
uses: huggingface/tailscale-action@main
if: needs.build-and-push.outputs.runs_on != 'amd-gpu-tgi'
with:
authkey: ${{ secrets.TAILSCALE_AUTHKEY }}
- name: Run tests
if: matrix.name == 'cuda'
run: |
export DOCKER_VOLUME=/mnt/cache
export DOCKER_IMAGE=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-${{ env.GITHUB_SHA_SHORT }}
export DOCKER_IMAGE=${{ needs.build-and-push.outputs.docker_image }}
export DOCKER_DEVICES=${{ needs.build-and-push.outputs.docker_devices }}
export HUGGING_FACE_HUB_TOKEN=${{ secrets.HUGGING_FACE_HUB_TOKEN }}
echo $DOCKER_IMAGE
pytest -s -vv integration-tests
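The core mechanism in this workflow is the "Construct hardware variables" step: a bash case on the hardware input picks a Dockerfile, image label suffix, device list, and runner, persists them through $GITHUB_ENV, and the "Final" step re-exports them via $GITHUB_OUTPUT so the integration_tests job can read them through needs. A minimal, self-contained sketch of that plumbing follows; the input name "target" and the job/step ids are illustrative, not taken from this PR:

name: hardware-dispatch-sketch

on:
  workflow_call:
    inputs:
      target:   # illustrative; build.yaml above calls this "hardware"
        type: string
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      # Step outputs must be re-exported here to become visible to other jobs.
      runs_on: ${{ steps.final.outputs.runs_on }}
    steps:
      - name: Map target to build settings
        shell: bash
        run: |
          case "${{ inputs.target }}" in
            cuda)  runs_on="nvidia-gpu" ;;
            rocm)  runs_on="amd-gpu-tgi" ;;
            *)     runs_on="ubuntu-latest" ;;
          esac
          # Writing to $GITHUB_ENV persists the value for later steps in this job.
          echo "RUNS_ON=${runs_on}" >> "$GITHUB_ENV"
      - name: Final
        id: final
        run: |
          # $GITHUB_OUTPUT promotes the env var to a step output.
          echo "runs_on=${{ env.RUNS_ON }}" >> "$GITHUB_OUTPUT"
  integration:
    needs: build
    # Dynamic runner selection from the build job's output, as integration_tests does above.
    runs-on: ${{ needs.build.outputs.runs_on }}
    steps:
      - run: echo "testing on ${{ needs.build.outputs.runs_on }}"

$GITHUB_ENV only makes a value visible to later steps of the same job; it takes the $GITHUB_OUTPUT write plus the job-level outputs mapping to cross the job boundary.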
19 changes: 19 additions & 0 deletions .github/workflows/build_pr_documentation.yaml
@@ -0,0 +1,19 @@
name: Build PR Documentation

on:
pull_request:
paths:
- "docs/source/**"

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yaml@main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: text-generation-inference
additional_args: --not_python_module
36 changes: 36 additions & 0 deletions .github/workflows/ci_build.yaml
@@ -0,0 +1,36 @@
name: CI build

on:
push:
branches:
- 'main'
tags:
- 'v*'
pull_request:
paths:
- ".github/workflows/build.yaml"
- "integration-tests/**"
- "server/**"
- "proto/**"
- "router/**"
- "launcher/**"
- "Cargo.lock"
- "rust-toolchain.toml"
- "Dockerfile"
- "Dockerfile_amd"
- "Dockerfile_intel"
branches:
- 'main'

jobs:
build:
strategy:
# super important if you want to see all results, even if one fails
# fail-fast is true by default
fail-fast: false
matrix:
hardware: ["cuda", "rocm", "intel"]
uses: ./.github/workflows/build.yaml # calls the one above ^
with:
hardware: ${{ matrix.hardware }}
secrets: inherit
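Because fail-fast defaults to true, one failing hardware target would cancel the other two; the explicit fail-fast: false keeps all three builds running to completion. The matrix call expands to the equivalent of three explicit reusable-workflow invocations, sketched here with illustrative job names:

jobs:
  build-cuda:
    uses: ./.github/workflows/build.yaml
    with:
      hardware: "cuda"
    secrets: inherit
  build-rocm:
    uses: ./.github/workflows/build.yaml
    with:
      hardware: "rocm"
    secrets: inherit
  build-intel:
    uses: ./.github/workflows/build.yaml
    with:
      hardware: "intel"
    secrets: inherit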
41 changes: 41 additions & 0 deletions .github/workflows/integration_tests.yaml
@@ -0,0 +1,41 @@
name: Integration tests

on:
workflow_call:
inputs:
docker_image:
type: string
description: Hardware
required: true
docker_devices:
type: string
description: Hardware
runs_on:
type: string
required: true
description: Hardware to run integration tests
jobs:
integration_tests:
concurrency:
group: ${{ github.workflow }}-${{ github.job }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
runs-on: ${{ inputs.runs_on }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Inject slug/short variables
uses: rlespinasse/[email protected]
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.9
- name: Install
run: |
make install-integration-tests
- name: Run tests
run: |
export DOCKER_VOLUME=/mnt/cache
export DOCKER_IMAGE=${{ inputs.docker_image }}
export DOCKER_DEVICES=${{ inputs.docker_devices }}
export HUGGING_FACE_HUB_TOKEN=${{ secrets.HUGGING_FACE_HUB_TOKEN }}
pytest -s -vv integration-tests
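Nothing in this diff invokes this reusable workflow yet; the integration tests run inside build.yaml itself. Chaining it from a top-level caller would additionally require build.yaml to promote its job outputs to workflow-level outputs, since job outputs are not visible across reusable-workflow boundaries. A hypothetical sketch of both halves (neither block is part of this PR):

# Hypothetical addition to build.yaml: promote job outputs to workflow outputs.
on:
  workflow_call:
    inputs:
      hardware:
        type: string
        required: true
    outputs:
      docker_image:
        value: ${{ jobs.build-and-push.outputs.docker_image }}
      docker_devices:
        value: ${{ jobs.build-and-push.outputs.docker_devices }}
      runs_on:
        value: ${{ jobs.build-and-push.outputs.runs_on }}

# Hypothetical top-level caller chaining build and tests:
jobs:
  build:
    uses: ./.github/workflows/build.yaml
    with:
      hardware: "rocm"
    secrets: inherit
  integration-tests:
    needs: build
    uses: ./.github/workflows/integration_tests.yaml
    with:
      docker_image: ${{ needs.build.outputs.docker_image }}
      docker_devices: ${{ needs.build.outputs.docker_devices }}
      runs_on: ${{ needs.build.outputs.runs_on }}
    secrets: inherit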
File renamed without changes.
18 changes: 15 additions & 3 deletions integration-tests/conftest.py
@@ -34,6 +34,7 @@
DOCKER_IMAGE = os.getenv("DOCKER_IMAGE", None)
HUGGING_FACE_HUB_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN", None)
DOCKER_VOLUME = os.getenv("DOCKER_VOLUME", "/data")
DOCKER_DEVICES = os.getenv("DOCKER_DEVICES")


class ResponseComparator(JSONSnapshotExtension):
@@ -453,16 +454,27 @@ def docker_launcher(
if DOCKER_VOLUME:
volumes = [f"{DOCKER_VOLUME}:/data"]

if DOCKER_DEVICES:
devices = DOCKER_DEVICES.split(",")
visible = os.getenv("ROCR_VISIBLE_DEVICES")
if visible:
env["ROCR_VISIBLE_DEVICES"] = visible
device_requests = []
else:
devices = []
device_requests = [
docker.types.DeviceRequest(count=gpu_count, capabilities=[["gpu"]])
]

container = client.containers.run(
DOCKER_IMAGE,
command=args,
name=container_name,
environment=env,
auto_remove=False,
detach=True,
device_requests=[
docker.types.DeviceRequest(count=gpu_count, capabilities=[["gpu"]])
],
device_requests=device_requests,
devices=devices,
volumes=volumes,
ports={"80/tcp": port},
shm_size="1G",
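The conftest.py change branches on DOCKER_DEVICES: when it is set (the ROCm path), the listed device nodes are passed straight through to the container and no NVIDIA DeviceRequest is issued; ROCR_VISIBLE_DEVICES, if present on the host, is forwarded into the container's environment. A hypothetical workflow step showing how a ROCm runner would drive this path (the image tag and GPU indices are illustrative):

- name: Run ROCm integration tests
  run: |
    # Mount the AMD device nodes directly; conftest.py skips DeviceRequest when this is set.
    export DOCKER_DEVICES=/dev/kfd,/dev/dri
    # Optionally restrict which GPUs the container sees (forwarded by conftest.py).
    export ROCR_VISIBLE_DEVICES=0,1
    export DOCKER_VOLUME=/mnt/cache
    export DOCKER_IMAGE=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-0123abc-rocm  # illustrative tag
    pytest -s -vv integration-tests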
2 changes: 1 addition & 1 deletion server/Makefile-vllm
@@ -1,5 +1,5 @@
commit_cuda := b5dfc61db88a81069e45b44f7cc99bd9e62a60fa
commit_rocm := ca6913b3c2ffacdcb7d15e914dc34adbc6c89479
commit_rocm := 559200c1a028de990c1ddea761b0ccd62109e3a0
build-vllm-cuda:
if [ ! -d 'vllm' ]; then \
pip install -U ninja packaging --no-cache-dir && \