Add RayService vLLM TPU Inference script #1467

ryanaoleary · 2024-09-25T00:05:52Z

Description

This PR adds a simple inference script to be used for a Ray multi-host TPU example serving Meta-Llama-3-70B. Similar to the other scripts in the /llm/ folder, serve_tpu.py builds a serve deployment for vLLM, which can then be queried with text prompts to generate output. This script will be used as part of a tutorial in the GKE and Ray docs.

Tasks

The contributing guide has been read and followed.
The samples added / modified have been fully tested.
Workflow files have been added / modified, if applicable.
Region tags have been properly added, if new samples.
All dependencies are set to up-to-date versions, as applicable.
Merge this pull-request for me once it is approved.

Signed-off-by: Ryan O'Leary <[email protected]> bug fixes Signed-off-by: Ryan O'Leary <[email protected]> remove extra ray init Signed-off-by: Ryan O'Leary <[email protected]> Read hf token from os Signed-off-by: Ryan O'Leary <[email protected]> Fix bugs Signed-off-by: Ryan O'Leary <[email protected]> Remove hf token logic Signed-off-by: Ryan O'Leary <[email protected]> Fix serve script Signed-off-by: Ryan O'Leary <[email protected]>

andrewsykim · 2024-09-25T15:09:58Z

Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

ai-ml/gke-ray/rayserve/llm/serve_tpu.py

andrewsykim · 2024-09-25T15:13:00Z

ai-ml/gke-ray/rayserve/llm/serve_tpu.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# NOTE: this file was inspired from: https://github.com/richardsliu/vllm/blob/rayserve/examples/rayserve_tpu.py


@richardsliu can we get this example merged into the vllm repo?

I opened this one a while back: vllm-project/vllm#8038

I'll ping them again on it.

ai-ml/gke-ray/rayserve/llm/serve_tpu.py

Signed-off-by: Ryan O'Leary <[email protected]>

snippet-bot · 2024-09-27T09:57:26Z

Here is the summary of changes.

You are about to add 4 region tags.

ai-ml/gke-ray/rayserve/llm/tpu/ray-cluster.tpu-v5e-multihost.yaml:15, tag gke_ai_ml_gke_ray_rayserve_tpu_raycluster_v5e_multihost
ai-ml/gke-ray/rayserve/llm/tpu/ray-cluster.tpu-v6e-multihost.yaml:15, tag gke_ai_ml_gke_ray_rayserve_tpu_raycluster_v6e_multihost
ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v5e-multihost.yaml:15, tag gke_ai_ml_gke_ray_rayserve_tpu_rayservice_v5e_multihost
ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v6e-multihost.yaml:15, tag gke_ai_ml_gke_ray_rayserve_tpu_rayservice_v6e_multihost

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

Refresh this comment

ryanaoleary · 2024-09-27T10:00:25Z

Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

Yeah that sounds good. I'm still testing out the 405B RayService, but I added the 8B and 70B ones in fe6440c, we can then use envsubst to replace the image var.

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary · 2024-10-02T05:14:36Z

I've tried running LLama-3.1-405B with TPU slice sizes up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

As slice sizes grow larger, the amount of time needed for vLLM initialization and memory profiling grows incredibly large
Attempting to run inference with smaller topologies than the aforementioned slice sizes leads to errors like RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 20.44G of 15.75G hbm. Exceeded hbm capacity by 4.70G., since the TPUs only have 16 Gi and 32 Gi for v5e and v4 TPUs respectively. Relatively small HBM capacity (compared to GPUs) means that we need much larger slice sizes to fit the sharded weights.
Even when loading a model that will be sharded (i.e. with tensor-parallelism > 1), vLLM still downloads the entire model on each worker, only afterwards storing the relevant weights in new files to each worker. This means that larger slice sizes will require an extremely high amount of total disk space when loading large models.
Larger multi-host slice sizes lead to ValueError: Too large swap space. errors, where vLLM attempts to allocate more than the total amount of available CPU memory to the swap space. I've gotten around this error by simply setting swap_space=0 in the vLLM EngineArgs, but I'm worried this slows down the model loading.
vLLM lacks support for running multiple multi-host TPU slices (i.e. just with specifying pipeline-parallelism > 1)
vLLM TPU backend lacks support for loading quantized models

If the user has sufficient quota for TPU chips and SSD in their region, a v4 4x4x8 or v5e 8x16 are large enough to run multi-host inference with Llama-3.1-405B. However, I'm wondering whether I'm missing anything obvious here (with the current amount of TPU support in vLLM) that could allow us to a). load the model faster and b). require less disk space when initializing the model.

cc: @richardsliu @andrewsykim

Signed-off-by: Ryan O'Leary <[email protected]>

ai-ml/gke-ray/rayserve/llm/llama-3-8b-it/ray-cluster-v5e-tpu.yaml

ai-ml/gke-ray/rayserve/llm/llama-3.1-70b/ray-cluster-v4-tpu.yaml

Signed-off-by: Ryan O'Leary <[email protected]>

Signed-off-by: ryanaoleary <[email protected]>

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary requested review from alizaidis, yoshi-approver and a team as code owners September 25, 2024 00:05

ryanaoleary force-pushed the multihost-example branch 2 times, most recently from 3ff9092 to 717b6ef Compare September 25, 2024 00:24

ryanaoleary force-pushed the multihost-example branch from 717b6ef to dfd04dd Compare September 25, 2024 00:33

andrewsykim reviewed Sep 25, 2024

View reviewed changes

ai-ml/gke-ray/rayserve/llm/serve_tpu.py Outdated Show resolved Hide resolved

ryanaoleary mentioned this pull request Sep 27, 2024

[Doc] RayServe Single-Host TPU v6e Example with vLLM ray-project/ray#47814

Open

8 tasks

ryanaoleary added 2 commits September 27, 2024 08:11

Fix inference script

335e75c

Signed-off-by: Ryan O'Leary <[email protected]>

Add RayCluster and RayService CRs

fe6440c

Signed-off-by: Ryan O'Leary <[email protected]>

Fix working_dir link

ee804fd

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary and others added 4 commits October 2, 2024 05:18

Merge branch 'main' into multihost-example

94e0348

Set max model length

37f0280

Signed-off-by: Ryan O'Leary <[email protected]>

pass in max model len as env var

11c7ee6

Signed-off-by: Ryan O'Leary <[email protected]>

Add Ray CR manifests and fix inference script

deeee56

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary requested a review from andrewsykim October 7, 2024 19:21

Fix model id

b12ffb7

Signed-off-by: Ryan O'Leary <[email protected]>

andrewsykim reviewed Oct 7, 2024

View reviewed changes

ai-ml/gke-ray/rayserve/llm/llama-3-8b-it/ray-cluster-v5e-tpu.yaml Outdated Show resolved Hide resolved

ai-ml/gke-ray/rayserve/llm/llama-3.1-70b/ray-cluster-v4-tpu.yaml Outdated Show resolved Hide resolved

ryanaoleary added 6 commits October 7, 2024 21:33

Don't specify mxla and slicebuilder ports

cbce283

Signed-off-by: Ryan O'Leary <[email protected]>

Pass env vars to runtime env

aa21d2c

Signed-off-by: Ryan O'Leary <[email protected]>

Add model composition example

5b42cb8

Signed-off-by: Ryan O'Leary <[email protected]>

Update name

ced331d

Signed-off-by: Ryan O'Leary <[email protected]>

Support passing TPU_HEADS as env var

9e9d466

Signed-off-by: Ryan O'Leary <[email protected]>

Update RayService TPU CR

06ed4fc

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary requested a review from andrewsykim October 17, 2024 01:10

Merge branch 'main' into multihost-example

c6de351

ryanaoleary mentioned this pull request Nov 7, 2024

Add Single-Host v5e and v6e RayService Samples for vLLM Guide #1514

Merged

6 tasks

ryanaoleary added 3 commits November 14, 2024 01:43

Rescope PR to v5e and v6e for Llama-3.1-405B

41ed84a

Signed-off-by: ryanaoleary <[email protected]>

Add v5e and v6e RayServices

53bbfeb

Signed-off-by: ryanaoleary <[email protected]>

Fix serve_tpu script

44f3a8d

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary requested a review from moficodes as a code owner January 22, 2025 13:02

ryanaoleary added 8 commits January 22, 2025 13:05

Fix caps

15760fd

Signed-off-by: Ryan O'Leary <[email protected]>

Restructure constructor args

29ddacf

Signed-off-by: Ryan O'Leary <[email protected]>

Remove download dir

b731893

Signed-off-by: Ryan O'Leary <[email protected]>

add config_format arg

eff29eb

Signed-off-by: Ryan O'Leary <[email protected]>

Revert serve_tpu

f57aad8

Signed-off-by: Ryan O'Leary <[email protected]>

Pass as env vars

1023c6e

Signed-off-by: Ryan O'Leary <[email protected]>

Add download dir

b6d0d0d

Signed-off-by: Ryan O'Leary <[email protected]>

Remove download dir

16c13f9

Signed-off-by: Ryan O'Leary <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RayService vLLM TPU Inference script #1467

Add RayService vLLM TPU Inference script #1467

ryanaoleary commented Sep 25, 2024

andrewsykim commented Sep 25, 2024

andrewsykim Sep 25, 2024

richardsliu Oct 3, 2024

snippet-bot bot commented Sep 27, 2024 •

edited

Loading

ryanaoleary commented Sep 27, 2024

ryanaoleary commented Oct 2, 2024

Add RayService vLLM TPU Inference script #1467

Are you sure you want to change the base?

Add RayService vLLM TPU Inference script #1467

Conversation

ryanaoleary commented Sep 25, 2024

Description

Tasks

andrewsykim commented Sep 25, 2024

andrewsykim Sep 25, 2024

Choose a reason for hiding this comment

richardsliu Oct 3, 2024

Choose a reason for hiding this comment

snippet-bot bot commented Sep 27, 2024 • edited Loading

ryanaoleary commented Sep 27, 2024

ryanaoleary commented Oct 2, 2024

snippet-bot bot commented Sep 27, 2024 •

edited

Loading