Ray Disaggregated Serving MVP #106
Conversation
High-level comment: it looks like the main difference for is_disaggregated within PyTorchRayEngine is whether or not prefill returns outputs. If the prefill/decode/interleave functionality is essentially the same, then I guess it's an implementation detail for the orchestrator to trigger the transfer. If so, is it possible to exclude is_disaggregated from the worker? That would reduce the complexity.
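For context, a minimal sketch of the engine-side branch being discussed (class and method names are hypothetical, not the actual PyTorchRayEngine code): in interleave mode prefill keeps its result on the workers, while in disaggregated mode it returns the output so the orchestrator can trigger the transfer to the decode workers.

```python
import ray

# Hypothetical sketch; names and signatures are illustrative only.
class PyTorchRayEngineSketch:
    def __init__(self, workers, is_disaggregated: bool):
        self.workers = workers
        self.is_disaggregated = is_disaggregated

    def prefill(self, params, padded_tokens, true_length):
        # Fan the prefill call out to the worker actors.
        refs = [
            w.prefill.remote(
                params=params,
                padded_tokens=padded_tokens,
                true_length=true_length,
            )
            for w in self.workers
        ]
        results = ray.get(refs)
        if self.is_disaggregated:
            # Disaggregated: return the prefill output (tokens + KV cache
            # handle) so the orchestrator can push it to the decode engine.
            return results[0]
        # Interleave: the KV cache stays on these same workers, so
        # nothing needs to be returned to the caller.
        return None
```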
Simplified the prefill call on the engine side. On the worker side: yes, they are the same for insert and decode, but I feel it's better to keep separate disaggregated and interleave paths for prefill, for several reasons:
I think that makes sense to me, thanks!
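To summarize the worker-side split described above, here is a rough skeleton (a sketch only; method names are assumptions, not the PR's actual worker API): insert and decode stay shared across both modes, and only prefill has an interleave and a disaggregated variant, the latter returning its result for transfer.

```python
# Illustrative worker skeleton; not the actual worker code in this PR.
class DisaggregationCapableWorker:
    def _run_prefill(self, params, padded_tokens, true_length):
        # Placeholder for the model forward pass over the prompt.
        return {"kv_cache": ..., "first_token": ...}

    def prefill_interleave(self, params, padded_tokens, true_length):
        # Interleave: keep the result local; decode reads the KV cache
        # directly from this worker.
        self._prefill_result = self._run_prefill(params, padded_tokens, true_length)

    def prefill_disaggregated(self, params, padded_tokens, true_length):
        # Disaggregated: return the result so it can be shipped to the
        # decode pod slice.
        return self._run_prefill(params, padded_tokens, true_length)

    # insert and decode are identical in both modes, so no
    # is_disaggregated flag is needed for them.
    def insert(self, prefill_result, decode_state, slot):
        ...

    def decode(self, params, decode_state):
        ...
```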
This PR enables PyTorch engine disaggregated serving across multiple TPU Pod slices.
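At a high level, the disaggregated setup places prefill and decode on separate pod slices and has the orchestrator move the prefill result between them. A conceptual Ray sketch of that flow (actor names, resource labels, and methods are assumptions for illustration, not this PR's code):

```python
import ray

# Assumes a Ray cluster spanning both TPU pod slices that exposes the
# custom resources "prefill_slice" and "decode_slice" (illustrative names).
ray.init()

@ray.remote(resources={"prefill_slice": 1})
class PrefillWorker:
    def prefill(self, tokens):
        # Forward pass over the prompt; return the KV cache and first token.
        return {"kv_cache": ..., "first_token": ...}

@ray.remote(resources={"decode_slice": 1})
class DecodeWorker:
    def __init__(self):
        self.slots = {}

    def insert(self, prefill_result, slot):
        # Receive the transferred prefill result into a decode slot.
        self.slots[slot] = prefill_result

    def decode(self, slot):
        # Autoregressive generation from the inserted cache (placeholder).
        return self.slots[slot]["first_token"]

# Orchestrator: prefill on one slice, transfer, then decode on the other.
prefill_worker = PrefillWorker.remote()
decode_worker = DecodeWorker.remote()
result = ray.get(prefill_worker.prefill.remote([1, 2, 3]))
ray.get(decode_worker.insert.remote(result, slot=0))
first_token = ray.get(decode_worker.decode.remote(slot=0))
```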
This PR delivered:
Result validation:
Command:
```bash
python /home/{user}/jetstream-pytorch/run_interactive_disaggregated.py \
  --size=7b \
  --batch_size=1 \
  --is_disaggregated=True \
  --num_hosts=8 \
  --decode_pod_slice_name={user}-tpu-vm-2 \
  --model_name=llama-2 \
  --max_cache_length=2048 \
  --quantize_weights=False \
  --quantize_kv_cache=False \
  --checkpoint_path=/home/{user}/data/llama-2-7b-chat-safetensor/model.safetensors \
  --tokenizer_path=/home/{user}/data/tokenizer.model \
  --sharding_config=/home/{user}/jetstream-pytorch/default_shardings/llama.yaml
```
Interleave result:
Disaggregated result:
Next Steps:
5: Support multiple prefill engines and multiple decode engines