
Use vLLM to load LLMs #230

Merged: 10 commits merged into livepeer:main on Dec 26, 2024
Conversation

@kyriediculous (Contributor) commented Oct 17, 2024

This PR upgrades the LLM pipeline to use vLLM to load and run inference on models, taking advantage of vLLM's optimised batching and other features.

Dependencies have been upgraded to be compatible with vLLM 0.6.3. These new dependency versions are untested with other pipelines (though they could benefit them as well).

  • Both fp16 and 8-bit quantization are still supported, but this could be further optimized by detecting the GPUs on the machine and adjusting the quantization method accordingly.

  • The Dockerfile has been updated to use newer pip and torch.

  • The Dockerfile has been updated to respect CUDA_PCI_BUS_ORDER, ensuring the same developer experience as go-livepeer when specifying GPU IDs found in nvidia-smi.

  • Adds Top_P and Top_K parameters to the LLM route

  • Changes the API to take messages in the common LLM chat format instead of separate prompt and history fields (see the example request below).
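
For illustration, a request against the updated route might look like the following. This is a sketch: the field names (messages, top_p, top_k, max_tokens, stream) follow the common OpenAI-style chat format described above and may differ from the final schema; the endpoint URL is taken from the curl example later in this thread.

```python
import requests

# Illustrative request body in the common chat "messages" format; the exact
# schema of the /llm route may differ.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain continuous batching in one sentence."},
    ],
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 256,
    "stream": True,
}

# Stream the SSE response: lines of 'data: {...}' chunks, terminated by 'data: [DONE]'.
with requests.post("http://localhost:9000/llm", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode())
```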

@kyriediculous marked this pull request as ready for review October 17, 2024 19:55
@kyriediculous force-pushed the llm branch 2 times, most recently from 30cc874 to b7e5606 on October 17, 2024 20:09
@kyriediculous force-pushed the llm branch 3 times, most recently from 84eadc3 to 6ed3584 on December 19, 2024 14:34
@ad-astra-video (Collaborator) commented:

Hey Nico! I did a first round of review and have some questions/comments:

  • There are a lot of requirements changes in this PR. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vLLM dependencies. Diffusers in particular needs to stay at 0.31.0 to maintain support in other pipelines.
    • I think upgrading torch to 2.4 or 2.5 should be fine from other testing I have done.
  • Is there a Dockerfile change I am missing for GPU IDs?
  • Thoughts on using vllm 0.6.4 or 0.6.5?
  • Does streaming work for you? I get this response; the container logs are shown after the response.
curl -X POST "http://localhost:9000/llm" -H "Content-Type: application/json" -H "Connection: keep-alive" -H "Keep-Alive: timeout=5, max=100" -d @llm.json
data: {"choices": [{"delta": {"content": "<|start_header_id|>"}, "finish_reason": null}], "created": 1734653742, "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "id": "chatcmpl-1734653742"}

data: [DONE]
2024-12-20 00:15:42,934 INFO:     172.18.0.1:41394 - "POST /llm HTTP/1.1" 200 OK
runner-llm-1  | INFO 12-20 00:15:42 async_llm_engine.py:209] Added request chatcmpl-1734653742.
runner-llm-1  | INFO 12-20 00:15:42 metrics.py:345] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
runner-llm-1  | INFO 12-20 00:15:42 metrics.py:361] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%
runner-llm-1  | INFO 12-20 00:15:42 async_llm_engine.py:221] Aborted request chatcmpl-1734653742.
  • The model's num_heads needs to be divisible by the tensor parallel size. Can we check for this, or do we prefer to let it fail and crash loop? Example below from testing Qwen 2.5 32B (a possible pre-flight check is sketched after the logs at the end of this comment).
ValueError: Total number of attention heads (40) must be divisible by tensor parallel size (3).
  • Thoughts on allowing setting pipeline parallel size?

  • The runner dies with CUDA errors sometimes... if I allocate 4x 3090 Tis it crash loops. With 2 allocated it loaded Llama 8B fine. This is likely an edge case, since why would you map more than 1 GPU if the model does not need more than 1? Logs from a successful load with 2 GPUs allocated are below for reference.

runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Initializing LLM pipeline
runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Available GPU memory: {0: '23GiB', 1: '23GiB'}
runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Tensor parallel size: 2
runner-llm-1  | 2024-12-20 00:49:31,011 - app.pipelines.llm - INFO - Using BFloat16 precision
runner-llm-1  | INFO 12-20 00:49:35 config.py:887] Defaulting to use mp for distributed inference
runner-llm-1  | INFO 12-20 00:49:35 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', speculative_config=None, tokenizer='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
runner-llm-1  | WARNING 12-20 00:49:35 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
runner-llm-1  | INFO 12-20 00:49:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
runner-llm-1  | /root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
runner-llm-1  | No module named 'vllm._version'
runner-llm-1  |   from vllm.version import __version__ as VLLM_VERSION
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
runner-llm-1  | INFO 12-20 00:49:39 utils.py:1008] Found nccl from library libnccl.so.2
runner-llm-1  | INFO 12-20 00:49:39 pynccl.py:63] vLLM is using nccl==2.20.5
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 utils.py:1008] Found nccl from library libnccl.so.2
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 pynccl.py:63] vLLM is using nccl==2.20.5
runner-llm-1  | INFO 12-20 00:49:39 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | INFO 12-20 00:49:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | INFO 12-20 00:49:48 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x729547271d90>, local_subscribe_port=37659, remote_subscribe_port=None)
runner-llm-1  | INFO 12-20 00:49:48 model_runner.py:1060] Starting to load model /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659...
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:48 model_runner.py:1060] Starting to load model /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.26it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.64it/s]
runner-llm-1  |
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:51 model_runner.py:1071] Loading model weights took 7.5122 GB
runner-llm-1  | INFO 12-20 00:49:51 model_runner.py:1071] Loading model weights took 7.5122 GB
runner-llm-1  | INFO 12-20 00:49:52 distributed_gpu_executor.py:57] # GPU blocks: 11472, # CPU blocks: 4096
runner-llm-1  | INFO 12-20 00:49:52 distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request: 22.41x
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:55 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:55 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
runner-llm-1  | INFO 12-20 00:49:55 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
runner-llm-1  | INFO 12-20 00:49:55 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:50:03 custom_all_reduce.py:233] Registering 1235 cuda graph addresses
runner-llm-1  | INFO 12-20 00:50:03 custom_all_reduce.py:233] Registering 1235 cuda graph addresses
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:50:03 model_runner.py:1530] Graph capturing finished in 8 secs.
runner-llm-1  | INFO 12-20 00:50:03 model_runner.py:1530] Graph capturing finished in 8 secs.
runner-llm-1  | 2024-12-20 00:50:03,843 - app.pipelines.llm - INFO - Model loaded: meta-llama/Meta-Llama-3.1-8B-Instruct
runner-llm-1  | 2024-12-20 00:50:03,843 - app.pipelines.llm - INFO - Using GPU memory utilization: 0.85
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - CUDA devices available:
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - Device 0: id='GPU-8dc1a60e-43d1-33f1-dcc9-47853e8af470' name='NVIDIA GeForce RTX 3090 Ti' memory_total=25757220864 memory_free=4343595008 major=8 minor=6
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - Device 1: id='GPU-f5a4c9da-963f-bcec-17f2-401f36506fb1' name='NVIDIA GeForce RTX 3090 Ti' memory_total=25757220864 memory_free=4377149440 major=8 minor=6
runner-llm-1  | 2024-12-20 00:50:03,855 - app.main - INFO - Started up with pipeline LLMPipeline(model_id=meta-llama/Meta-Llama-3.1-8B-Instruct)
runner-llm-1  | 2024-12-20 00:50:03,855 INFO:     Application startup complete.
runner-llm-1  | 2024-12-20 00:50:03,855 INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
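
One way to address the num_heads point above would be a pre-flight check along the following lines. This is a sketch only, not code from this PR; the helper name and the use of transformers.AutoConfig are illustrative.

```python
from transformers import AutoConfig


def validate_tensor_parallel_size(model_id: str, tensor_parallel_size: int) -> None:
    """Fail fast with a clear error instead of crash looping when the attention
    head count does not divide evenly across the requested tensor parallel size."""
    config = AutoConfig.from_pretrained(model_id)
    num_heads = getattr(config, "num_attention_heads", None)
    if num_heads is None:
        return  # some architectures expose this differently; skip the check
    if num_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_heads}) must be divisible "
            f"by tensor parallel size ({tensor_parallel_size})."
        )


# Example: Qwen 2.5 32B has 40 attention heads, so a tensor parallel size of 3
# fails the check while 1, 2, 4, or 8 pass.
validate_tensor_parallel_size("Qwen/Qwen2.5-32B-Instruct", 2)
```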

@rickstaa (Member) commented Dec 20, 2024

@ad-astra-video thanks for the swift review ⚡:

There are a lot of requirements changes in this PR. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vLLM dependencies. Diffusers in particular needs to stay at 0.31.0 to maintain support in other pipelines.

Hi @kyriediculous, would it be feasible to run the LLM pipeline in its own Docker container? We’re gradually migrating other pipelines into separate containers now that the orchestrator automatically downloads them (see PR #308 and PR #200).

This setup would give you full control over dependencies and streamline reviews, as Brad wouldn’t need to test unrelated pipelines—especially since E2E tests similar to those in realtime aren’t implemented yet. Perhaps your new pip-compile setup already addresses this issue (I haven’t tested it yet).

Additionally, we can merge PR #293, which enables pipeline container overrides. If you have any concerns about using a dedicated container, please let me know so I can better understand the constraints.

Looking forward to hearing your thoughts!

@kyriediculous (Contributor, Author) commented:

@ad-astra-video Do you also have the logs from the crash with 4 GPUs?

@kyriediculous (Contributor, Author) commented Dec 20, 2024

Hi @kyriediculous, would it be feasible to run the LLM pipeline in its own Docker container? [...]

@rickstaa The only requirement would be creating a separate Dockerfile.llm in the docker/ folder, right?

@kyriediculous (Contributor, Author) commented:

@ad-astra-video @rickstaa

  • Fixed streaming
  • Isolated Docker image
    • Includes all env vars
    • Uses pip-compile to build the dependency list
  • Added a pipeline parallelism setting (see the sketch below)
  • Updated vLLM to 0.6.5 (0.6.3 was the highest version a few weeks ago)
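
A minimal sketch of how those parallelism settings could be wired into vLLM's async engine. The environment variable names here are illustrative assumptions; the 0.85 GPU memory utilization matches the value seen in the startup logs above, but the defaults used by this PR may differ.

```python
import os

import torch
from vllm import AsyncEngineArgs, AsyncLLMEngine


def build_engine(model_dir: str) -> AsyncLLMEngine:
    # Default tensor parallelism to the number of visible GPUs; both knobs can
    # be overridden through environment variables (names assumed here).
    engine_args = AsyncEngineArgs(
        model=model_dir,
        tensor_parallel_size=int(
            os.getenv("TENSOR_PARALLEL_SIZE", torch.cuda.device_count() or 1)
        ),
        pipeline_parallel_size=int(os.getenv("PIPELINE_PARALLEL_SIZE", "1")),
        gpu_memory_utilization=float(os.getenv("GPU_MEMORY_UTILIZATION", "0.85")),
        max_model_len=8192,
    )
    return AsyncLLMEngine.from_engine_args(engine_args)
```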

@ad-astra-video (Collaborator) commented Dec 20, 2024

@ad-astra-video Do you also have the logs from the crash with 4 GPUs?

EDIT: figured it out... the docker container needs --ipc=host or a --shm-size. Not entirely sure what shm-size should be, but there is an example of 10.24g in the vLLM repo:
https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html

Original response below for reference:

Here they are: lp_vllm_startup.txt

Note it starts up fine using 2 for pipeline parallel size (lp_vllm_2_pipelines.txt).

I don't think we need to hold this up for this. We can run multiple runner containers, which I think would be better anyway. The only other nuance for this server is that each set of 3090 Tis is connected with NVLink.

Attaching logs from the vLLM OpenAI-compatible server starting up for comparison.
vllm_startup.txt

@ad-astra-video (Collaborator) commented:

This looks good with the couple of small changes suggested above.

The OpenAPI spec action is failing; can you run the API spec gen again? I will merge after that is passing.

@kyriediculous (Contributor, Author) commented Dec 23, 2024

EDIT: figured it out... the docker container needs --ipc=host or a --shm-size. Not entirely sure what shm-size should be, but there is an example of 10.24g in the vLLM repo. [...]

vLLM and PyTorch use shared memory to efficiently share tensors between the dataloader workers and the main process.

It defaults to 64 MB in Docker, but I found vLLM requires more.

I forgot to mention it; maybe I should attach a small README.
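
For reference, an externally managed runner could be started with a larger shared-memory segment along these lines. This is a sketch using the Docker SDK for Python; the image tag, port mapping, and environment variable are placeholders, not values from this PR.

```python
import docker

client = docker.from_env()
client.containers.run(
    "livepeer/ai-runner:llm",  # placeholder image tag
    detach=True,
    # Either share the host IPC namespace or raise /dev/shm above Docker's
    # 64 MB default, e.g. shm_size="8g" instead of ipc_mode="host".
    ipc_mode="host",
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    ports={"8000/tcp": 9000},  # placeholder mapping
    environment={"MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
)
```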

@ad-astra-video (Collaborator) commented:

vLLM and PyTorch use shared memory to efficiently share tensors between the dataloader workers and the main process.

It defaults to 64 MB in Docker, but I found vLLM requires more.

I forgot to mention it; maybe I should attach a small README.

I see this as a documentation thing mostly, so a note in the docs should cover it.

Managed containers use the default values of IpcMode: private and ShmSize: 67108864 (64 MB). If the target LLM model fits on 1 GPU, that works as-is; the documentation can indicate running the container externally when more than 1 GPU is needed.

@rickstaa (Member) commented Dec 24, 2024

@ad-astra-video, to help move this forward, could you provide a concise list of the remaining changes you’re waiting on before this can be merged? Based on the discussion, here’s what I’ve noted so far:

  • Requirements Updates

    There are a lot of requirements changes in this. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vllm dependencies. Diffusers, in particular, needs to stay at 0.31.0 to maintain support in other pipelines. I think upgrading torch to 2.4 or 2.5 should be fine from other testing I have done.
    This concern was mitigated by putting the LLM pipeline in a separate container.

  • vllm Version

    Thoughts on using vllm 0.6.4 or 0.6.5?
    This was done.

  • Pipeline Parallel Size

    Thoughts on allowing setting pipeline parallel size?
    This appears to have been completed.

  • CUDA Errors on Multi-GPU Setup

    The runner dies with CUDA errors sometimes...if I allocated 4× 3090 Ti GPUs, it crash loops. With 2 allocated, it loaded Llama 8B fine. This is likely an edge case since why map more than 1 GPU if the model does not take more than 1. Logs from a successful load with 2 GPUs allocated for reference.
    This looks to be resolved according to this comment.

  • Streaming Issues

    Does streaming work for you? I get this response, and the container logs are below the response.
    Is there a Dockerfile change I am missing for GPU IDs?
    Have these been addressed? @ad-astra-video, could you confirm?

  • num_heads Divisibility by Tensor Parallel Size

    The model’s num_heads needs to be divisible by the tensor parallel size. Can we add a check for this, or should we prefer to let it fail and crash loop? (Example: Testing Qwen 2.5 32B.)
    Was this discussed or implemented?

  • Documentation Update

    I see this as a documentation task mostly, so a note in the docs should cover this.
    Was this documented as suggested?

Let me know if I missed anything or if there are additional updates to track!

@ad-astra-video (Collaborator) left a review comment:

Here are the suggested changes. My apologies, I think I didn't get these posted earlier.

Review threads: runner/app/pipelines/llm.py (resolved), runner/docker/Dockerfile.llm (outdated, resolved)
@ad-astra-video (Collaborator) commented Dec 24, 2024

@rickstaa @kyriediculous looks like I messed up posting the suggested changes.

The ShmSize and IpcMode settings only come into play when running with pipeline parallelism or a tensor parallel size > 2, I believe. If Nico is comfortable with this, we can move forward with it as a documentation note, since these variables can be adjusted when using external runners.

Suggested changes:

  • Add engine_args.load_format = "bitsandbytes" to enable 8-bit support if it is used (see the sketch below this list).
  • Remove SFAST from the Dockerfile build to speed up the build and limit dependency conflicts down the road, since SFAST is in maintenance mode.
  • Regenerate the OpenAPI bindings. EDIT: will fix in a separate PR.
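
For illustration, the first suggestion could look roughly like the following in the pipeline. The USE_8BIT toggle name is an assumption; note that in vLLM the bitsandbytes load format is typically paired with quantization="bitsandbytes".

```python
import os

from vllm import AsyncEngineArgs

engine_args = AsyncEngineArgs(model="/models/<model-snapshot-path>")  # path illustrative

# When 8-bit is requested, load the weights directly with bitsandbytes instead
# of quantizing after a full-precision load.
if os.getenv("USE_8BIT", "").lower() in ("1", "true"):
    engine_args.quantization = "bitsandbytes"
    engine_args.load_format = "bitsandbytes"
```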

@kyriediculous (Contributor, Author) commented:

I regenerated the bindings but there's no change to commit.

@ad-astra-video (Collaborator) commented Dec 26, 2024

@kyriediculous I confirmed the fixes and tested the runner locally. I am approving and merging, and will re-run the OpenAPI gen separately to fix it.

@ad-astra-video merged commit b81f898 into livepeer:main on Dec 26, 2024 (6 of 8 checks passed).