
Use vLLM to load LLMs #230

Merged: 10 commits merged into livepeer:main on Dec 26, 2024
Conversation

@kyriediculous (Contributor) commented Oct 17, 2024

This PR upgrades the LLM pipeline to use vLLM to load and run inference on models, taking advantage of vLLM's optimised batching and other features.

Dependencies have been upgraded to be compatible with vLLM 0.6.3. These new dependency versions are untested with other pipelines (though they could benefit them as well).

  • Both fp16 and 8-bit quantization are still supported, but this could be further optimized by detecting the GPUs on the machine and adjusting the quantization method accordingly.

  • The Dockerfile has been updated to use newer pip and torch.

  • The Dockerfile has been updated to respect CUDA_PCI_BUS_ORDER, ensuring the same developer experience as go-livepeer when specifying GPU IDs found in nvidia-smi.

  • Adds Top_P and Top_K parameters to the LLM route

  • Changes the API to take messages in the common LLM chat format instead of separate prompt and history fields (see the example request below).
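
For illustration, a request against the updated route might look like the following. This is a sketch: the field names (messages, top_p, top_k, max_tokens, stream) follow the common OpenAI-style chat format described above and may differ from the final schema; the endpoint URL is taken from the curl example later in this thread.

```python
import requests

# Illustrative request body in the common chat "messages" format; the exact
# schema of the /llm route may differ.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain continuous batching in one sentence."},
    ],
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 256,
    "stream": True,
}

# Stream the SSE response: lines of 'data: {...}' chunks, terminated by 'data: [DONE]'.
with requests.post("http://localhost:9000/llm", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode())
```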

@kyriediculous marked this pull request as ready for review October 17, 2024 19:55
@kyriediculous force-pushed the llm branch 2 times, most recently from 30cc874 to b7e5606 on October 17, 2024 20:09
@kyriediculous force-pushed the llm branch 3 times, most recently from 84eadc3 to 6ed3584 on December 19, 2024 14:34
@ad-astra-video (Collaborator) commented:

Hey Nico! I did a first round of review and have some questions/comments:

  • There are a lot of requirements changes in this PR. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vLLM dependencies. Diffusers in particular needs to stay at 0.31.0 to maintain support in other pipelines.
    • I think upgrading torch to 2.4 or 2.5 should be fine from other testing I have done.
  • Is there a Dockerfile change I am missing for GPU IDs?
  • Thoughts on using vllm 0.6.4 or 0.6.5?
  • Does streaming work for you? I get this response; the container logs are shown after the response.
curl -X POST "http://localhost:9000/llm" -H "Content-Type: application/json" -H "Connection: keep-alive" -H "Keep-Alive: timeout=5, max=100" -d @llm.json
data: {"choices": [{"delta": {"content": "<|start_header_id|>"}, "finish_reason": null}], "created": 1734653742, "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "id": "chatcmpl-1734653742"}

data: [DONE]
2024-12-20 00:15:42,934 INFO:     172.18.0.1:41394 - "POST /llm HTTP/1.1" 200 OK
runner-llm-1  | INFO 12-20 00:15:42 async_llm_engine.py:209] Added request chatcmpl-1734653742.
runner-llm-1  | INFO 12-20 00:15:42 metrics.py:345] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
runner-llm-1  | INFO 12-20 00:15:42 metrics.py:361] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%
runner-llm-1  | INFO 12-20 00:15:42 async_llm_engine.py:221] Aborted request chatcmpl-1734653742.
  • The model's num_heads needs to be divisible by the tensor parallel size. Can we check for this, or do we prefer to let it fail and crash loop? Example below from testing Qwen 2.5 32B (a possible pre-flight check is sketched after the logs at the end of this comment).
ValueError: Total number of attention heads (40) must be divisible by tensor parallel size (3).
  • Thoughts on allowing setting pipeline parallel size?

  • The runner dies with CUDA errors sometimes... if I allocate 4x 3090 Tis it crash loops. With 2 allocated it loaded Llama 8B fine. This is likely an edge case, since why would you map more than 1 GPU if the model does not need more than 1? Logs from a successful load with 2 GPUs allocated are below for reference.

runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Initializing LLM pipeline
runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Available GPU memory: {0: '23GiB', 1: '23GiB'}
runner-llm-1  | 2024-12-20 00:49:31,004 - app.pipelines.llm - INFO - Tensor parallel size: 2
runner-llm-1  | 2024-12-20 00:49:31,011 - app.pipelines.llm - INFO - Using BFloat16 precision
runner-llm-1  | INFO 12-20 00:49:35 config.py:887] Defaulting to use mp for distributed inference
runner-llm-1  | INFO 12-20 00:49:35 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', speculative_config=None, tokenizer='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
runner-llm-1  | WARNING 12-20 00:49:35 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
runner-llm-1  | INFO 12-20 00:49:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
runner-llm-1  | /root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
runner-llm-1  | No module named 'vllm._version'
runner-llm-1  |   from vllm.version import __version__ as VLLM_VERSION
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
runner-llm-1  | INFO 12-20 00:49:39 utils.py:1008] Found nccl from library libnccl.so.2
runner-llm-1  | INFO 12-20 00:49:39 pynccl.py:63] vLLM is using nccl==2.20.5
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 utils.py:1008] Found nccl from library libnccl.so.2
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:39 pynccl.py:63] vLLM is using nccl==2.20.5
runner-llm-1  | INFO 12-20 00:49:39 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | INFO 12-20 00:49:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
runner-llm-1  | INFO 12-20 00:49:48 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x729547271d90>, local_subscribe_port=37659, remote_subscribe_port=None)
runner-llm-1  | INFO 12-20 00:49:48 model_runner.py:1060] Starting to load model /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659...
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:48 model_runner.py:1060] Starting to load model /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.26it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.64it/s]
runner-llm-1  |
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:51 model_runner.py:1071] Loading model weights took 7.5122 GB
runner-llm-1  | INFO 12-20 00:49:51 model_runner.py:1071] Loading model weights took 7.5122 GB
runner-llm-1  | INFO 12-20 00:49:52 distributed_gpu_executor.py:57] # GPU blocks: 11472, # CPU blocks: 4096
runner-llm-1  | INFO 12-20 00:49:52 distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request: 22.41x
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:55 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:49:55 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
runner-llm-1  | INFO 12-20 00:49:55 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
runner-llm-1  | INFO 12-20 00:49:55 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:50:03 custom_all_reduce.py:233] Registering 1235 cuda graph addresses
runner-llm-1  | INFO 12-20 00:50:03 custom_all_reduce.py:233] Registering 1235 cuda graph addresses
runner-llm-1  | (VllmWorkerProcess pid=221) INFO 12-20 00:50:03 model_runner.py:1530] Graph capturing finished in 8 secs.
runner-llm-1  | INFO 12-20 00:50:03 model_runner.py:1530] Graph capturing finished in 8 secs.
runner-llm-1  | 2024-12-20 00:50:03,843 - app.pipelines.llm - INFO - Model loaded: meta-llama/Meta-Llama-3.1-8B-Instruct
runner-llm-1  | 2024-12-20 00:50:03,843 - app.pipelines.llm - INFO - Using GPU memory utilization: 0.85
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - CUDA devices available:
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - Device 0: id='GPU-8dc1a60e-43d1-33f1-dcc9-47853e8af470' name='NVIDIA GeForce RTX 3090 Ti' memory_total=25757220864 memory_free=4343595008 major=8 minor=6
runner-llm-1  | 2024-12-20 00:50:03,855 - app.utils.hardware - INFO - Device 1: id='GPU-f5a4c9da-963f-bcec-17f2-401f36506fb1' name='NVIDIA GeForce RTX 3090 Ti' memory_total=25757220864 memory_free=4377149440 major=8 minor=6
runner-llm-1  | 2024-12-20 00:50:03,855 - app.main - INFO - Started up with pipeline LLMPipeline(model_id=meta-llama/Meta-Llama-3.1-8B-Instruct)
runner-llm-1  | 2024-12-20 00:50:03,855 INFO:     Application startup complete.
runner-llm-1  | 2024-12-20 00:50:03,855 INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
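
One way to address the num_heads point above would be a pre-flight check along the following lines. This is a sketch only, not code from this PR; the helper name and the use of transformers.AutoConfig are illustrative.

```python
from transformers import AutoConfig


def validate_tensor_parallel_size(model_id: str, tensor_parallel_size: int) -> None:
    """Fail fast with a clear error instead of crash looping when the attention
    head count does not divide evenly across the requested tensor parallel size."""
    config = AutoConfig.from_pretrained(model_id)
    num_heads = getattr(config, "num_attention_heads", None)
    if num_heads is None:
        return  # some architectures expose this differently; skip the check
    if num_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_heads}) must be divisible "
            f"by tensor parallel size ({tensor_parallel_size})."
        )


# Example: Qwen 2.5 32B has 40 attention heads, so a tensor parallel size of 3
# fails the check while 1, 2, 4, or 8 pass.
validate_tensor_parallel_size("Qwen/Qwen2.5-32B-Instruct", 2)
```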

@rickstaa (Member) commented Dec 20, 2024

@ad-astra-video thanks for the swift review ⚡:

There are a lot of requirements changes in this PR. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vLLM dependencies. Diffusers in particular needs to stay at 0.31.0 to maintain support in other pipelines.

Hi @kyriediculous, would it be feasible to run the LLM pipeline in its own Docker container? We’re gradually migrating other pipelines into separate containers now that the orchestrator automatically downloads them (see PR #308 and PR #200).

This setup would give you full control over dependencies and streamline reviews, as Brad wouldn’t need to test unrelated pipelines—especially since E2E tests similar to those in realtime aren’t implemented yet. Perhaps your new pip-compile setup already addresses this issue (I haven’t tested it yet).

Additionally, we can merge PR #293, which enables pipeline container overrides. If you have any concerns about using a dedicated container, please let me know so I can better understand the constraints.

Looking forward to hearing your thoughts!

@kyriediculous (Contributor, Author) commented:

@ad-astra-video Do you also have the logs from the crash with 4 GPUs?

@kyriediculous (Contributor, Author) commented Dec 20, 2024

Hi @kyriediculous, would it be feasible to run the LLM pipeline in its own Docker container? [...]

@rickstaa The only requirement would be creating a separate Dockerfile.llm in the docker/ folder, right?

@kyriediculous (Contributor, Author) commented:

@ad-astra-video @rickstaa

  • Fixed streaming
  • Isolated Docker image
    • Includes all env vars
    • Uses pip-compile to build the dependency list
  • Added a pipeline parallelism setting (see the sketch below)
  • Updated vLLM to 0.6.5 (0.6.3 was the highest version a few weeks ago)
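
A minimal sketch of how those parallelism settings could be wired into vLLM's async engine. The environment variable names here are illustrative assumptions; the 0.85 GPU memory utilization matches the value seen in the startup logs above, but the defaults used by this PR may differ.

```python
import os

import torch
from vllm import AsyncEngineArgs, AsyncLLMEngine


def build_engine(model_dir: str) -> AsyncLLMEngine:
    # Default tensor parallelism to the number of visible GPUs; both knobs can
    # be overridden through environment variables (names assumed here).
    engine_args = AsyncEngineArgs(
        model=model_dir,
        tensor_parallel_size=int(
            os.getenv("TENSOR_PARALLEL_SIZE", torch.cuda.device_count() or 1)
        ),
        pipeline_parallel_size=int(os.getenv("PIPELINE_PARALLEL_SIZE", "1")),
        gpu_memory_utilization=float(os.getenv("GPU_MEMORY_UTILIZATION", "0.85")),
        max_model_len=8192,
    )
    return AsyncLLMEngine.from_engine_args(engine_args)
```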

@ad-astra-video (Collaborator) commented Dec 20, 2024

@ad-astra-video Do you also have the logs from the crash with 4 GPUs?

EDIT: figured it out... the docker container needs --ipc=host or a --shm-size. Not entirely sure what shm-size should be, but there is an example of 10.24g in the vLLM repo:
https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html

Original response below for reference:

Here they are: lp_vllm_startup.txt

Note it starts up fine using 2 for pipeline parallel size (lp_vllm_2_pipelines.txt).

I don't think we need to hold this up for this. We can run multiple runner containers, which I think would be better anyway. The only other nuance for this server is that each set of 3090 Tis is connected with NVLink.

Attaching logs from the vLLM OpenAI-compatible server starting up for comparison.
vllm_startup.txt

@ad-astra-video (Collaborator) commented:

This looks good with the couple of small changes suggested above.

The OpenAPI spec action is failing; can you run the API spec gen again? I will merge after that is passing.

@kyriediculous (Contributor, Author) commented Dec 23, 2024

EDIT: figured it out... the docker container needs --ipc=host or a --shm-size. Not entirely sure what shm-size should be, but there is an example of 10.24g in the vLLM repo. [...]

vLLM and PyTorch use shared memory to efficiently share tensors between the dataloader workers and the main process.

It defaults to 64 MB in Docker, but I found vLLM requires more.

I forgot to mention it; maybe I should attach a small README.
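
For reference, an externally managed runner could be started with a larger shared-memory segment along these lines. This is a sketch using the Docker SDK for Python; the image tag, port mapping, and environment variable are placeholders, not values from this PR.

```python
import docker

client = docker.from_env()
client.containers.run(
    "livepeer/ai-runner:llm",  # placeholder image tag
    detach=True,
    # Either share the host IPC namespace or raise /dev/shm above Docker's
    # 64 MB default, e.g. shm_size="8g" instead of ipc_mode="host".
    ipc_mode="host",
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    ports={"8000/tcp": 9000},  # placeholder mapping
    environment={"MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
)
```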

@ad-astra-video (Collaborator) commented:

vLLM and PyTorch use shared memory to efficiently share tensors between the dataloader workers and the main process.

It defaults to 64 MB in Docker, but I found vLLM requires more.

I forgot to mention it; maybe I should attach a small README.

I see this as a documentation thing mostly, so a note in the docs should cover it.

Managed containers use the default values of IpcMode: private and ShmSize: 67108864 (64 MB). If the target LLM model fits on 1 GPU, that works as-is; the documentation can indicate running the container externally when more than 1 GPU is needed.

@rickstaa (Member) commented Dec 24, 2024

@ad-astra-video, to help move this forward, could you provide a concise list of the remaining changes you’re waiting on before this can be merged? Based on the discussion, here’s what I’ve noted so far:

  • Requirements Updates

    There are a lot of requirements changes in this. I have not reviewed all of them and don't believe we should be changing all requirements to pin to vllm dependencies. Diffusers, in particular, needs to stay at 0.31.0 to maintain support in other pipelines. I think upgrading torch to 2.4 or 2.5 should be fine from other testing I have done.
    This concern was mitigated by putting the LLM pipeline in a separate container.

  • vllm Version

    Thoughts on using vllm 0.6.4 or 0.6.5?
    This was done.

  • Pipeline Parallel Size

    Thoughts on allowing setting pipeline parallel size?
    This appears to have been completed.

  • CUDA Errors on Multi-GPU Setup

    The runner dies with CUDA errors sometimes...if I allocated 4× 3090 Ti GPUs, it crash loops. With 2 allocated, it loaded Llama 8B fine. This is likely an edge case since why map more than 1 GPU if the model does not take more than 1. Logs from a successful load with 2 GPUs allocated for reference.
    This looks to be resolved according to this comment.

  • Streaming Issues

    Does streaming work for you? I get this response, and the container logs are below the response.
    Is there a Dockerfile change I am missing for GPU IDs?
    Have these been addressed? @ad-astra-video, could you confirm?

  • num_heads Divisibility by Tensor Parallel Size

    The model’s num_heads needs to be divisible by the tensor parallel size. Can we add a check for this, or should we prefer to let it fail and crash loop? (Example: Testing Qwen 2.5 32B.)
    Was this discussed or implemented?

  • Documentation Update

    I see this as a documentation task mostly, so a note in the docs should cover this.
    Was this documented as suggested?

Let me know if I missed anything or if there are additional updates to track!

@ad-astra-video (Collaborator) left a review comment:

Here are the suggested changes. My apologies, I think I didn't get these posted earlier.

Review threads: runner/app/pipelines/llm.py (resolved), runner/docker/Dockerfile.llm (outdated, resolved)
@ad-astra-video (Collaborator) commented Dec 24, 2024

@rickstaa @kyriediculous looks like I messed up posting the suggested changes.

The ShmSize and IpcMode settings only come into play when running with pipeline parallelism or a tensor parallel size > 2, I believe. If Nico is comfortable with this, we can move forward with it as a documentation note, since these variables can be adjusted when using external runners.

Suggested changes:

  • Add engine_args.load_format = "bitsandbytes" to enable 8-bit support if it is used (see the sketch below this list).
  • Remove SFAST from the Dockerfile build to speed up the build and limit dependency conflicts down the road, since SFAST is in maintenance mode.
  • Regenerate the OpenAPI bindings. EDIT: will fix in a separate PR.
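
For illustration, the first suggestion could look roughly like the following in the pipeline. The USE_8BIT toggle name is an assumption; note that in vLLM the bitsandbytes load format is typically paired with quantization="bitsandbytes".

```python
import os

from vllm import AsyncEngineArgs

engine_args = AsyncEngineArgs(model="/models/<model-snapshot-path>")  # path illustrative

# When 8-bit is requested, load the weights directly with bitsandbytes instead
# of quantizing after a full-precision load.
if os.getenv("USE_8BIT", "").lower() in ("1", "true"):
    engine_args.quantization = "bitsandbytes"
    engine_args.load_format = "bitsandbytes"
```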

@kyriediculous (Contributor, Author) commented:

I regenerated the bindings but there's no change to commit.

@ad-astra-video (Collaborator) commented Dec 26, 2024

@kyriediculous I confirmed the fixes and tested the runner locally. I am approving and merging, and will re-run the OpenAPI gen separately to fix it.

@ad-astra-video merged commit b81f898 into livepeer:main on Dec 26, 2024 (6 of 8 checks passed).