Replies: 2 comments
- good looking call
- it's a hard check, you can specify …
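The reply above is cut off, but assuming it refers to vLLM's `max_model_len` engine argument (the startup check verifies that the preallocated KV cache can hold at least one sequence of that length), a minimal sketch of capping the context so the model fits on a 24 GB card might look like this — the model id and the 32k value are illustrative assumptions, not recommendations:

```python
from vllm import LLM, SamplingParams

# Assumption: cap max_model_len below the model's default 128k so the
# worst-case KV-cache reservation fits on a 24 GB GPU. 32768 is illustrative.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",   # assumed model id
    max_model_len=32768,                    # cap context length instead of the full 128k
    gpu_memory_utilization=0.90,            # fraction of GPU memory vLLM may claim
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```

Raising `gpu_memory_utilization` can also help at the margin, but with a 128k default context the usual fix is lowering `max_model_len`.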
I thought vLLM's dynamic memory management would let the GPU hold as much KV cache as it could, i.e. grow the KV cache one page at a time as it fills up while constantly freeing the memory of finished queries and reallocating it.
However, when putting Llama-3.1-8B on a 24 GB 4090, it errors out saying there is not enough space available for the FULL 128k context length.
I am effectively running the vLLM offline inference example, just with Llama-3.1-8B (https://docs.vllm.ai/en/v0.5.5/getting_started/examples/offline_inference.html).
What am I misunderstanding here?
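For context, the setup described above amounts to roughly the following sketch — the linked offline-inference example with the model swapped to Llama-3.1-8B; the prompts, sampling settings, and exact model id are assumptions rather than the actual script:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# With no max_model_len override, vLLM keeps the model's default 128k context.
# The KV-cache pool is preallocated at startup (paging only manages blocks
# inside that fixed pool), and the engine checks that the pool can hold at
# least one full-length sequence. On a 24 GB 4090, next to the 8B weights,
# it cannot, so the engine errors out at startup rather than growing GPU
# memory on demand at runtime.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```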