The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1792). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine
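As a minimal sketch of the adjustment the error message suggests, assuming a placeholder model name and illustrative values, the two settings can be passed directly to the vLLM engine constructor:

```python
# Minimal sketch: raise gpu_memory_utilization and/or lower max_model_len
# so the KV cache can hold at least max_model_len tokens.
# The model name and the exact values are placeholders, not from the original report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder; use your actual checkpoint
    gpu_memory_utilization=0.95,       # raise from the 0.9 default to give the KV cache more room
    max_model_len=2048,                # or cap the context below the model's default of 4096
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Hello, world!"], sampling_params)
print(outputs[0].outputs[0].text)
```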
Hi, thanks for running our code. It looks like you are encountering an issue with vLLM. You could refer to vllm-project/vllm#2418 and try the solution mentioned there. Since vLLM's behavior may depend on your CUDA and torch versions, I cannot determine the exact solution for your case. If you still run into issues with vLLM, you may switch to Hugging Face inference instead.
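For reference, a minimal sketch of the Hugging Face inference fallback, assuming a placeholder checkpoint name and generic generation settings (not taken from this repository's scripts):

```python
# Minimal sketch: plain Hugging Face transformers generation as a fallback to vLLM.
# Model name, dtype, and generation arguments are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; use the same checkpoint as before
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```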