[REQUEST] High throughput with large batch size #686

Open
3 tasks done
fzyzcjy opened this issue Nov 26, 2024 · 5 comments

Comments

@fzyzcjy

fzyzcjy commented Nov 26, 2024

Problem

Hi, thanks for the library! I hope to run a 7B model on a 24GB 4090 with as much throughput as possible (latency is not a concern, since this is a batch task). vLLM works well, but its 8-bit KV cache seems to degrade the results a lot (or maybe I am not using it correctly yet). exllamav2 seems to have a very good low-bit KV cache, so I would appreciate it if it could achieve high throughput with a large batch size (e.g. batch size = 256).

Solution

(see above)

Alternatives

No response

Explanation

(see above)

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@turboderp
Member

You can run with a large batch size if you have the VRAM to store that number of sequences in the cache at once. But throughput and latency are ultimately still connected. If you want to run at bsz 1000 and a context length of 32k or whatever, that means a 32M-token cache. However you manage that, it's going to far outweigh the storage requirement and bandwidth usage for the weights, and at that point why would you even be considering quantization?
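
To make the cache arithmetic above concrete, here is a rough sketch that estimates KV cache size from batch size, context length, and cache precision. The model dimensions (32 layers, 8 KV heads, head dimension 128) are an assumed 7B-class shape with grouped-query attention, not taken from this thread, and the estimate ignores any per-block scale overhead a quantized cache may carry.

```python
def kv_cache_bytes(num_sequences, context_len, num_layers,
                   num_kv_heads, head_dim, bits_per_value):
    """Estimate bytes needed to hold keys and values for every cached token."""
    total_tokens = num_sequences * context_len
    # Factor 2: one key vector and one value vector per layer per KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return total_tokens * values_per_token * bits_per_value // 8


GIB = 1024 ** 3

# Assumed model shape (hypothetical 7B-class model with 8 KV heads).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

SCENARIOS = [
    ("bsz 1000, 32k context", 1000, 32 * 1024),
    ("bsz 256, 2k context", 256, 2048),
]

for label, bsz, ctx in SCENARIOS:
    for bits in (16, 8, 4):
        size = kv_cache_bytes(bsz, ctx, bits_per_value=bits, **SHAPE)
        print(f"{label}, {bits}-bit cache: {size / GIB:.1f} GiB")
```

Under these assumed dimensions, 256 sequences of 2k tokens each need about 64 GiB at 16 bits and about 16 GiB at 4 bits, while the 1000 x 32k case needs about 4000 GiB at 16 bits. These are upper bounds that assume every sequence fills its whole context.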

@fzyzcjy
Author

fzyzcjy commented Nov 27, 2024

Thanks for the reply! I am mainly testing bs=256 with a context of around 1-2k tokens, and I find vLLM/lmdeploy quite fast at that scale.

@turboderp
Member

You can try the bulk_inference.py example, which could work at that scale.

@fzyzcjy
Author

fzyzcjy commented Dec 1, 2024

Thank you!

@fzyzcjy
Author

fzyzcjy commented Dec 2, 2024

@turboderp By the way, I wonder whether it is OK to use https://github.com/theroyallab/tabbyAPI to test the speed (i.e. will it have nearly the same performance as the direct batch call in bulk_inference.py)? Currently my code tests vLLM / lmdeploy through their OpenAI-compatible servers by sending HTTP requests to them.
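
For reference, the kind of HTTP-based measurement described above can be sketched as follows. This is a minimal sketch using only the standard library. The URL, port, and model name are placeholders, and it assumes the server exposes an OpenAI-style `/v1/completions` endpoint whose response includes a `usage.completion_tokens` field. Some servers also require an API key header, which is omitted here.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholders (assumptions): adjust to the server under test.
BASE_URL = "http://localhost:5000/v1/completions"
MODEL = "my-7b-model"


def build_payload(prompt, max_tokens=256):
    """Build an OpenAI-style completion request body."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}


def send_request(payload, url=BASE_URL):
    """POST one request and return the number of generated tokens."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes the server reports token usage in the response.
    return body["usage"]["completion_tokens"]


def measure_throughput(prompts, concurrency=256, send=send_request):
    """Send all prompts concurrently; return (total tokens, tokens per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(lambda p: send(build_payload(p)), prompts))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    return total, total / elapsed
```

Calling `measure_throughput(prompts)` against each server with the same prompt list gives a like-for-like number that includes HTTP and scheduling overhead, so it may differ from what a direct in-process batch call achieves. The `send` parameter exists so the request function can be swapped out, for example to add authentication.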
