Control maximum parallel requests to inference server #424

Open
bbrowning opened this issue Dec 2, 2024 · 0 comments

bbrowning commented Dec 2, 2024

We've observed users overwhelming their inference server, primarily when running data generation in batches against vLLM. We end up piling thousands of pending requests onto vLLM, which exhausts its KV cache, outpaces the teacher model's ability to keep up, and ultimately causes inference timeouts.

Today we only expose a num-cpus parameter that controls how many batches we process in parallel, but nothing limits how many in-flight requests we send to the inference server. Instead, we should expose a knob that caps the number of parallel requests we'll send to the inference server at a time, and adjust our request handling so that we respect that concurrency limit rather than firing thousands of requests at an already backed-up inference server.

So, the proposal here is to deprecate the num-cpus parameter and add a new one - name to be determined, but something like max-parallel-requests - and to centralize dispatching of requests to the inference server so we have a client-side queue of pending requests that are pipelined to the server based on the configured max-parallel-requests. This implicitly applies backpressure, since we'd never have more than N requests in flight, and it also gives an easy knob for someone hosting teacher models across multiple machines behind a load balancer to increase the number of parallel requests we send across their distributed inference.
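
A minimal sketch of the client-side bound described above, assuming Python asyncio; the `send_request` coroutine, the `prompts` iterable, and the `max_parallel_requests` name are placeholders for illustration, not the actual SDG interfaces:

```python
import asyncio


async def generate_all(prompts, send_request, max_parallel_requests=8):
    """Dispatch prompts to the inference server, keeping at most
    `max_parallel_requests` requests in flight at any time."""
    semaphore = asyncio.Semaphore(max_parallel_requests)

    async def bounded_send(prompt):
        # The semaphore applies backpressure: the (N+1)th request waits
        # here until one of the in-flight requests completes.
        async with semaphore:
            return await send_request(prompt)

    # Tasks are created up front, but only max_parallel_requests of them
    # can be talking to the inference server at once.
    return await asyncio.gather(*(bounded_send(p) for p in prompts))
```

With a bound like this in place, someone serving the teacher model across several machines behind a load balancer could simply raise max-parallel-requests to take advantage of the extra capacity, while the default would keep a single vLLM instance from being buried in pending requests.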
