We've observed users overwhelming their inference server, primarily when running data generation in batches against vLLM. What happens is we end up piling thousands of pending requests onto vLLM, causing inference timeouts as we exhaust its KV cache and outrun what the teacher model can serve.
Today we really just expose a `num-cpus` parameter that controls how many batches we process in parallel, but we don't actually do anything to limit how many in-flight requests we send to the inference server. Instead, we should expose a knob that controls the maximum number of parallel requests we'll send to the inference server at a time, and adjust our request handling so that we respect that concurrency limit instead of firing thousands of requests at an already backed-up inference server.
So, the proposal here is to deprecate the `num-cpus` parameter and add a new one - name to be determined, but something like `max-parallel-requests`. We'd also centralize dispatching of requests to the inference server so we can keep a client-side queue of pending requests and pipeline them to the server based on the configured `max-parallel-requests`. This would implicitly apply backpressure, since we'd never have more than N requests in-flight, and it would also give an easy knob for someone hosting teacher models across multiple machines behind a load balancer to increase the parallel requests we send across their distributed inference. A rough sketch of that dispatcher is below.
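For illustration only, here's a minimal sketch of what that centralized dispatch could look like, assuming an asyncio-based pipeline; the `BoundedDispatcher`, `generate_completion`, and `max_parallel_requests` names are hypothetical, not existing SDG code:

```python
import asyncio

# Sketch of the proposed centralized dispatcher: a semaphore caps the number
# of in-flight requests, so batches can still be prepared in parallel while
# the inference server only ever sees up to max_parallel_requests at once.

async def generate_completion(client, prompt):
    # Placeholder for the real teacher-model call (e.g. an OpenAI-compatible
    # chat/completions request against vLLM).
    ...

class BoundedDispatcher:
    def __init__(self, client, max_parallel_requests: int):
        self.client = client
        self._sem = asyncio.Semaphore(max_parallel_requests)

    async def submit(self, prompt):
        # Callers can enqueue as many prompts as they like; only
        # max_parallel_requests of them are on the wire at once, which
        # applies backpressure client-side instead of piling pending
        # requests into vLLM's queue.
        async with self._sem:
            return await generate_completion(self.client, prompt)

async def run_batch(dispatcher: BoundedDispatcher, prompts):
    # Fire off all prompts; the semaphore inside submit() does the throttling.
    return await asyncio.gather(*(dispatcher.submit(p) for p in prompts))
```

The same idea would work with a thread pool sized to `max-parallel-requests` if the generation pipeline stays synchronous; the key point is that the limit lives in one place rather than being an accident of how many batches happen to be running.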