We've observed users overwhelming their inference server, primarily when running data generation in batches against vLLM. What happens is we end up piling thousands of pending requests onto vLLM, causing inference timeouts as we exhaust its KV cache and outrun what the teacher model can serve.
Today we really just expose a `num-cpus` parameter that controls how many batches we process in parallel, but we don't actually do anything to limit how many in-flight requests we send to the inference server. Instead, we should expose a knob that controls the maximum number of parallel requests we'll send to the inference server at a time, and adjust our request handling so that we respect that concurrency limit instead of firing thousands of requests at an already backed-up inference server.
So, the proposal here is to deprecate the `num-cpus` parameter and add a new one - name to be determined, but something like `max-parallel-requests`. We'd also centralize dispatching of requests to the inference server so we can keep a client-side queue of pending requests and pipeline them to the server based on the configured `max-parallel-requests`. This would implicitly apply backpressure, since we'd never have more than N requests in-flight, and it would also give an easy knob for someone hosting teacher models across multiple machines behind a load balancer to increase the parallel requests we send across their distributed inference. A rough sketch of that dispatcher is below.
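For illustration only, here's a minimal sketch of what that centralized dispatch could look like, assuming an asyncio-based pipeline; the `BoundedDispatcher`, `generate_completion`, and `max_parallel_requests` names are hypothetical, not existing SDG code:

```python
import asyncio

# Sketch of the proposed centralized dispatcher: a semaphore caps the number
# of in-flight requests, so batches can still be prepared in parallel while
# the inference server only ever sees up to max_parallel_requests at once.

async def generate_completion(client, prompt):
    # Placeholder for the real teacher-model call (e.g. an OpenAI-compatible
    # chat/completions request against vLLM).
    ...

class BoundedDispatcher:
    def __init__(self, client, max_parallel_requests: int):
        self.client = client
        self._sem = asyncio.Semaphore(max_parallel_requests)

    async def submit(self, prompt):
        # Callers can enqueue as many prompts as they like; only
        # max_parallel_requests of them are on the wire at once, which
        # applies backpressure client-side instead of piling pending
        # requests into vLLM's queue.
        async with self._sem:
            return await generate_completion(self.client, prompt)

async def run_batch(dispatcher: BoundedDispatcher, prompts):
    # Fire off all prompts; the semaphore inside submit() does the throttling.
    return await asyncio.gather(*(dispatcher.submit(p) for p in prompts))
```

The same idea would work with a thread pool sized to `max-parallel-requests` if the generation pipeline stays synchronous; the key point is that the limit lives in one place rather than being an accident of how many batches happen to be running.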