⚗️ Implement distributed adapter cache #201
Closed
Description
vLLM currently does not support a distributed adapter cache: every replica of a deployment must receive an explicit `/v1/load_lora_adapter` call to load an adapter. This PR implements the existing `ADAPTER_CACHE` logic on top of vLLM's HTTP server by injecting middleware that detects whether the `model` field of a request references a set of files from the cache and, if so, pre-loads the adapter before continuing with the call. It also wraps the `/v1/models` endpoint to pre-load all adapters from the cache, so that the response is consistent across all replicas of a deployment.

Looking for some feedback here: I would like to solve this upstream, but this gives us a quick way to roll out distributed LoRA adapters to users that matches existing TGIS behavior.
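To make the idea concrete, here is a minimal sketch of the middleware described above. This is not the PR's actual code: the cache layout, the `loader` callback, and the simplified request interface are all illustrative assumptions; the real implementation would hook into vLLM's FastAPI app and its LoRA loading path.

```python
import os

class AdapterCacheMiddleware:
    """Sketch: if the request's "model" field names a directory in the
    adapter cache, pre-load that adapter before forwarding the request.

    `app` is the downstream handler (simplified here to a plain callable
    taking a dict), and `loader` stands in for whatever call actually
    registers a LoRA adapter with the server (hypothetical interface).
    """

    def __init__(self, app, cache_dir, loader):
        self.app = app
        self.cache_dir = cache_dir
        self.loader = loader
        self.loaded = set()  # adapters already registered on this replica

    def list_cached_adapters(self):
        # Used when wrapping /v1/models: pre-loading everything in the
        # cache keeps the model list consistent across replicas.
        return sorted(
            name for name in os.listdir(self.cache_dir)
            if os.path.isdir(os.path.join(self.cache_dir, name))
        )

    def __call__(self, request):
        model = request.get("model")
        if model and model not in self.loaded:
            path = os.path.join(self.cache_dir, model)
            if os.path.isdir(path):
                # Pre-load the adapter, then continue with the call.
                self.loader(model, path)
                self.loaded.add(model)
        return self.app(request)
```

A replica that never saw an explicit `/v1/load_lora_adapter` call would still serve the adapter, because the middleware loads it lazily on first reference.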
How Has This Been Tested?
Merge criteria: