Enforce adapters cannot be loaded past --adapter-memory-fraction
#306
@@ -81,6 +81,9 @@ def info(self) -> InfoResponse:
    @abstractmethod
    def batch_type(self) -> Type[B]:
        raise NotImplementedError

    def adapter_memory_size(self) -> int:
We do want to return 0 here for other model types to ensure they still work (like Bloom), even though we won't be enforcing the adapter memory reservation.
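A minimal sketch of the pattern this comment describes, assuming hypothetical class names (`BaseModel`, `LoraCapableModel`) and a hypothetical byte count; it is not the actual LoRAX class hierarchy. The base class reports zero adapter memory so that model types without adapter accounting keep working, and only model types that track adapter weights override it.

```python
class BaseModel:
    """Hypothetical stand-in for the abstract model base class."""

    def adapter_memory_size(self) -> int:
        # Default: report no adapter memory, so the adapter memory
        # reservation is simply not enforced for this model type.
        return 0


class LoraCapableModel(BaseModel):
    """Hypothetical model type that knows the size of its adapter weights."""

    def __init__(self, adapter_weight_bytes: int) -> None:
        self.adapter_weight_bytes = adapter_weight_bytes

    def adapter_memory_size(self) -> int:
        # Override: report the bytes occupied by loaded adapter weights.
        return self.adapter_weight_bytes
```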
// Add back cost for all offload adapters
for adapter in offload_adapters.iter() {
    let queue = self.queue_map.get(adapter).unwrap().clone();
    let cost = queue.cost().unwrap();
Does this need a None check?
I had one originally, but we never add to the active set until the cost is non-None (see check below). So if this condition is violated, it should be a programming error.
return generate_pb2.DownloadAdapterResponse(
    downloaded=True,
    memory_fraction=adapter_memory_fraction
Just curious - why convert to a fraction instead of passing the actual cost?
That way I don't have to plumb the reservation amount to the router, I just always work with 1 as the reservation amount for simplicity.
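To illustrate the normalization discussed in this thread, here is a rough sketch in which the server divides the adapter size by the reserved pool size, so the router only compares sums of fractions against 1. The function names, the byte values, and the handling of an empty reservation are assumptions for illustration, not the code in this PR.

```python
def adapter_memory_fraction(adapter_bytes: int, reservation_bytes: int) -> float:
    """Express an adapter's size as a share of the adapter reservation."""
    if reservation_bytes <= 0:
        # Assumption of this sketch: with no reservation configured,
        # adapters are treated as free and nothing is enforced.
        return 0.0
    return adapter_bytes / reservation_bytes


def fits(active_fractions: list[float], new_fraction: float, budget: float = 1.0) -> bool:
    """Router-side check: the total budget is always 1, whatever the real pool size."""
    return sum(active_fractions) + new_fraction <= budget
```

With this shape, the router never needs to know how many bytes the reservation holds; a change to the reservation only changes the fractions the server reports.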
Fixes #53.
One of the main issues with our existing adapter loading and offloading strategy is that it relies on the user setting a fixed adapter limit via --max-active-adapters (default: 128). However, this doesn't account for the fact that adapter sizes can vary by orders of magnitude based on the rank. As such, the server might do well with 128 rank 8 adapters but fall over with just a handful of rank 128 adapters.

In #303, we introduced a new parameter --adapter-memory-fraction that allows setting aside a dedicated pool of GPU memory to account for adapter overhead. This prevents the KV cache from expanding past the reservation set aside for adapters, reducing memory pressure. However, because the LoRAX scheduler still works off of the max active adapters, users can still blow up the GPU memory by setting max active adapters higher than what the adapter memory fraction can accommodate.

This PR reconciles the scheduler with the adapter memory fraction. Now, the LoRAX scheduler will look at the size of each adapter after download to determine whether it can be loaded safely on the GPU. If not, the adapter will wait until enough space is freed up by other active adapters becoming idle before it can be moved to the GPU. This should ensure that no CUDA OOMs occur due to loading too many adapters.
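The admission behavior described above can be sketched roughly as follows. This is a toy model, not the LoRAX scheduler (which lives in the router): the class name `AdapterScheduler`, its methods, and the strict first-in-first-out admission order are all assumptions made for illustration. Costs are fractions of the adapter reservation, so the budget is 1.

```python
from collections import deque


class AdapterScheduler:
    """Toy admission control over a fixed adapter memory budget."""

    def __init__(self, budget: float = 1.0) -> None:
        self.budget = budget
        self.active: dict[str, float] = {}              # adapter id -> cost
        self.pending: deque[tuple[str, float]] = deque()  # waiting for space

    def used(self) -> float:
        return sum(self.active.values())

    def request(self, adapter_id: str, cost: float) -> bool:
        """Queue a downloaded adapter; return True if it is active afterwards."""
        self.pending.append((adapter_id, cost))
        self._admit()
        return adapter_id in self.active

    def mark_idle(self, adapter_id: str) -> None:
        """Offload an idle adapter and give its space to waiting adapters."""
        self.active.pop(adapter_id, None)
        self._admit()

    def _admit(self) -> None:
        # Assumed policy: admit in arrival order and stop at the first
        # adapter that does not fit, so later arrivals cannot jump ahead.
        while self.pending and self.used() + self.pending[0][1] <= self.budget:
            adapter_id, cost = self.pending.popleft()
            self.active[adapter_id] = cost
```

In this sketch, an adapter whose cost exceeds the whole budget would wait forever and block everything behind it; a real scheduler would need to reject or otherwise handle that case.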
Note that with this change, the max active adapters may no longer be needed, but we will keep it around for now to avoid backwards incompatible changes. However, the new default of 1024 should be sufficiently high that it won't be used in most cases.
The new default --adapter-memory-fraction will be 0.1, meaning 10% of GPU memory will be set aside for adapters. To go back to the previous behavior, users can set the following parameters:

But in general it is recommended to avoid modifying max active adapters going forward and instead tune adapter memory fraction to find the right balance between KV cache size and concurrent adapters.