
Support for quantized kv cache (compressed-tensors) #6028

Open

dbogunowicz wants to merge 9 commits into main
Conversation


@dbogunowicz commented on Jul 1, 2024

Feature description

Adds the logic for loading quantized models with additional kv cache quantization, generated using the compressed-tensors framework.

Notable changes:

  • CompressedTensorsConfig now expects to read an optional kv_cache_scheme argument. As of the next compressed-tensors release, this key carries the properties of the quantized kv cache.
  • Added an interface BaseKVCacheMethod and its implementation CompressedTensorsKVCacheMethod, which prepare the kv_scale attribute of the Attention layer during model initialization (see the sketch after this list).
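
For orientation, here is a minimal Python sketch of how these two pieces could fit together. The field names inside the scheme dict and the create_weights signature are illustrative assumptions, not the exact code in this PR:

```python
# Sketch only: how an optional kv_cache_scheme entry could be surfaced as a
# quantize-method-style object that prepares kv_scale on the Attention layer.
# The scheme field names ("num_bits", "type", "symmetric") and the
# create_weights signature are illustrative assumptions.
from typing import Any, Dict, Optional

import torch


class BaseKVCacheMethod:
    """Interface for preparing the kv cache scale of an Attention layer."""

    def create_weights(self, layer: torch.nn.Module) -> None:
        # Start with kv_scale = 1.0 (no scaling); the real value is filled in
        # later, when the checkpoint's kv cache scales are loaded.
        layer.kv_scale = torch.nn.Parameter(torch.tensor(1.0),
                                            requires_grad=False)


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Binds the kv_cache_scheme read from the compressed-tensors config."""

    def __init__(self, kv_cache_scheme: Optional[Dict[str, Any]]):
        # e.g. {"num_bits": 8, "type": "float", "symmetric": True}
        self.kv_cache_scheme = kv_cache_scheme
```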

Current limitations

Currently, the loading of the kv cache scales happens inside the model's load_model method (in this PR, only the LLaMa model), which is a bit ugly: the helper function has to be copied into every model that should read a quantized kv cache (a sketch of such a helper follows below). Something to think about after the initial round of reviews.
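
For illustration, a rough sketch of the kind of helper that currently has to be duplicated per model; the function name, weight-name suffixes, and module lookup are assumptions rather than the PR's exact implementation:

```python
# Sketch of a per-model helper (name, suffix handling and module lookup are
# assumptions) that intercepts k_scale / v_scale entries in the checkpoint
# stream and attaches them to the matching Attention module instead of
# sending them through the regular weight-loading path.
from typing import Optional

import torch


def maybe_remap_kv_scale(name: str,
                         loaded_weight: torch.Tensor,
                         model: torch.nn.Module) -> Optional[str]:
    """Return None if the weight was consumed as a kv cache scale."""
    modules = dict(model.named_modules())
    for suffix in ("k_scale", "v_scale"):
        if name.endswith("." + suffix):
            layer_name = name[: -(len(suffix) + 1)]
            if layer_name in modules:
                # Store the per-tensor scale on the attention module; the two
                # scales are reconciled into a single kv_scale afterwards.
                setattr(modules[layer_name], suffix, loaded_weight.item())
                return None
    return name  # not a kv cache scale; load it normally
```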

Also, compressed-tensors assumes separate values for k_scale and v_scale, which is not directly compatible with vLLM. I added minimal logic to reconcile the two scales into the single scale that vLLM expects for the kv tensor; see the sketch below.
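
A minimal sketch of that reconciliation, assuming the conservative choice of taking the larger of the two per-tensor scales (the function name is illustrative):

```python
# Minimal sketch of the scale reconciliation; the function name is mine and
# "take the maximum" is an assumed conservative policy, not necessarily the
# exact rule used in this PR.


def reconcile_kv_scale(k_scale: float, v_scale: float) -> float:
    # vLLM keeps a single kv_scale per Attention layer, while compressed-tensors
    # checkpoints carry separate scales for K and V. Using the maximum keeps
    # both tensors representable without clipping, at the cost of slightly
    # coarser quantization for whichever tensor has the smaller scale.
    return max(k_scale, v_scale)
```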

The UX does not change; with this new feature, users will simply be able to load compressed-tensors models with a quantized kv cache automatically.
