Support for quantized kv cache (compressed-tensors) #6028
+185 −50
Feature description

Adds the logic for loading quantized models with additional kv cache quantization, generated using the compressed-tensors framework.

Notable changes:
- `CompressedTensorsConfig` now expects to read an optional `kv_cache_scheme` argument. As of the next compressed-tensors release, this key carries the properties of the quantized kv cache (a sketch of the parsing follows this list).
- Added `BaseKVCacheMethod` and its implementation `CompressedTensorsKVCacheMethod`, which prepare the `kv_scale` attribute of the `Attention` layer at model initialization.
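To make the new config key concrete, here is a minimal sketch (not the PR's actual code) of how an optional `kv_cache_scheme` entry could be read; the example scheme keys are assumptions based on compressed-tensors' quantization arguments:

```python
from typing import Any, Dict, Optional


class CompressedTensorsConfig:
    """Minimal stand-in for the config class; attributes are illustrative."""

    def __init__(self, kv_cache_scheme: Optional[Dict[str, Any]] = None):
        # `kv_cache_scheme` describes the quantized kv cache, e.g.
        # {"num_bits": 8, "type": "float", "strategy": "tensor"} --
        # the exact keys are an assumption, not the released format.
        self.kv_cache_scheme = kv_cache_scheme

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "CompressedTensorsConfig":
        # The key is optional: checkpoints without kv cache quantization
        # simply omit it, and `get` returns None.
        return cls(kv_cache_scheme=config.get("kv_cache_scheme"))


# Usage: parse the quantization block of a checkpoint's config.json.
cfg = CompressedTensorsConfig.from_config(
    {"kv_cache_scheme": {"num_bits": 8, "type": "float"}})
```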
Current limitations

Currently, the loading of the kv cache scales happens inside the `load_model` method of each model (in this PR, only the LLaMA model), which is a bit ugly: the helper function would need to be copied into every model that should read a quantized kv cache. Something to think about after the initial round of reviews. A hypothetical sketch of such a helper follows.
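The names below are illustrative, not the PR's actual code; the sketch only shows what the per-model loading step roughly has to do: scan the checkpoint weights for kv cache scale entries and attach them to the matching attention layers.

```python
from typing import Iterable, Tuple

import torch
import torch.nn as nn


def load_kv_cache_scales(model: nn.Module,
                         weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
    """Attach serialized kv cache scales to the matching attention layers."""
    # Collect every module that exposes a `kv_scale` attribute (the
    # Attention layers prepared by CompressedTensorsKVCacheMethod).
    attn_layers = {
        name: module
        for name, module in model.named_modules()
        if hasattr(module, "kv_scale")
    }
    for name, tensor in weights:
        # Checkpoint entries look like "model.layers.0.self_attn.kv_scale"
        # (assumed naming); match them to their parent attention module.
        if name.endswith(".kv_scale"):
            layer_name = name.rsplit(".", 1)[0]
            if layer_name in attn_layers:
                attn_layers[layer_name].kv_scale = tensor.item()
```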
Also, compressed-tensors assumes separate values for `k_scale` and `v_scale`, which is not directly compatible with vLLM. I added minimal logic that reconciles the two by recomputing a single scale for the combined "kv tensor" (sketched below).
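As an illustration only (the merge rule here is an assumption, not necessarily what the PR implements), one conservative way to collapse the two scales into a single per-tensor `kv_scale` is to take their maximum:

```python
import torch


def reconcile_kv_scales(k_scale: torch.Tensor, v_scale: torch.Tensor) -> float:
    # Taking the larger scale keeps both K and V representable without
    # clipping, at the cost of slightly coarser quantization for the
    # tensor with the smaller dynamic range.
    return float(torch.max(k_scale, v_scale))
```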
The UX for the user does not change; however, they will now be able to automatically load compressed-tensors models with this new feature.