KV cache SQuat #1
Motivation
The KV cache stores the key and value tensors of previous tokens so they do not have to be recomputed at every decoding step. For long sequences, the KV cache can consume more GPU memory than the model weights themselves. LLM decoding at inference time is memory-bound, with most of the time spent transferring data rather than computing. This has led to active research on KV cache quantization, but quantization errors can accumulate as more tokens are generated, causing later tokens to deviate from the expected outputs.
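For intuition, a rough back-of-the-envelope calculation (the model configuration below is assumed, roughly Llama-2-7B-like, and not taken from this PR):

```python
# Assumed, illustrative numbers: 32 layers, 32 KV heads, head_dim 128, fp16.
num_layers, num_kv_heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per_elem = 128_000, 1, 2

# Factor of 2 for storing both keys and values.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")  # ~67 GB, vs ~14 GB of fp16 weights for a 7B model
```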
This PR
This PR adds the state-of-the-art training-free KV cache quantization method: SQuat (Subspace-orthogonal KV cache quantization). It can significantly reduce memory overhead and latency while maintaining model accuracy.
SQuat constructs a subspace that captures critical task-relevant information, then constrains the quantization error to lie orthogonal to this subspace, minimizing its effect on the output of the attention mechanism.
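To illustrate why the orthogonality constraint matters (this is a conceptual sketch only, not the SQuat quantizer itself; the subspace here is random and the "correction" is stored in full precision purely for demonstration): if the residual error on a cached key is orthogonal to the subspace, then any query lying in that subspace produces exactly the same attention score as with the unquantized key.

```python
import torch

torch.manual_seed(0)
head_dim, rank = 128, 8

# Orthonormal basis S of a (hypothetical) task-relevant subspace, e.g. one
# built from the dominant directions of the prompt's query states.
S, _ = torch.linalg.qr(torch.randn(head_dim, rank))

k = torch.randn(head_dim)                        # a key vector to be cached
scale = k.abs().max() / 7                        # naive 4-bit round-to-nearest
k_hat = torch.clamp(torch.round(k / scale), -8, 7) * scale

# Move the in-subspace component of the error back into the cached key, so the
# remaining error lies entirely in the orthogonal complement of S.
err = k - k_hat
k_hat_orth = k_hat + S @ (S.T @ err)

q = S @ torch.randn(rank)                        # any query inside the subspace
print((q @ k - q @ k_hat).abs())                 # naive quantization perturbs the score
print((q @ k - q @ k_hat_orth).abs())            # ~0: orthogonal error leaves it intact
```

The actual method achieves this property without storing a full-precision correction; the sketch only demonstrates the effect the orthogonality constraint is designed to produce.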
🌟 Highlights
⚡ Efficient
🏃🏻 Example
Run example.py or: