Releases · NVIDIA/kvpress
v0.2.1
v0.2.0
Transformers v4.48 introduced breaking changes that are handled in this release. The release also features `AdaKVPress`, the first press allowing head-wise compression, implemented by patching the attention functions registered in `ALL_ATTENTION_FUNCTIONS` since v4.48. When combined with `ExpectedAttentionPress`, `AdaKVPress` achieved the best results observed yet on the RULER benchmark (see this post).
v0.1.1
What's Changed
- #33 by @SimJeg fixes a small bug in the pipeline
- #36 by @maxjeblick pins the dependency to `transformers<4.48`
Full Changelog: v0.1.0...v0.1.1
v0.1.0
#24 by @maxjeblick and #29 by @SimJeg introduce a non-breaking refactoring:
- A press no longer requires the `compression_ratio` input argument, since some presses do not explicitly need it (e.g. `ThinKPress`, `SimLayerKVPress`). However, every press must have a `compression_ratio` attribute after any forward pass (an assertion was added in tests) to allow measuring the average compression ratio on a benchmark.
- The core compression logic has been moved from `BasePress.forward_hook` to `BasePress.compress`. `BasePress.forward_hook` now only checks whether `compress` must be called (pre-filling vs. decoding), de-quantizes the cache before `compress`, and re-quantizes it afterwards.
- `BasePress` no longer implements a `score` method; it has been moved to `ScorerPress`, along with the associated `ScorerPress.compress` method (a minimal sketch of the new structure follows this list).
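To make the new split concrete, here is a minimal sketch of a custom press under this structure. `RandomScorePress` is a hypothetical example, and the exact `score` signature is an assumption inferred from these notes:

```python
from dataclasses import dataclass

import torch

from kvpress import ScorerPress


@dataclass
class RandomScorePress(ScorerPress):
    """Hypothetical press: assigns random scores, so eviction is random."""

    # Every press must expose this attribute after a forward pass.
    compression_ratio: float = 0.25

    def score(self, module, hidden_states, keys, values, attentions, kwargs):
        # Return one score per key/value pair, shape (batch, kv_heads, seq_len);
        # the shared compress step then keeps the highest-scoring pairs.
        return torch.rand(*keys.shape[:-1], device=keys.device)
```

Under this split, a new press only has to define how key/value pairs are ranked, while the shared `ScorerPress.compress` handles the actual eviction.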
Other features:
- Add `SimLayerKVPress`, #28 by @SimJeg and @dame-cell
- Add `ComposedPress`, #29 by @SimJeg (see the sketch after this list)
- Add `KeyReRotationPress`, #31 by @maxjeblick and @giulio98
- Fix `QuantizedCache`, #30 by @maxjeblick
- Add new tests, including an integration test on a sample from RULER
v0.0.4
v0.0.3
v0.0.2
Initial release
v0.0.1
Install poetry in workflows (#1)