Releases: turboderp/exllamav2
0.2.6
- Some small fixes, most notably for Qwen2-VL inference on Windows
Full Changelog: v0.2.5...v0.2.6
0.2.5
- Initial support for Qwen2-VL (images for now, no video)
- Some bugfixes
Full Changelog: v0.2.4...v0.2.5
0.2.4
- Support Pixtral
- Refactoring for more multimodal support
- Faster filter evaluation
- Various optimizations and bugfixes
- Various quality of life improvements
Full Changelog: v0.2.3...v0.2.4
0.2.3
- No longer use the safetensors library for loading weights (fixes virtual memory issues, especially on Windows)
- Disable fasttensors option (now redundant)
- Prioritize the HF Tokenizers model when both HF and SPM models are available
- Add XTC sampler
- Add YaRN support
- Various fixes and QoL improvements
Full Changelog: v0.2.2...v0.2.3
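The XTC (Exclude Top Choices) sampler added in 0.2.3 can be sketched roughly as below. This is a minimal illustration of the general XTC idea, not exllamav2's actual implementation; the function name, parameter names and defaults are all assumptions:

```python
import random

def xtc_sample(probs, threshold=0.1, probability=0.5, rng=random):
    """Sketch of Exclude Top Choices (XTC) sampling.

    probs: dict mapping token -> probability (assumed normalized).
    With chance `probability`, find all tokens whose probability is at
    least `threshold`; if there are two or more, remove all of them
    except the least likely one, renormalize, and sample. Otherwise
    sample from the distribution unchanged.
    """
    if rng.random() < probability:
        above = [t for t, p in probs.items() if p >= threshold]
        if len(above) >= 2:
            above.sort(key=lambda t: probs[t])
            keep = above[0]  # least probable token above threshold survives
            probs = {t: p for t, p in probs.items()
                     if t == keep or p < threshold}
            total = sum(probs.values())
            probs = {t: p / total for t, p in probs.items()}
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights)[0]
```

The effect is to cut off the most predictable continuations while leaving low-probability tokens untouched, which is why it is typically applied with a probability rather than on every step.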
0.2.2
- Small fixes related to LMFE
- Allow SDPA during normal inference with custom bias
Full Changelog: v0.2.1...v0.2.2
0.2.1
- TP: fallback SDPA mode when flash-attn is unavailable
- Faster filter/grammar path
- Add DRY
- Fix issues since 0.1.9 (streams/graphs) when loading certain models via Tabby
- Banish Râul
Full Changelog: v0.2.0...v0.2.1
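The DRY ("Don't Repeat Yourself") sampler added in 0.2.1 penalizes tokens that would extend a sequence already present in the context. A minimal sketch of the idea follows; it is not exllamav2's implementation, and the function name and default parameters are assumptions:

```python
def dry_penalty(context, logits, multiplier=0.8, base=1.75, allowed_length=2):
    """Sketch of a DRY-style repetition penalty.

    context: list of token ids generated so far.
    logits: dict mapping candidate token -> logit.
    For each candidate, find the longest context suffix such that
    (suffix + candidate) already occurs earlier in the context; if that
    match length exceeds `allowed_length`, subtract
    multiplier * base ** (match_length - allowed_length) from its logit.
    """
    penalized = dict(logits)
    n = len(context)
    for tok in logits:
        best = 0
        for i in range(n):
            if context[i] != tok:
                continue
            # length of the match ending just before position i that
            # also ends at the current end of the context
            k = 0
            while k < i and context[i - 1 - k] == context[n - 1 - k]:
                k += 1
            best = max(best, k)
        if best > allowed_length:
            penalized[tok] -= multiplier * base ** (best - allowed_length)
    return penalized
```

The exponential growth in the penalty means short incidental repeats are tolerated while verbatim loops are suppressed quickly.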
0.2.0
Small release to fix various issues in 0.1.9
Full Changelog: v0.1.9...v0.2.0
0.1.9
- Add experimental tensor-parallel mode. Currently supports Llama (1, 2 and 3), Qwen2 and Mistral models
- CUDA Graphs to reduce overhead and CPU bottlenecking
- Various other optimizations
- Some bugfixes
Full Changelog: v0.1.8...v0.1.9
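The core idea behind the tensor-parallel mode added in 0.1.9 is to shard a layer's weights across devices, compute each shard's partial result locally, and gather the pieces. A toy single-process sketch of row-wise sharding for a matrix-vector product, with hypothetical names (real tensor parallelism runs the shards on separate GPUs and gathers over an interconnect):

```python
def matvec(W, x):
    """Reference dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tp_matvec(W, x, shards=2):
    """Toy tensor-parallel sketch: split the output rows of W across
    `shards` devices, compute each partial product independently, then
    concatenate the results (an all-gather in a real implementation)."""
    n = len(W)
    chunk = (n + shards - 1) // shards
    parts = [matvec(W[i:i + chunk], x) for i in range(0, n, chunk)]
    out = []
    for part in parts:
        out.extend(part)
    return out
```

Since each shard only holds and multiplies its own rows, both the weight memory and the compute are divided across devices, at the cost of communication to reassemble the output.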
0.1.8
- Support Llama 3.1 (correct RoPE scaling etc.)
- Support IndexTeam architecture
- Some bugfixes and QoL improvements
Full Changelog: v0.1.7...v0.1.8
0.1.7
- Support Gemma2
- Support InternLM2
- Various bugfixes and optimizations
Full Changelog: v0.1.6...v0.1.7