v0.0.21: Expand caching support for inference, GQA training support, TGI improved performance

@JingyaHuang JingyaHuang released this 09 Apr 08:46
· 105 commits to main since this release

What's Changed

Training

  • Add GQA optimization for Tensor Parallel training to support the case tp_size > num_key_value_heads by @michaelbenayoun in #498
  • Mixed-precision training with either torch_xla or torch.autocast by @michaelbenayoun in #523
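
To illustrate the tp_size > num_key_value_heads case from the GQA item above: when there are more tensor-parallel ranks than key/value heads, each KV head has to be shared by several ranks. The sketch below only shows the arithmetic of that sharing. It is not the code from #498; the function names are hypothetical, and it assumes query heads are sharded contiguously across ranks and that the larger of the two sizes is divisible by the smaller.

```python
from typing import List


def kv_replication_factor(tp_size: int, num_key_value_heads: int) -> int:
    """Number of TP ranks that share one KV head (hypothetical helper)."""
    if tp_size <= num_key_value_heads:
        # Usual case: KV heads are simply split across ranks, no replication.
        if num_key_value_heads % tp_size != 0:
            raise ValueError("num_key_value_heads must be divisible by tp_size")
        return 1
    # tp_size > num_key_value_heads: each KV head is replicated.
    if tp_size % num_key_value_heads != 0:
        raise ValueError("tp_size must be divisible by num_key_value_heads")
    return tp_size // num_key_value_heads


def kv_heads_for_rank(rank: int, tp_size: int, num_key_value_heads: int) -> List[int]:
    """KV head indices held by a given rank, assuming contiguous sharding."""
    if not 0 <= rank < tp_size:
        raise ValueError("rank must be in [0, tp_size)")
    factor = kv_replication_factor(tp_size, num_key_value_heads)
    if factor == 1:
        per_rank = num_key_value_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))
    # Consecutive groups of `factor` ranks hold a copy of the same KV head.
    return [rank // factor]


if __name__ == "__main__":
    # 8 TP ranks, 2 KV heads: ranks 0-3 hold head 0, ranks 4-7 hold head 1.
    for r in range(8):
        print(r, kv_heads_for_rank(r, tp_size=8, num_key_value_heads=2))
```

With replicated KV heads, the gradients of the copies presumably need to be kept consistent across the ranks that share a head; the sketch leaves that part out.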

Inference

  • Add caching support for traced TorchScript models (e.g. encoders, stable diffusion models) by @JingyaHuang in #510
  • Support phi model on feature-extraction, text-classification, token-classification tasks by @JingyaHuang in #509

TGI

Caveat

AWS Neuron SDK 2.18 doesn't support compiling SDXL's UNet with weights / NEFF separation, so inline_weights_to_neff=True is forced for it by the following change:

  • Disable weights / neff separation of SDXL's UNET for neuron sdk 2.18 by @JingyaHuang in #554

Other changes

New Contributors

Full Changelog: v0.0.20...v0.0.21