Releases: OpenNMT/CTranslate2
Releases · OpenNMT/CTranslate2
CTranslate2 3.15.1
Fixes and improvements
- Fix an error when using the new
static_prompt
argument in the methodsgenerate_tokens
andgenerate_batch
- Improve the performance of models using ALiBi
CTranslate2 3.15.0
New features
- Initial support of encoder-only Transformer model via a new class
ctranslate2.Encoder
- Update the Transformers converter to support the Falcon models
- Add a generation argument
static_prompt
to optimize the execution for models using system prompts: the model state for this prompt is cached and reused in future calls - Support early stopping in greedy search when the callback function returns
True
- Make the layer norm epsilon value configurable in the model configuration file
config.json
- Add Tanh as a possible activation function
Fixes and improvements
- Fix a performance issue when running models using ALiBi on the GPU
- Fix application of the rotary embeddings when the multi-query attention is used
- Fix conversion of Marian models using
tied-embeddings-all: false
- Remove
use_fast
argument when loading Hugging Face tokenizers to use the default tokenizer for the model
CTranslate2 3.14.0
New features
- Update the Transformers converter with new architectures:
- CodeGen
- GPTBigCode
- LLaMa
- MPT
- Update the OpenNMT-py converter to support some recent options:
layer_norm="rms"
max_relative_positions=-1
(rotary embeddings)max_relative_positions=-2
(ALiBi)pos_ffn_activation_fn="silu"
- Update the OpenNMT-tf converter to support models using different configurations for the encoder and decoder (e.g. post-norm in the encoder and pre-norm in the decoder)
- Implement the multi-query attention (used by GPTBigCode)
Fixes and improvements
- Support paths containing Unicode characters on Windows
- Fix the
generate_tokens
method to properly raise the underlying exception instead of hanging indefinitely - Fix compilation error when using
-DBUILD_SHARED_LIBS=OFF
- Fix runtime errors when linking against
libctranslate2.a
without using the "whole archive" flags
CTranslate2 3.13.0
New features
- Support conversion of GPT-NeoX models with the Transformers converter
- Extend the
end_token
argument to also accept a list of tokens - Add option
return_end_token
to include the end token in the results of the methodsgenerate_batch
andtranslate_batch
(by default the end token is removed) - Expose the
callback
argument for the methodsgenerate_batch
andtranslate_batch
to get early results from the decoding loop - Fallback to a custom threading implementation when OpenMP is not used (which is currently the case for the macOS ARM64 Python wheels)
- Define the CMake package
CTranslate2::ctranslate2
to facilitate the library integration in other CMake projects
Fixes and improvements
- Fix the vocabulary loading when some tokens end with the carriage return
- Implement a fused kernel to apply the rotary embeddings
- Update the Ruy library to commit 363f2522
CTranslate2 3.12.0
New features
- Add methods
Generator.generate_tokens
andTranslator.generate_tokens
returning a generator that yields tokens as soon as they are generated by the model (not compatible with beam search) - Improve performance of rotary embeddings on CPU with an alternative implementation that is enabled when setting
rotary_interleave=False
in the model specification (may require to permute QK weights) - Support a variable number of input frames in method
Whisper.align
to improve batch support - Expose flag
low_cpu_mem_usage
in the Transformers converter to reduce the memory usage when loading large models (requires the packageaccelerate
)
Fixes and improvements
- Fix crash in
Whisper.align
whennum_frames // 2 <= median_filter_width
- Raise an error if arguments
end_token
orsuppress_sequences
contain tokens that are not in the vocabulary - Optimize the quantization of FP16 weights during the model conversion
- In the Transformers converter, also load the model weights in FP16 when the selected quantization is
int8_float16
- Update the Whisper timestamp decoding rules to prevent the generation of segments with zero duration
CTranslate2 3.11.0
Changes
- The Python wheels for macOS ARM are now built with the Ruy backend to support INT8 computation. This will change the performance and results when loading an INT8 model and/or using the
auto
compute type. To keep the previous behavior, setcompute_type="float32"
.
New features
- Support conversion of the GPT-J architecture
- Support conversion of models using rotary position embeddings
- Apply the new OpenNMT-py option
decoder_start_token
- Add option
revision
in the Transformers converter to download a specific revision of the model from the Hugging Face Hub
CTranslate2 3.10.3
Fixes and improvements
- Fix a synchronization issue when the model input is a CUDA storage
CTranslate2 3.10.2
Fixes and improvements
- Select the correct device when copying a
StorageView
instance
CTranslate2 3.10.1
Fixes and improvements
- Add missing device setter in
Whisper.encode
CTranslate2 3.10.0
New features
- Add
Generator
optioninclude_prompt_in_result
(True
by default) - Add method
Whisper.encode
to only run the Whisper encoder - Add model properties
Whisper.device
andWhisper.device_index
Fixes and improvements
- Update the methods
Whisper.detect_language
,Whisper.generate
, andWhisper.align
to accept the encoder output - Fix a crash when running
Generator.forward
on GPU and the generator object is destroyed before the forward output - Fix parsing of Marian YAML vocabulary files containing "complex key mappings" and escaped sequences such as "\x84"