Releases · OpenNMT/CTranslate2
CTranslate2 1.15.0
New features
- [Experimental] The Python package published on PyPI now includes GPU support. The binary is compiled with CUDA 10.1, but all CUDA dependencies are integrated in the package and do not need to be installed on the system. The only requirement should be a working GPU with driver version >= 418.39.
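As a sketch of how the new wheel might be installed and checked (the package name is from PyPI; the model directory and exact driver check are illustrative):

```shell
# Install the PyPI wheel; the CUDA 10.1 runtime libraries are bundled,
# so only the NVIDIA driver (>= 418.39) must be present on the host.
pip install ctranslate2

# Verify the installed driver version before running on GPU.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```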
Fixes and improvements
- Remove the TensorRT dependency to simplify installation and reduce memory usage:
  - Reduce GPU Docker image size by 600MB
  - Reduce memory usage on the GPU and the system by up to 1GB
  - Reduce initialization time during the first GPU translation
- Improve TopK performance on GPU for K < 5
- Improve INT8 performance on GPU
- Accept linear layers without bias when converting models
- Update Intel MKL to 2020.4
- [Python] Improve compatibility with Python 3.9
CTranslate2 1.14.0
New features
- Accept target prefix in file translation APIs
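A target prefix forces the decoder to start its output with given tokens before decoding freely. This toy sketch illustrates the idea only; it is not CTranslate2 code, and `predict_next` is a mock stand-in for a real model:

```python
def decode_with_prefix(predict_next, prefix, max_len):
    """Force the first len(prefix) output tokens, then decode freely."""
    output = []
    for step in range(max_len):
        if step < len(prefix):
            output.append(prefix[step])          # forced: copy the prefix token
        else:
            output.append(predict_next(output))  # free decoding afterwards
    return output

# A mock "model" that always predicts the token "x".
result = decode_with_prefix(lambda out: "x", ["Hallo", "Welt"], 4)
print(result)  # ['Hallo', 'Welt', 'x', 'x']
```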
Fixes and improvements
- Fix CUDA illegal memory access when changing the beam size in the same process
- Fix decoding with target prefix that sometimes did not go beyond the prefix
- Fix Intel MKL search paths on macOS
- Update Intel MKL to 2020.3
- Clarify error message when selecting a CUDA device in CPU-only builds
CTranslate2 1.13.2
Fixes and improvements
- Fix model conversion to `float16` when using the Python converters: weights were duplicated and not correctly converted
- Fix incorrect code logic that could lead to incorrect translation results in batch mode
CTranslate2 1.13.1
Fixes and improvements
- Fix performance regression when decoding with a large beam size on GPU
CTranslate2 1.13.0
New features
- Environment variable `CT2_TRANSLATORS_CORE_OFFSET` to pin parallel translators to a range of CPU cores (only for `intra_threads` = 1)
- [Python] Add some properties to the `Translator` object: `device`, `device_index`, `num_translators`, `num_queued_batches`, `model_is_loaded`
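A minimal sketch of using the new variable (the core offset value is illustrative; it only takes effect when each parallel translator runs with `intra_threads` = 1):

```shell
# Pin the parallel translators to consecutive CPU cores starting at core 8,
# then launch the translation process in the same environment.
export CT2_TRANSLATORS_CORE_OFFSET=8
```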
Fixes and improvements
- Improve batch performance of target prefix
- Improve performance when the input batch contains sentences with very different lengths
- Improve beam search performance by expanding the batch size only after the first decoding step
- Optimize Transpose op on GPU for the permutation used in multi-head attention
- Remove padding in returned attention vectors
- Update Intel MKL to 2020.2
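The beam search change above can be pictured with a toy sketch of batch shapes per decoding step (this models only the scheduling idea, not the actual implementation): the first step scores each source sentence once, and only later steps work on the tiled batch of beam candidates.

```python
def decode_shapes(batch_size, beam_size, num_steps):
    """Return the effective batch size at each decoding step."""
    shapes = []
    for step in range(num_steps):
        if step == 0:
            shapes.append(batch_size)              # no tiling on the first step
        else:
            shapes.append(batch_size * beam_size)  # expanded to all candidates
    return shapes

print(decode_shapes(8, 4, 3))  # [8, 32, 32]
```

Delaying the expansion avoids running the first (identical) forward pass `beam_size` times per sentence.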
CTranslate2 1.12.1
Fixes and improvements
- Fix implicit int16 to float16 model conversion on compatible GPUs
CTranslate2 1.12.0
Changes
- Docker images based on Ubuntu 16.04 are no longer updated
New features
- Support the `float16` data type for model conversion (with `--quantization float16`) and computation (with `--compute_type float16`). FP16 execution can improve performance by up to 50% on NVIDIA GPUs with Compute Capability >= 7.0.
- Add Docker images with newer CUDA versions, which can improve performance in some cases:
  - `latest-ubuntu18-cuda10.0` (same as `latest-ubuntu18-gpu`)
  - `latest-ubuntu18-cuda10.1`
  - `latest-ubuntu18-cuda10.2`
  - `latest-centos7-cuda10.0` (same as `latest-centos7-gpu`)
  - `latest-centos7-cuda10.1`
  - `latest-centos7-cuda10.2`
- Allow setting a computation type per device (e.g. `Translator(..., compute_type={"cuda": "float16", "cpu": "int8"})` with the Python API)
- [C++] Add a `ModelReader` interface to customize model loading
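The FP16 additions can be exercised from the command line; a sketch assuming an OpenNMT-py checkpoint named `model.pt` and the `ct2-opennmt-py-converter` entry point (the `--quantization float16` flag comes from the notes above; the output directory name is a placeholder):

```shell
# Convert a model with FP16 weights.
ct2-opennmt-py-converter --model_path model.pt --output_dir ende_ct2 \
    --quantization float16

# Pull one of the new CUDA 10.2 images.
docker pull opennmt/ctranslate2:latest-ubuntu18-cuda10.2
```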
Fixes and improvements
- Optimize Transpose op on CPU for the permutation used in multi-head attention
- Optimize GELU op on CPU with Intel MKL
- Fix compilation when targeting an architecture and disabling ISA dispatch (e.g. `-DCMAKE_CXX_FLAGS="-march=skylake" -DENABLE_CPU_DISPATCH=OFF`)
- Inline some frequently called methods
CTranslate2 1.11.0
New features
- Add tokenization and detokenization hooks for file translation APIs
- Add alternatives to Intel MKL:
  - Integrate oneDNN for GEMM functions
  - Implement vectorized operators that automatically select the instruction set architecture (ISA) (can be manually controlled with the `CT2_FORCE_CPU_ISA` environment variable)
- When alternatives are available, avoid using Intel MKL on non-Intel processors (can be manually controlled with the `CT2_USE_MKL` environment variable)
- Enable a verbose mode with the environment variable `CT2_VERBOSE=1` to help debug the run configuration (e.g. the detected CPU, whether Intel MKL is being used, etc.)
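A sketch of combining these environment variables before launching a translation process (the variable names come from the notes above; the `AVX2` value and `0` toggle are assumptions to be checked against the project documentation):

```shell
# Print the run configuration (detected CPU, chosen backend, ISA, ...).
export CT2_VERBOSE=1

# Force a specific instruction set instead of the auto-detected one.
export CT2_FORCE_CPU_ISA=AVX2

# Avoid Intel MKL even when it is available.
export CT2_USE_MKL=0
```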
Fixes and improvements
- Improve numerical precision of SoftMax and LogSoftMax layers on CPU
- Parallelize INT16 quantization/dequantization and ReLU on CPU
- Add back the translation client in CentOS 7 Docker images
CTranslate2 1.10.2
Fixes and improvements
- [Python] Fix error when calling `unload_model(to_cpu=True)` for models with shared weights
- [Python] Do not ignore errors when importing the compiled translator extension
CTranslate2 1.10.1
Fixes and improvements
- Force `intra_threads` to 1 when running a model on GPU to prevent high CPU load
- Improve handling of decoding length constraints when using a target prefix
- Do not raise an error when setting `use_vmap` but no vocabulary map exists