[3.1.27]

Changed

  • Allow torch 1.13 in requirements.txt.
  • Replaced deprecated torch.testing.assert_allclose with torch.testing.assert_close for PyTorch 1.14 compatibility.
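
A minimal before/after sketch of the rename (the tensors here are placeholders, not taken from Sockeye's test suite):

    import torch

    expected = torch.tensor([1.0, 2.0, 3.0])
    actual = expected + 1e-7

    # Deprecated (and removed in newer PyTorch releases):
    # torch.testing.assert_allclose(actual, expected)

    # Replacement; default rtol/atol are chosen based on the tensor dtype.
    torch.testing.assert_close(actual, expected)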

[3.1.26]

Added

  • Added the --tf32 0|1 boolean device option (torch.backends.cuda.matmul.allow_tf32),
    enabling TensorFloat-32: transparent float32 matmul acceleration with 10-bit
    mantissa precision (19 bits total). Defaults to true for backward compatibility
    with torch < 1.12. Training continuation may use a different --tf32 setting
    than the original run.
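
A rough sketch of what the flag toggles; the comments mapping flag values to code are illustrative, only the backend setting itself is named in the release notes:

    import torch

    # --tf32 1 (default): allow TensorFloat-32 matmuls on Ampere and newer GPUs,
    # trading some float32 mantissa precision for throughput.
    torch.backends.cuda.matmul.allow_tf32 = True

    # --tf32 0: keep full-precision float32 matmuls.
    # torch.backends.cuda.matmul.allow_tf32 = False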

Changed

  • device.init_device() is now called by train, translate, and score.
  • Allow torch 1.12 in requirements.txt.

[3.1.25]

Changed

  • Updated to sacrebleu==2.3.1. Changed default BLEU floor smoothing offset from 0.01 to 0.1.
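
For reference, the new smoothing default corresponds roughly to the following use of sacrebleu's public API (a sketch, not Sockeye's wrapper code):

    from sacrebleu.metrics import BLEU

    # BLEU with floor smoothing; the default offset changed from 0.01 to 0.1.
    bleu = BLEU(smooth_method="floor", smooth_value=0.1)
    score = bleu.corpus_score(
        ["the cat sat on the mat"],   # hypotheses
        [["the cat is on the mat"]],  # one reference stream
    )
    print(score.score)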

[3.1.24]

Fixed

  • Updated DeepSpeed checkpoint conversion to support newer versions of DeepSpeed.

[3.1.23]

Changed

  • Change decoder softmax size logging level from info to debug.

[3.1.22]

Added

  • Log the average output vocabulary size during beam search.

Changed

  • Introduced a common Search base class for GreedySearch and BeamSearch (see the sketch after this list).
  • .pylintrc: suppress warnings about deprecated pylint warning suppressions.
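
A minimal sketch of the refactoring pattern; the constructor arguments below are illustrative rather than Sockeye's actual interface:

    import torch as pt

    class Search(pt.nn.Module):
        # Hypothetical shared base holding state common to both strategies.
        def __init__(self, dtype: pt.dtype, bos_id: int, eos_id: int, device: pt.device) -> None:
            super().__init__()
            self.dtype = dtype
            self.bos_id = bos_id
            self.eos_id = eos_id
            self.device = device

    class GreedySearch(Search):
        ...

    class BeamSearch(Search):
        ...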

[3.1.21]

Fixed

  • sockeye-translate now passes the skip_nvs and nvs_thresh arguments to the Translator constructor instead of ignoring them.

[3.1.20]

Added

  • Added training support for DeepSpeed.
    • Installation: pip install deepspeed
    • Usage: deepspeed --no_python ... sockeye-train ...
    • DeepSpeed mode uses Zero Redundancy Optimizer (ZeRO) stage 1 (Rajbhandari et al., 2019).
    • Run in FP16 mode with --deepspeed-fp16 or BF16 mode with --deepspeed-bf16.

[3.1.19]

Added

  • Clean up GPU and CPU memory used during training initialization before starting the main training loop.

Changed

  • Refactored training code in advance of adding DeepSpeed support:
    • Moved logic for flagging interleaved key-value parameters from layers.py to model.py.
    • Refactored LearningRateScheduler API to be compatible with PyTorch/DeepSpeed.
    • Refactored optimizer and learning rate scheduler creation to be modular.
    • Migrated to ModelWithLoss API, which wraps a Sockeye model and its losses in a single module.
    • Refactored primary and secondary worker logic to reduce redundant calculations.
    • Refactored code for saving/loading training states.
    • Added utility code for managing model/training configurations.

Removed

  • Removed unused training option --learning-rate-t-scale.

[3.1.18]

Added

  • Added sockeye-train and sockeye-translate option --clamp-to-dtype that clamps outputs of transformer attention, feed-forward networks, and process blocks to the min/max finite values for the current dtype. This can prevent inf/nan values from overflow when running large models in float16 mode. See: https://discuss.huggingface.co/t/t5-fp16-issue-is-fixed/3139
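
The clamping itself amounts to something like the following sketch (the helper name is hypothetical, not Sockeye's exact implementation):

    import torch

    def clamp_to_dtype(x: torch.Tensor) -> torch.Tensor:
        # Clamp to the finite range of x's dtype so float16 overflow saturates
        # instead of producing inf/nan.
        finfo = torch.finfo(x.dtype)
        return x.clamp(min=finfo.min, max=finfo.max)

    x = torch.full((2, 2), 70000.0).to(torch.float16)  # overflows to inf
    print(clamp_to_dtype(x))                           # saturated at 65504.0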

[3.1.17]

Added

  • Added support for offline model quantization with sockeye-quantize.
    • Pre-quantizing a model avoids the load-time memory spike of runtime quantization. For example, a float16 model loads directly as float16 instead of loading as float32 then casting to float16.
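
The memory argument can be illustrated with plain PyTorch (a sketch of the general idea; sockeye-quantize's actual options and file names may differ):

    import torch

    # Offline: cast float32 parameters to float16 once and save the result.
    state = torch.load("params.float32", map_location="cpu")
    state_fp16 = {name: (t.half() if t.is_floating_point() else t)
                  for name, t in state.items()}
    torch.save(state_fp16, "params.float16")

    # At load time the parameters are read directly as float16, avoiding the
    # transient float32 copy that runtime casting would create.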

[3.1.16]

Added

  • Added nbest list reranking options using isometric translation criteria as proposed in an ICASSP 2021 paper https://arxiv.org/abs/2110.03847.
    To use this feature pass a criterion (isometric-ratio, isometric-diff, isometric-lc) when specifying --metric.
  • Added --output-best-non-blank to output non-blank best hypothesis from the nbest list.

[3.1.15]

Fixed

  • Fixed the type of valid_length to be pt.Tensor instead of Optional[pt.Tensor] = None for jit tracing.
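
A minimal sketch of why the annotation matters: torch.jit.trace records concrete tensor inputs, so an argument that participates in tracing should be a required pt.Tensor rather than Optional with a None default (the module and shapes below are hypothetical):

    import torch as pt

    class MaskByLength(pt.nn.Module):
        # Hypothetical module: valid_length is a required pt.Tensor, not Optional.
        def forward(self, data: pt.Tensor, valid_length: pt.Tensor) -> pt.Tensor:
            positions = pt.arange(data.shape[1], device=data.device)[None, :]
            mask = (positions < valid_length[:, None]).to(data.dtype)
            return data * mask.unsqueeze(2)

    traced = pt.jit.trace(MaskByLength(), (pt.rand(2, 5, 8), pt.tensor([5, 3])))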