
[2.1.7]

Changed

  • Optimized prepare_data by saving shards in parallel. The prepare_data script accepts a new parameter --max-processes to control the level of parallelism with which shards are written to disk.
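
A minimal invocation sketch for reference; the data paths and process count are placeholders, and the exact flag set may differ across Sockeye versions:

    # Prepare and shard training data, writing shards with up to
    # 4 parallel processes (paths below are illustrative placeholders).
    python -m sockeye.prepare_data \
        --source train.src --target train.trg \
        --output prepared_data \
        --max-processes 4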

[2.1.6]

Changed

  • Updated Dockerfiles optimized for CPU (intgemm int8 inference, full MKL support) and GPU (distributed training with Horovod). See sockeye_contrib/docker.

Added

  • Official support for int8 quantization with intgemm:
    • This requires the "intgemm" fork of MXNet (kpuatamazon/incubator-mxnet/intgemm). This is the version of MXNet used in the Sockeye CPU docker image (see sockeye_contrib/docker).
    • Use sockeye.translate --dtype int8 to quantize a trained float32 model at runtime.
    • Use the sockeye.quantize CLI to annotate a float32 model with int8 scaling factors for fast runtime quantization.
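
As an illustration of both entry points (the argument names for sockeye.quantize are assumptions; consult the CLI's --help for the actual interface):

    # Quantize a trained float32 model to int8 at translation time.
    python -m sockeye.translate --models model_dir --dtype int8 \
        --input input.txt --output output.txt

    # Alternatively, annotate the float32 model with int8 scaling factors
    # ahead of time for fast runtime quantization (flag name assumed).
    python -m sockeye.quantize --model model_dir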

[2.1.5]

Changed

  • Changed state caching for transformer models during beam search to cache states with attention heads already separated out. This avoids repeated transpose operations during decoding, leading to faster inference.

[2.1.4]

Added

[2.1.3]

Changed

  • Performance optimizations to beam search inference:
    • Removed unneeded take ops on encoder states.
    • Input data is now gathered before being sent to the GPU, rather than sending each batch element individually.
    • All of beam search can be done in fp16, if specified by the model.
    • Other miscellaneous small optimizations.
  • Model states are now a flat list in ensemble inference; the structure of the states is provided by state_structure().

[2.1.2]

Changed

Added

  • Added support for CUDA 10.2
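
One way to pick this up is to install the matching MXNet build; the wheel name and install order below are assumptions, so match them to your Sockeye version's requirements:

    # Install the CUDA 10.2 build of MXNet, then Sockeye without
    # pulling in the default (CPU) MXNet dependency.
    pip install mxnet-cu102
    pip install sockeye --no-deps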

Removed

  • Removed support for CUDA<9.1 / CUDNN<7.5

[2.1.1]

Added

  • Ability to set environment variables from training/translate CLIs before MXNet is imported. For example, users can configure MXNet as follows: --env "OMP_NUM_THREADS=1;MXNET_ENGINE_TYPE=NaiveEngine"
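
A usage sketch; apart from --env, the arguments are placeholders (required options such as validation data are omitted):

    # Set MXNet environment variables before MXNet is first imported.
    python -m sockeye.train \
        --env "OMP_NUM_THREADS=1;MXNET_ENGINE_TYPE=NaiveEngine" \
        -s train.src -t train.trg -o model_dir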

[2.1.0]

Changed

  • Version bump, which should have been included in commit b0461b due to incompatible models.

[2.0.1]

Changed

  • Inference defaults to using the max input length observed in training (versus scaling down based on mean length ratio and standard deviations).

Added

  • Additional parameter fixing strategies (see the usage sketch after this list):
    • all_except_feed_forward: Only train feed forward layers.
    • encoder_and_source_embeddings: Only train the decoder (decoder layers, output layer, and target embeddings).
    • encoder_half_and_source_embeddings: Train the latter half of encoder layers and the decoder.
  • Option to specify the number of CPU threads without using an environment variable (--omp-num-threads).
  • More flexible combination of source factors.
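
A usage sketch for the new strategies; --fixed-param-strategy is assumed to be the selecting flag, based on Sockeye's existing parameter-fixing options, and the data arguments are placeholders:

    # Fine-tune only the decoder of an existing model, using 4 CPU threads.
    # (--fixed-param-strategy is an assumption; confirm via sockeye.train --help)
    python -m sockeye.train \
        --params model_dir/params.best \
        --fixed-param-strategy encoder_and_source_embeddings \
        --omp-num-threads 4 \
        -s train.src -t train.trg -o finetuned_model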

[2.0.0]

Changed

  • Updated to MXNet 1.5.0
  • Moved SockeyeModel implementation and all layers to the Gluon API
  • Removed support for Python 3.4
  • Removed the image captioning module
  • Removed the outdated Autopilot module
  • Removed unused training options: the Eve, Nadam, RMSProp, Nag, Adagrad, and Adadelta optimizers, and the fixed-step and fixed-rate-inv-t learning rate schedulers
  • Updated and renamed the learning rate scheduler fixed-rate-inv-sqrt-t -> inv-sqrt-decay
  • Added a script for plotting metrics files: sockeye_contrib/plot_metrics.py
  • Removed the --weight-tying option. Weight tying is now enabled by default; disable it with --weight-tying-type none.

Added

  • Added distributed training support with Horovod/OpenMPI. Use horovodrun and the --horovod training flag (a combined usage sketch follows this list).
  • Added Dockerfiles that build a Sockeye image with all features enabled. See sockeye_contrib/docker.
  • Added none learning rate scheduler (use a fixed rate throughout training)
  • Added linear-decay learning rate scheduler
  • Added training option --learning-rate-t-scale for time-based decay schedulers
  • Added support for MXNet's Automatic Mixed Precision. Activate with the --amp training flag. For best results, make sure as many model dimensions as possible are multiples of 8.
  • Added options for making various model dimensions multiples of a given value. For example, use --pad-vocab-to-multiple-of 8, --bucket-width 8 --no-bucket-scaling, and --round-batch-sizes-to-multiple-of 8 with AMP training.
  • Added GluonNLP's BERTAdam optimizer, an implementation of the Adam variant used by Devlin et al. (2018). Use --optimizer bertadam.
  • Added training option --checkpoint-improvement-threshold to set the amount of metric improvement required over the window of previous checkpoints to be considered actual model improvement (used with --max-num-checkpoint-not-improved).
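
Putting several of these additions together, a sketch of a distributed mixed-precision training run; the data arguments are placeholders, and required options such as validation data are omitted:

    # Launch 4 workers with Horovod/OpenMPI; train with AMP, rounding
    # vocabulary, bucket, and batch sizes to multiples of 8.
    horovodrun -np 4 python -m sockeye.train \
        --horovod \
        --amp \
        --pad-vocab-to-multiple-of 8 \
        --bucket-width 8 --no-bucket-scaling \
        --round-batch-sizes-to-multiple-of 8 \
        --optimizer bertadam \
        -s train.src -t train.trg -o model_dir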