Releases: awslabs/sockeye
[2.1.16]
Fixed
- Fixed batch sizing error introduced in version 2.1.12 (c00da52) that caused batch sizes to be multiplied by the number of devices. Batch sizing now works as documented (same as pre-2.1.12 versions).
- Fixed `max-word` batching to properly size batches to a multiple of both `--batch-sentences-multiple-of` and the number of devices.
[2.1.15]
Added
- Inference option `--mc-dropout` to use dropout during inference, leading to non-deterministic output. This option uses the same dropout parameters present in the model config file.
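A minimal usage sketch (file and directory names are placeholders; only `--mc-dropout` is the new flag here):

```bash
# Dropout stays active at inference time, so repeated runs over the same
# input may produce different translations.
python -m sockeye.translate --models model_dir \
    --input input.txt --output output.txt \
    --mc-dropout
```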
[2.1.14]
Added
- Added `sockeye.rerank` option `--output` to specify the output file.
- Added `sockeye.rerank` option `--output-reference-instead-of-blank` to output the reference line instead of the best hypothesis when the best hypothesis is blank.
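A hypothetical invocation combining both new options (the `--hypotheses` and `--reference` input flag names are assumptions; see the rerank CLI help for exact names):

```bash
# Rerank an N-best list, write results to a file rather than stdout, and
# fall back to the reference whenever the best hypothesis is blank.
python -m sockeye.rerank \
    --hypotheses hypotheses.nbest.json \
    --reference references.txt \
    --output reranked.txt \
    --output-reference-instead-of-blank
```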
[2.1.13]
Added
- Training option `--quiet-secondary-workers` that suppresses console output for secondary workers when training with Horovod/MPI (example below).
- Set version of isort to `<5.0.0` in requirements.dev.txt to avoid incompatibility between newer versions of isort and pylint.
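A sketch of a Horovod run using the new flag (data paths are placeholders; `horovodrun` and `--horovod` are described under 2.0.0 below):

```bash
# Four workers; only the primary worker writes training output to the console.
horovodrun -np 4 python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --output model_dir \
    --horovod --quiet-secondary-workers
```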
[2.1.12]
Added
- Batch type option `max-word` for max number of words including padding tokens (more predictable memory usage than `word`).
- Batching option `--batch-sentences-multiple-of` that is similar to `--round-batch-sizes-to-multiple-of` but always rounds down (more predictable memory usage; see the example after this entry).
Changed
- Default bucketing settings changed to width 8, max sequence length 95 (96 including BOS/EOS tokens), and no bucket scaling.
- Argument `--no-bucket-scaling` replaced with `--bucket-scaling`, which is False by default.
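A sketch combining the new batching options (assuming `max-word` is selected via the existing `--batch-type` flag; paths are placeholders):

```bash
# Size batches by words including padding, rounded down to a multiple of 8.
python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --output model_dir \
    --batch-type max-word --batch-size 4096 \
    --batch-sentences-multiple-of 8
```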
[2.1.11]
Changed
- Updated `sockeye.rerank` module to use "add-k" smoothing for sentence-level BLEU.
Fixed
- Updated `sockeye.rerank` module to use the current N-best format.
[2.1.10]
Changed
- Changed to a cross-entropy loss implementation that avoids the use of SoftmaxOutput.
[2.1.9]
Added
- Added training argument `--ignore-extra-params` to ignore extra parameters when loading models. The primary use case is continuing training with a model that has already been annotated with scaling factors (`sockeye.quantize`).
Fixed
- Properly pass `allow_missing` flag to `model.load_parameters()`.
[2.1.8]
Changed
- Updated to sacrebleu 1.4.10
[2.1.7]
Changed
- Optimized prepare_data by saving the shards in parallel. The prepare_data script accepts a new parameter `--max-processes` to control the level of parallelism with which shards are written to disk.
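A minimal sketch (paths are placeholders):

```bash
# Write prepared-data shards to disk with up to 8 parallel processes.
python -m sockeye.prepare_data \
    --source train.src --target train.trg \
    --output prepared_data \
    --max-processes 8
```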
[2.1.6]
Changed
- Updated Dockerfiles optimized for CPU (intgemm int8 inference, full MKL support) and GPU (distributed training with Horovod). See sockeye_contrib/docker.
Added
- Official support for int8 quantization with intgemm:
  - This requires the "intgemm" fork of MXNet (kpuatamazon/incubator-mxnet/intgemm). This is the version of MXNet used in the Sockeye CPU docker image (see sockeye_contrib/docker).
  - Use `sockeye.translate --dtype int8` to quantize a trained float32 model at runtime.
  - Use the `sockeye.quantize` CLI to annotate a float32 model with int8 scaling factors for fast runtime quantization.
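A sketch of both quantization paths (paths are placeholders; the `sockeye.quantize` argument name is an assumption, so consult its CLI help):

```bash
# Runtime quantization: load a float32 model and run inference in int8.
python -m sockeye.translate --models model_dir --dtype int8 \
    --input input.txt --output output.txt

# Offline annotation: add int8 scaling factors to a float32 model so that
# later runtime quantization is fast ('--model' is assumed; check --help).
python -m sockeye.quantize --model model_dir
```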
[2.1.5]
Changed
- Changed state caching for transformer models during beam search to cache states with attention heads already separated out. This avoids repeated transpose operations during decoding, leading to faster inference.
[2.1.4]
Added
- Added Dockerfiles that build an experimental CPU-optimized Sockeye image:
- Uses the latest versions of kpuatamazon/incubator-mxnet (supports intgemm and makes full use of Intel MKL) and kpuatamazon/sockeye (supports int8 quantization for inference).
- See sockeye_contrib/docker.
[2.1.3]
Changed
- Performance optimizations to beam search inference
- Remove unneeded take ops on encoder states
- Gathering input data before sending to GPU, rather than sending each batch element individually
- All of beam search can be done in fp16, if specified by the model
- Other small miscellaneous optimizations
- Model states are now a flat list in ensemble inference, structure of states provided by `state_structure()`
[2.1.2]
Changed
- Updated to MXNet 1.6.0
Added
- Added support for CUDA 10.2
Removed
- Removed support for CUDA < 9.1 / cuDNN < 7.5
[2.1.1]
Added
- Ability to set environment variables from training/translate CLIs before MXNet is imported. For example, users can configure MXNet as follows: `--env "OMP_NUM_THREADS=1;MXNET_ENGINE_TYPE=NaiveEngine"`
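The inline example above as a full command (paths are placeholders):

```bash
# Both variables are set by the CLI before MXNet is imported.
python -m sockeye.translate --models model_dir \
    --input input.txt --output output.txt \
    --env "OMP_NUM_THREADS=1;MXNET_ENGINE_TYPE=NaiveEngine"
```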
[2.1.0]
Changed
- Version bump, which should have been included in commit b0461b due to incompatible models.
[2.0.1]
Changed
- Inference defaults to using the max input length observed in training (versus scaling down based on mean length ratio and standard deviations).
Added
- Additional parameter fixing strategies (see the sketch after this list):
  - `all_except_feed_forward`: Only train feed forward layers.
  - `encoder_and_source_embeddings`: Only train the decoder (decoder layers, output layer, and target embeddings).
  - `encoder_half_and_source_embeddings`: Train the latter half of encoder layers and the decoder.
- Option to specify the number of CPU threads without using an environment variable (`--omp-num-threads`).
- More flexibility for combining source factors
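A sketch of continuing training with most parameters fixed (the `--fixed-param-strategy` flag is introduced in 1.18.86 below; paths and the use of `--params` to initialize from an existing model are illustrative):

```bash
# Fine-tune only the decoder side of an existing model.
python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --params model_dir/params.best \
    --output finetuned_dir \
    --fixed-param-strategy encoder_and_source_embeddings
```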
[2.0.0]
Changed
- Update to MXNet 1.5.0
- Moved `SockeyeModel` implementation and all layers to Gluon API
- Removed support for Python 3.4.
- Removed image captioning module
- Removed outdated Autopilot module
- Removed unused training options: Eve, Nadam, RMSProp, Nag, Adagrad, and Adadelta optimizers, `fixed-step` and `fixed-rate-inv-t` learning rate schedulers
- Updated and renamed learning rate scheduler `fixed-rate-inv-sqrt-t` -> `inv-sqrt-decay`
- Added script for plotting metrics files: sockeye_contrib/plot_metrics.py
- Removed option `--weight-tying`. Weight tying is enabled by default, disable with `--weight-tying-type none`.
Added
- Added distributed training support with Horovod/OpenMPI. Use `horovodrun` and the `--horovod` training flag.
- Added Dockerfiles that build a Sockeye image with all features enabled. See sockeye_contrib/docker.
- Added `none` learning rate scheduler (use a fixed rate throughout training)
- Added `linear-decay` learning rate scheduler
- Added training option `--learning-rate-t-scale` for time-based decay schedulers
- Added support for MXNet's Automatic Mixed Precision. Activate with the `--amp` training flag. For best results, make sure as many model dimensions as possible are multiples of 8.
- Added options for making various model dimensions multiples of a given value. For example, use `--pad-vocab-to-multiple-of 8`, `--bucket-width 8 --no-bucket-scaling`, and `--round-batch-sizes-to-multiple-of 8` with AMP training (see the sketch after this list).
- Added GluonNLP's BERTAdam optimizer, an implementation of the Adam variant used by Devlin et al. (2018). Use `--optimizer bertadam`.
- Added training option `--checkpoint-improvement-threshold` to set the amount of metric improvement required over the window of previous checkpoints to be considered actual model improvement (used with `--max-num-checkpoint-not-improved`).
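A sketch of AMP training with the multiple-of-8 options combined (paths are placeholders; note that `--no-bucket-scaling` is later replaced by `--bucket-scaling` in 2.1.12 above):

```bash
# Mixed-precision training; dimensions padded/rounded to multiples of 8
# to get the most out of AMP.
python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --output model_dir \
    --amp \
    --pad-vocab-to-multiple-of 8 \
    --bucket-width 8 --no-bucket-scaling \
    --round-batch-sizes-to-multiple-of 8
```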
[1.18.115]
Added
- Added requirements for MXNet compatible with CUDA 10.1.
[1.18.114]
Fixed
- Fixed a bug in prepare_train_data arguments.
[1.18.113]
Fixed
- Added logging arguments for prepare_data CLI.
[1.18.112]
Added
- Option to suppress creation of logfiles for CLIs (`--no-logfile`).
[1.18.111]
Added
- Added an optional checkpoint callback for the train function.
Changed
- Excluded gradients from pickled fields of TrainState
[1.18.110]
Changed
- We now guard against failures to run `nvidia-smi` for GPU memory monitoring.
[1.18.109]
Fixed
- Fixed the metric names by prefixing training metrics with 'train-' and validation metrics with 'val-'. Also restricted the custom logging function to accept only a dictionary and a compulsory global_step parameter.
[1.18.108]
Changed
- More verbose log messages about target token counts.
[1.18.107]
Changed
- Updated to MXNet 1.5.0
[1.18.106]
Added
- Added an optional time limit for stopping training. The training will stop at the next checkpoint after reaching the time limit.
[1.18.105]
Added
- Added support for a custom metrics logger, a function passed as an extra parameter. If supplied, the logger is called during training.
[1.18.104]
Changed
- Implemented an attention-based copy mechanism as described in Jia, Robin, and Percy Liang. "Data recombination for neural semantic parsing." (2016).
- Added a `<ptr\d+>` special symbol to explicitly point at an input token in the target sequence
- Changed the decoder interface to pass both the decoder data and the pointer data.
- Changed the AttentionState named tuple to add the raw attention scores.
[1.18.103]
Added
- Added ability to score image-sentence pairs by extending the scoring feature originally implemented for machine
translation to the image captioning module.
[1.18.102]
Fixed
- Fixed loading of more than 10 source vocabulary files so that they are loaded in the correct numerical order.
[1.18.101]
Changed
- Updated to sacrebleu 1.3.6
[1.18.100]
Fixed
- Always initializing the multiprocessing context. This should fix issues observed when running `sockeye-train`.
[1.18.99]
Changed
- Updated to MXNet 1.4.1
[1.18.98]
Changed
- Converted several transformer-related layer implementations to Gluon HybridBlocks. No functional change.
[1.18.97]
Changed
- Updated to PyYAML 5.1
[1.18.96]
Changed
- Extracted the prepare vocab functionality in the build vocab step into its own function. This matches the pattern in prepare data and train, where main() only does argument parsing and invokes a separate function to do the work. This allows modules that import this one to bypass the command line.
[1.18.95]
Changed
- Removed custom operators from transformer models and replaced them with symbolic operators. This improves performance.
[1.18.94]
Added
- Added ability to accumulate gradients over multiple batches (`--update-interval`). This allows simulation of large batch sizes on environments with limited memory. For example: training with `--batch-size 4096 --update-interval 2` should be close to training with `--batch-size 8192` at a smaller memory footprint.
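The bullet's example as a full command (paths, and word-based batching via `--batch-type`, are placeholders/assumptions):

```bash
# Accumulate gradients over 2 batches of 4096 to approximate batch size 8192.
python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --output model_dir \
    --batch-type word --batch-size 4096 \
    --update-interval 2
```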
[1.18.93]
Fixed
- Made `brevity_penalty` argument in `Translator` class optional to ensure backwards compatibility.
[1.18.92]
Added
- Added sentence length (and length ratio) prediction to be able to discourage hypotheses that are too short at inference time. Can be enabled for training with `--length-task` and with `--brevity-penalty-type` during inference.
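A sketch of the train/translate pair (paths are placeholders; the argument values `ratio` and `learned` are assumptions, so check each flag's help for valid choices):

```bash
# Train with an auxiliary length-ratio prediction task.
python -m sockeye.train \
    --source train.src --target train.trg \
    --validation-source dev.src --validation-target dev.trg \
    --output model_dir \
    --length-task ratio

# Apply the learned brevity penalty to discourage too-short hypotheses.
python -m sockeye.translate --models model_dir \
    --input input.txt --output output.txt \
    --brevity-penalty-type learned
```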
[1.18.91]
Changed
- Multiple lexicons can now be specified with the `--restrict-lexicon` option:
  - For a single lexicon: `--restrict-lexicon /path/to/lexicon`.
  - For multiple lexicons: `--restrict-lexicon key1:/path/to/lexicon1 key2:/path/to/lexicon2 ...`.
  - Use `--json-input` to specify the lexicon to use for each input, ex: `{"text": "some input string", "restrict_lexicon": "key1"}`.
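A sketch of per-input lexicon selection (lexicon paths and keys are placeholders):

```bash
# Register two lexicons under keys, then pick one per input line via JSON.
echo '{"text": "some input string", "restrict_lexicon": "key1"}' | \
python -m sockeye.translate --models model_dir \
    --restrict-lexicon key1:/path/to/lexicon1 key2:/path/to/lexicon2 \
    --json-input
```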
[1.18.90]
Changed
- Updated to MXNet 1.4.0
- Integration tests no longer check for equivalence of outputs with batch size 2
[1.18.89]
Fixed
- Made the change to per-bucket length ratios backwards compatible.
[1.18.88]
Changed
- Made sacrebleu a pip dependency and removed it from `sockeye_contrib`.
[1.18.87]
Added
- Data statistics at training time now compute mean and standard deviation of length ratios per bucket.
This information is stored in the model's config, but not used at the moment.
[1.18.86]
Added
- Added the `--fixed-param-strategy` option that allows fixing various model parameters during training via named strategies. These include some of the simpler combinations from Wuebker et al. (2018), such as fixing everything except the first and last layers of the encoder and decoder (`all_except_outer_layers`). See the help message for a full list of strategies.
[1.18.85]
Changed
- Disabled dynamic batching for `Translator.translate()` by default due to increased memory usage. The default is to fill up batches to `Translator.max_batch_size`. Dynamic batching can still be enabled if `fill_up_batches` is set to False.
Added
- Added parameter to force training to stop after a given number of checkpoints. Useful when forced to share limited GPU resources.
[1.18.84]
Fixed
- Fixed lexical constraints bugs that broke batching and caused large drop in BLEU.
These were introduced with sampling (1.18.64).
[1.18.83]
Changed
- The embedding size is automatically adjusted to the Transformer model size in case it is not specified on the command line.
[1.18.82]
Fixed
- Fixed type conversion in metrics file reading introduced in 1.18.79.
[1.18.81]
Fixed
- Made sure the pickled training state contains the checkpoint decoder's BLEU score of the last checkpoint.
[1.18.80]
Fixed
- Fixed a bug introduced in 1.18.77 where blank lines in the training data resulted in failure.
[1.18.79]
Added
- Added writing of the convergence/divergence status to the metrics file and guarded against numpy.histogram errors for NaNs during divergent behaviour.
[1.18.78]
Changed
- Dynamic batch sizes: `Translator.translate()` will adjust batch size in beam search to the actual number of inputs without using padding.
[1.18.77]
Added
- `sockeye.score` now loads data on demand and doesn't skip any input lines.
[1.18.76]
Changed
- Do not compare scores from translation and scoring in integration tests.
Added
- Added the flag `--stop-training-on-decoder-failure` to stop training in case the checkpoint decoder dies (e.g. because there is not enough memory). If this is turned on, a checkpoint decoder is launched right when training starts in order to fail as early as possible.
[1.18.75]
Changed
- Do not create dropout layers for inference models for performance reasons.
[1.18.74]
Changed
- Revert change in 1.18.72 as no memory saving could be observed.
[1.18.73]
Fixed
- Fixed a bug where `source-factors-num-embed` was not correctly adjusted to `num-embed` when using prepared data & `source-factor-combine` sum.