Transformer models in Sockeye (#98)
* Positional encodings and initial arguments for transformer

* Stub for TransformerEncoder

* WIP self attention

* ffn

* Unmasked self-attention prototype

* cleaned up code. Still not tested

* Put things together so we can run and debug, some cleanup

* Separate layer construction from application in encoder

* Added masking for self-attention

* More fixes, now runs on CPUs with default args

* removed unused code

* fix inference for transformer

* docstrings

* Added Multi-head dot attention for the actual attention mechanism. Enable with --attention-type mhdot

* fixed existing tests

* Import fix

* Precompute positional encodings in variable initialization

* temporary fix. Will change later

* Pass max_seq_len to Embedding if needed for positional encodings

* fix import

* more control over positional encodings

* Fix masking for MultiheadAttention

* Fix nasty bug with layer normalization quietly accepting 3d input.

* WIP: decoder

* Added transformer test

* WIP full transformer with decoder. Inference and RNN are currently broken, work-in-progress

* fix auto-regressive bias

* Revised Configs and Decoder interface

* moved attention into (rnn) decoder

* Defined proper Decoder interface for inference. Rewrote RecurrentDecoder to adhere to the new interface.

* Fixed bias variable/length problem by writing a custom operator

* custom operator for positional encodings

* added integration tests

* improve consistency

* Fixed a last bug in inference regarding lengths. All tests pass now

* Bump version

* Update tests

* Make mypy happy

* Support transformer with convolutional embedding encoder

* Fix to actually use layer normalization

* Allow projecting segment embeddings to arbitrary size

* Typo fix

* Correct path in the PyPI upload documentation. (#92)

* Uniform weight initialization. (#93)

* Added transformer dropout

* Learning rate warmup

* fix

* Changed eps for Layer Normalization

* Docstrings and cleanup

* Better coverage for ConvolutionalEmbeddingEncoder

* warmup WIP

* Fix Travis builds

* Removed source_length from inference code. Is now computed in the encoder graph

* Added transformer module to doc generation

* small fixes

* fixed doc generation

* Fix tests

* Refactored the read_metrics_file method to separate out its multiple responsibilities. The new read_metrics_file can now easily be reused elsewhere, e.g. for offline analysis.

* Removed old method

* Fixed argument description

* revised arguments according to David's & Tobi's comments

* Fix system tests

* Removed duplicate query scaling in DotAttention

* addressed Tobi's comments

* pass correct argument to rnn attention num heads

* Moved check for batch2timemajor encoder being last encoder to encoder sequence

* Fixed RNN decoder after decoder rewrite.

* fix #2

* Do not truncate the metrics file in the callback_monitor constructor. Restructured saving and loading of the metrics file to make it consistent.

* make pylint happy

* addressed Tobi's comments

* Test averaging in integration/system tests

* Addressed Tobi's (last?) comments

* revised abstract class

* addressed Tobi's comments
fhieber authored Aug 14, 2017
1 parent 533fae1 commit 3f77dfb
Showing 31 changed files with 2,250 additions and 772 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -26,7 +26,7 @@

class MockClass(MagicMock):

-    def __init__(self, name, *args, **kwargs):
+    def __init__(self, name="", *args, **kwargs):
super().__init__(*args, **kwargs)
self.name = name

8 changes: 7 additions & 1 deletion docs/modules.rst
@@ -134,14 +134,20 @@ sockeye.rnn module
:members:
:show-inheritance:


sockeye.training module
-----------------------

.. automodule:: sockeye.training
:members:
:show-inheritance:

sockeye.transformer module
--------------------------

.. automodule:: sockeye.transformer
:members:
:show-inheritance:

sockeye.utils module
--------------------

2 changes: 1 addition & 1 deletion sockeye/__init__.py
@@ -11,4 +11,4 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

-__version__ = '1.1.1'
+__version__ = '1.2.1'
107 changes: 88 additions & 19 deletions sockeye/arguments.py
@@ -175,7 +175,32 @@ def add_model_parameters(params):
choices=C.ENCODERS,
default=C.RNN_NAME,
help="Type of encoder. Default: %(default)s.")
model_params.add_argument('--decoder',
choices=C.DECODERS,
default=C.RNN_NAME,
help="Type of encoder. Default: %(default)s.")

model_params.add_argument('--num-layers',
type=int_greater_or_equal(1),
default=1,
help='Number of layers for encoder & decoder architectures. Default: %(default)s.')
model_params.add_argument('--encoder-num-layers',
type=int,
default=None,
help='Number of layers for encoder architectures. Overrides --num-layers. '
'Default: %(default)s.')
model_params.add_argument('--decoder-num-layers',
type=int,
default=None,
help='Number of layers for decoder architectures. Overrides --num-layers. '
'Default: %(default)s.')
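
# --- Illustrative sketch (not part of this diff): how the layer-count flags
# above are assumed to interact. --encoder-num-layers / --decoder-num-layers,
# when given, override the shared --num-layers value; the helper name below is
# hypothetical.
def resolve_num_layers(num_layers, encoder_num_layers=None, decoder_num_layers=None):
    encoder = encoder_num_layers if encoder_num_layers is not None else num_layers
    decoder = decoder_num_layers if decoder_num_layers is not None else num_layers
    return encoder, decoder
# e.g. resolve_num_layers(6) -> (6, 6); resolve_num_layers(6, decoder_num_layers=2) -> (6, 2)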

model_params.add_argument('--conv-embed-output-dim',
type=int_greater_or_equal(1),
default=None,
help="Project segment embeddings to this size for ConvolutionalEmbeddingEncoder. Omit to"
" avoid projection, leaving segment embeddings total size of all filters. Default:"
" %(default)s.")
model_params.add_argument('--conv-embed-max-filter-width',
type=int_greater_or_equal(1),
default=8,
@@ -185,7 +210,7 @@ def add_model_parameters(params):
type=int,
default=(200, 200, 250, 250, 300, 300, 300, 300),
help="List of number of filters of each width 1..max for ConvolutionalEmbeddingEncoder. "
"Default: %(default)s.")
"Default: %(default)s.")
model_params.add_argument('--conv-embed-pool-stride',
type=int_greater_or_equal(1),
default=5,
@@ -194,11 +219,13 @@ def add_model_parameters(params):
type=int_greater_or_equal(0),
default=4,
help="Number of highway layers for ConvolutionalEmbeddingEncoder. Default: %(default)s.")
model_params.add_argument('--conv-embed-add-positional-encodings',
action='store_true',
default=False,
help="Add positonal encodings to final segment embeddings for"
" ConvolutionalEmbeddingEncoder. Default: %(default)s.")

- model_params.add_argument('--rnn-num-layers',
-                           type=int_greater_or_equal(1),
-                           default=1,
-                           help='Number of layers for encoder and decoder. Default: %(default)s.')
# rnn arguments
model_params.add_argument('--rnn-cell-type',
choices=C.CELL_TYPES,
default=C.LSTM_TYPE,
@@ -212,7 +239,30 @@ def add_model_parameters(params):
default=False,
help="Add residual connections to stacked RNNs if --rnn-num-layers > 3. "
"(see Wu ETAL'16). Default: %(default)s.")
model_params.add_argument('--rnn-context-gating', action="store_true",
help="Enables a context gate which adaptively weighs the RNN decoder input against the "
"source context vector before each update of the decoder hidden state.")

# transformer arguments
model_params.add_argument('--transformer-model-size',
type=int_greater_or_equal(1),
default=512,
help='Size of all layers and embeddings when using transformer. Default: %(default)s.')
model_params.add_argument('--transformer-attention-heads',
type=int_greater_or_equal(1),
default=8,
help='Number of heads for all self-attention when using transformer layers. '
'Default: %(default)s.')
model_params.add_argument('--transformer-feed-forward-num-hidden',
type=int_greater_or_equal(1),
default=2048,
help='Number of hidden units in feed forward layers when using transformer. '
'Default: %(default)s.')
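
# --- Illustrative sketch (not part of this diff): what --transformer-model-size,
# --transformer-attention-heads and --transformer-feed-forward-num-hidden
# parameterize, following the "Attention Is All You Need" formulation this
# commit implements. A NumPy stand-in with random weights, not the Sockeye code.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def autoregressive_bias(seq_len):
    # large negative bias above the diagonal: decoder self-attention cannot
    # attend to future target positions (the "auto-regressive bias" fixed above)
    return np.triu(np.full((seq_len, seq_len), -1e9), k=1)

def multi_head_self_attention(x, num_heads=8, bias=None):
    seq_len, model_size = x.shape           # model_size = --transformer-model-size
    head_dim = model_size // num_heads      # num_heads = --transformer-attention-heads
    w_q, w_k, w_v, w_o = (np.random.randn(model_size, model_size) * 0.01 for _ in range(4))

    def split_heads(m):                     # (seq, size) -> (heads, seq, head_dim)
        return m.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # scaled dot product
    if bias is not None:                    # e.g. autoregressive_bias(seq_len)
        logits = logits + bias
    out = softmax(logits) @ v               # (heads, seq, head_dim)
    return out.transpose(1, 0, 2).reshape(seq_len, model_size) @ w_o

def feed_forward(x, num_hidden=2048):       # num_hidden = --transformer-feed-forward-num-hidden
    model_size = x.shape[-1]
    w1 = np.random.randn(model_size, num_hidden) * 0.01
    w2 = np.random.randn(num_hidden, model_size) * 0.01
    return np.maximum(0., x @ w1) @ w2      # relu, then project back to model_size
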
model_params.add_argument('--transformer-no-positional-encodings',
action='store_true',
help='Do not use positional encodings.')
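
# --- Illustrative sketch (not part of this diff): the sinusoidal positional
# encodings that --transformer-no-positional-encodings turns off. The commits
# above precompute these at variable initialization (via a custom operator);
# this NumPy version assumes the Vaswani et al. (2017) formulation and an even
# model_size.
def positional_encodings(max_seq_len, model_size):
    positions = np.arange(max_seq_len)[:, np.newaxis]        # (seq, 1)
    dims = np.arange(0, model_size, 2)[np.newaxis, :]        # (1, size/2)
    angles = positions / np.power(10000., dims / model_size)
    pe = np.zeros((max_seq_len, model_size))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe                      # added to the embeddings before the first layer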

# embedding arguments
model_params.add_argument('--num-embed',
type=int_greater_or_equal(1),
default=512,
@@ -226,15 +276,18 @@ def add_model_parameters(params):
default=None,
help='Embedding size for target tokens. Overrides --num-embed. Default: %(default)s')

# attention arguments
model_params.add_argument('--attention-type',
choices=C.ATT_TYPES,
default=C.ATT_MLP,
- help='Attention model. Choices: {%(choices)s}. '
+ help='Attention model for RNN decoders. Choices: {%(choices)s}. '
'Default: %(default)s.')
model_params.add_argument('--attention-num-hidden',
default=None,
type=int,
help='Number of hidden units for attention layers. Default: equal to --rnn-num-hidden.')
model_params.add_argument('--attention-use-prev-word', action="store_true",
help="Feed the previous target embedding into the attention mechanism.")

model_params.add_argument('--attention-coverage-type',
choices=["tanh", "sigmoid", "relu", "softrelu", "gru", "count"],
Expand All @@ -246,7 +299,10 @@ def add_model_parameters(params):
model_params.add_argument('--attention-coverage-num-hidden',
type=int,
default=1,
help="Number of hidden units for coverage vectors. Default: %(default)s")
help="Number of hidden units for coverage vectors. Default: %(default)s.")
model_params.add_argument('--attention-mhdot-heads',
type=int, default=None,
help='Number of heads for Multi-head dot attention. Default: %(default)s.')

model_params.add_argument('--lexical-bias',
default=None,
@@ -276,22 +332,18 @@ def add_model_parameters(params):
model_params.add_argument('--max-seq-len-source',
type=int_greater_or_equal(1),
default=None,
- help='Maximum source sequence length in tokens. Overrides --max-seq-len. Default: %(default)s'
+ help='Maximum source sequence length in tokens. '
+ 'Overrides --max-seq-len. Default: %(default)s'
model_params.add_argument('--max-seq-len-target',
type=int_greater_or_equal(1),
default=None,
- help='Maximum target sequence length in tokens. Overrides --max-seq-len. Default: %(default)s'
+ help='Maximum target sequence length in tokens. '
+ 'Overrides --max-seq-len. Default: %(default)s'

- model_params.add_argument('--attention-use-prev-word', action="store_true",
-                           help="Feed the previous target embedding into the attention mechanism.")
-
- model_params.add_argument('--context-gating', action="store_true",
-                           help="Enables a context gate which adaptively weighs the decoder input against the"
-                           "source context vector before each update of the decoder hidden state.")

model_params.add_argument('--layer-normalization', action="store_true",
- help="Adds layer normalization before non-linear activations of 1) MLP attention, "
- "2) decoder RNN state initialization, and 3) RNN hidden state. "
+ help="Adds layer normalization before non-linear activations. "
+ "This includes MLP attention, RNN decoder state initialization, "
+ "RNN decoder hidden state, and transformer layers. "
  "It does not normalize RNN cell activations "
  "(this can be done using the '%s' or '%s' rnn-cell-types)." % (C.LNLSTM_TYPE,
                                                                 C.LNGLSTM_TYPE))
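
# --- Illustrative sketch (not part of this diff): layer normalization over the
# hidden axis, which --layer-normalization enables and which the transformer
# layers use. One commit above fixes a bug where 3d input was quietly accepted,
# hence the explicit shape check; the PR also changed eps (the value here is
# only a placeholder).
def layer_norm(x, gamma, beta, eps=1e-6):
    assert x.ndim == 2, "expects (batch, hidden); normalize over the hidden axis only"
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta   # learned gain and bias
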
@@ -367,6 +419,18 @@ def add_training_args(params):
default=0.,
help='Dropout probability for source embedding and source and target RNNs. '
'Default: %(default)s.')
train_params.add_argument('--transformer-dropout-attention',
type=float,
default=0.,
help='Dropout probability for multi-head attention. Default: %(default)s.')
train_params.add_argument('--transformer-dropout-relu',
type=float,
default=0.,
help='Dropout probability before relu in feed-forward block. Default: %(default)s.')
train_params.add_argument('--transformer-dropout-residual',
type=float,
default=0.,
help='Dropout probability for residual connections. Default: %(default)s.')
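
# --- Illustrative sketch (not part of this diff): where the three dropout flags
# above are assumed to act. --transformer-dropout-attention drops attention
# probabilities, --transformer-dropout-relu drops the relu activations inside
# the feed-forward block, and --transformer-dropout-residual drops a sublayer's
# output before the residual add.
def dropout(x, p, training=True):
    if not training or p == 0.:
        return x
    keep = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * keep / (1. - p)          # inverted dropout: no rescaling at test time

def residual(x, sublayer_output, p_residual):
    return x + dropout(sublayer_output, p_residual)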

train_params.add_argument('--optimizer',
default='adam',
@@ -418,7 +482,12 @@ def add_training_args(params):
type=float,
default=10,
help="Half-life of learning rate in checkpoints. For 'fixed-rate-*' "
"learning rate schedulers. Default: 10.")
"learning rate schedulers. Default: %(default)s.")
train_params.add_argument('--learning-rate-warmup',
type=int,
default=0,
help="Number of warmup steps. If set to x, linearly increases learning rate from 10%% "
"to 100%% of the initial learning rate. Default: %(default)s.")

train_params.add_argument('--use-fused-rnn',
default=False,
