diff --git a/fast-beam/fast-beam.pdf b/fast-beam/fast-beam.pdf
index 316f791..a6a31aa 100644
Binary files a/fast-beam/fast-beam.pdf and b/fast-beam/fast-beam.pdf differ
diff --git a/fast-beam/fast-beam.tex b/fast-beam/fast-beam.tex
index 4be003c..6ab84be 100644
--- a/fast-beam/fast-beam.tex
+++ b/fast-beam/fast-beam.tex
@@ -25,7 +25,7 @@
 \begin{document}
 % \mtsummitHeader{x}{x}{xxx-xxx}{2016}{45-character paper description goes here}{Author(s) initials and last name go here}
 
-\title{\bf Faster Beam Search for Neural Machine Translation}
+\title{\bf Faster Neural Machine Translation Inference}
 
 \author{\name{\bf Hieu Hoang} \hfill \addr{hieu@hoang.co.uk}\\
 \addr{}
@@ -62,9 +62,13 @@ \section{Introduction and Prior Work}
 \section{Proposal}
 \label{sec:Proposal}
 
-We based our model on the sequence-to-sequence model of (??? Cho) for machine translation models, but unlike (??? Devlin), we avoid solutions for the specific model. Therefore, our solution should be applicable to other models, architectures and task which have the similar characteristics. We envisage that our solutions would be of value to models used in text summarization, chatbot or image captioning.
+We will look at two areas that are critical to fast NMT inference, in which the models used in NMT differ significantly from those in other applications. These areas have been overlooked by the general deep-learning community; we aim to improve their efficiency for NMT-specific tasks.
 
-We also choose to focus on the use of GPU, rather than CPU as pursued in (??? Devlin).
+Firstly, the number of classes in many deep-learning applications is small. However, the number of classes in NMT models is typically in the tens or hundreds of thousands, corresponding to the vocabulary size of the output language. For example, the best German-English NMT model at WMT14 contains 85,000 classes (Rico ???). This makes the output layer of NMT models very computationally expensive. ??? shows the breakdown of translation time for an RNN model similar to that of (Rico ???); over 40\% of the time is spent calculating the activation, the softmax and the beam search. We will look at optimizations which explicitly target the output layer.
+
+Secondly, mini-batching is often used to increase the speed of deep-learning applications, including NMT systems. (??? graph of batch v. speed) shows that using mini-batching can increase speed by 17 times. However, mini-batching does not take into account the variable lengths of NMT inputs and outputs, creating computational issues and efficiency challenges. The computational issues are often solved by masking unwanted calculations. Translation speed can often be improved by employing max-batching, whereby the input set is pre-sorted by sentence length before mini-batching, creating mini-batches of similar-length input sentences. However, target sentence lengths will still differ even for similar-length inputs, yet the standard mini-batching algorithm must continue to process the batch until all target sentences have been completed. (??? batch size v. iterations) shows the number of sentences still being processed at each decoding iteration, for a specific batch. The number of sentences still to be processed decreases as decoding proceeds, reducing the effectiveness of mini-batching. We will propose an alternative to the mini-batching algorithm.
+
+We base our model on the sequence-to-sequence model of (??? Cho) for machine translation, but unlike (??? Devlin), we avoid solutions that are specific to one model. Therefore, our solutions should be applicable to other models, architectures and tasks which have similar characteristics. We envisage that our solutions would be of value to models used in text summarization, chatbots or image captioning. We also choose to focus on the use of the GPU, rather than the CPU as pursued in (??? Devlin).
 
 \subsection{Softmax and Beam Search Fusion}
@@ -76,13 +80,7 @@ \subsection{Softmax and Beam Search Fusion}
 \item \vspace{-2 mm} a search for the argmax output class, and probability is necessary.
 \end{enumerate}
 
-In models with a small number of classes such as binary classification, the calculation of softmax and argmax is trivial and fast. The activation and bias term steps are also fast unless the dimensionality is large.
-
-However, the output layer of most machine translation models frequently contains tens of thousand, if not hundred of thousands, classes corresponding to the vocabulary of the output language. For example, the best German-English system for WMT 2014 (??? Rico) contains 85,000 subword units in the target vocabulary. In these scenarios, the trivial procedure used to calculate the output layer takes up a significant amount of time. Table ???? shows the percentage of the total translationt time each step takes in our baseline system (described in Section ???).
-
-??? Table of percentage each step takes
-
-In addition, we are interested not just in the argmax value but of beam search for the n-best classes and their probabilities, further complicating the problem.
+In models with a small number of classes such as binary classification, the computational effort required is trivial and fast. However, this is not the case for the large numbers of classes found in NMT models.
 
 We shall leave the matrix multiplication for the next proposal and future work, and concentrate on the last three steps. There are algorithmic similarities between these steps, which we outline in ???.
 
diff --git a/mtma17/graphs.ods b/mtma17/graphs.ods
index 6059672..5715aa0 100644
Binary files a/mtma17/graphs.ods and b/mtma17/graphs.ods differ
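As an illustration of the max-batching scheme described in the added proposal text, the following is a minimal Python sketch. The function and variable names are our own and not taken from any actual implementation: it pre-sorts the input by source-sentence length before cutting it into mini-batches, so that sentences of similar length are decoded together.

    def max_batch(sentences, batch_size):
        # Sort sentence indices by source length so each mini-batch holds
        # similar-length sentences; keeping indices (not sentences) lets the
        # translations be restored to the original input order afterwards.
        order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
        return [order[i:i + batch_size]
                for i in range(0, len(order), batch_size)]

    # e.g. max_batch(["a b c".split(), "a".split(), "a b".split()], 2)
    # -> [[1, 2], [0]]

Even with this pre-sorting, target lengths within a batch still differ, so the standard algorithm keeps iterating until the last sentence in the batch finishes; the alternative the proposal promises targets exactly that residual inefficiency.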
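The algorithmic similarity between the last three output-layer steps can also be made concrete. The Python sketch below is only our illustration, assuming a plain softmax output layer and using a streaming log-sum-exp, which is our own choice of technique rather than the paper's GPU implementation: the bias addition, the accumulation of a numerically stable softmax normalizer, and the maintenance of the n-best candidates all share one sweep over the vocabulary, instead of one sweep per step.

    import heapq
    import math

    def fused_output_layer(scores, bias, beam_size):
        # One sweep over the (very large) vocabulary fuses three steps:
        # adding the bias term, accumulating the softmax normalizer, and
        # keeping the beam_size best classes.
        m = -math.inf   # running maximum of the biased scores
        s = 0.0         # running sum of exp(v - m)
        heap = []       # min-heap of the beam_size best (score, class) pairs
        for cls, (raw, b) in enumerate(zip(scores, bias)):
            v = raw + b                        # bias term
            if v > m:                          # streaming log-sum-exp update
                s = s * math.exp(m - v) + 1.0
                m = v
            else:
                s += math.exp(v - m)
            if len(heap) < beam_size:          # n-best search
                heapq.heappush(heap, (v, cls))
            elif v > heap[0][0]:
                heapq.heapreplace(heap, (v, cls))
        log_z = m + math.log(s)                # softmax normalizer
        # Only the n-best classes ever have their probabilities materialized.
        return [(cls, math.exp(v - log_z))
                for v, cls in sorted(heap, reverse=True)]

In a GPU kernel the same fusion means the wide output vector, 85,000 entries in the cited model, is read once per decoding step rather than once per output-layer stage.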