DeepSpeech2 benchmark technical details #1
Hi Soumith,

intermittently - Erich and David can fill in the details. Thanks, Shubho

On Friday, June 3, 2016, Soumith Chintala wrote:

> awesome thanks.
The network specs are as follows:

```json
{
  "connectivity": [
    "conv2d_1",
    "conv2d_2",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "fc",
    "ctc"
  ],
  "layers": {
    "bd": {
      "batch_norm": true,
      "dim": 1760,
      "type": "RecurrentLinear"
    },
    "conv2d_1": {
      "batch_norm": true,
      "channels": 1,
      "context_h": 5,
      "context_w": 20,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 2,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "conv2d_2": {
      "batch_norm": true,
      "channels": 32,
      "context_h": 5,
      "context_w": 10,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 1,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "ctc": {
      "type": "CTCCostLinear"
    },
    "fc": {
      "batch_norm": true,
      "dim": 1760,
      "type": "FullyConnected"
    }
  }
}
```

The raw input is a spectrogram that is 161 x (minibatch x time). The `bd` layers are bi-directional vanilla RNNs. The `CTCCostLinear` layer includes a linear transform to the alphabet size followed by a softmax; in English the alphabet size is 29. The criterion is a CTC loss computed in log space. All non-linearities are clipped ReLU units (max of 20). I will update this with the dataset information soon.
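To make the activation concrete, here is a minimal NumPy sketch of the clipped ReLU described above (an illustrative snippet, not part of the benchmark code):

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    """Clipped ReLU: min(max(x, 0), 20), applied elementwise."""
    return np.minimum(np.maximum(x, 0.0), clip)

print(clipped_relu(np.array([-5.0, 3.0, 25.0])))  # [ 0.  3. 20.]
```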
The dataset should be drawn from the following distribution:
Each second corresponds to 100 input timesteps, as we use a 10 ms step.
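As a quick sanity check on the unit convention (an illustrative helper, not part of the benchmark):

```python
STEP_MS = 10  # non-overlapping 10 ms spectrogram slices

def seconds_to_timesteps(seconds):
    # 1 s of audio -> 100 input timesteps at a 10 ms step
    return int(seconds * 1000) // STEP_MS

print(seconds_to_timesteps(1))   # 100
print(seconds_to_timesteps(15))  # 1500
```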
@ekelsen thanks for the specs! Could we get some information on how you chose the dataset specification?

It is similar to the distribution of one of our training sets.

What is the proper procedure for this benchmark? Are we to generate benchmarks for different input data lengths (1 s, 2 s, 3 s, ..., 15 s) and then take a weighted average of the runtimes using the distribution above?
One could generate a training sample of different data lengths using that. Shubho

On Mon, Jun 13, 2016 at 2:46 PM, nervetumer wrote:

Just wanted to clarify that the benchmark can't test convergence at all -

On Mon, Jun 13, 2016 at 10:56 PM, Shubho Sengupta wrote:
I agree we could do that, but then everyone would be benchmarking a different data set. It may not matter much for an epoch over a large data set, but it seems like we should try to minimize the differences between all the benchmarks. So if we go this route, maybe we should have a small Python script here, with a fixed random number seed and a platform-independent random number generator, that generates the sequence lengths? Or we should choose a publicly available dataset instead of using statistics from a private dataset.

I think we should have a script with a fixed seed to nail down the dataset. Shubho

On Tuesday, June 14, 2016, nervetumer wrote:
I think that having to use real data for performance benchmarks is somewhat annoying and best avoided if possible. @shubho will provide a Python script to generate the dataset (input spectrograms and labels) so that the expected behavior is clear. I really don't think the exact floating point numbers and labels that are chosen will have any impact on the performance (we can test by changing the seed in Python), so if it is easier to generate the dataset in a different language, that should be fine as long as the distribution is the same.

The minibatch is generally chosen to be the largest possible for the longest sequence length, as we tend to keep the minibatch size constant during optimization. (There is work on variable minibatches, but that isn't common, so I don't think it makes sense for the benchmark.) There are some odd performance cliffs when using cuBLAS, like going from a minibatch of 8 to 9, and a minibatch of 96 significantly underperforms a minibatch of 64, so choosing the minibatch that is fastest overall might require some tuning. The Nervana kernels mostly don't have these problems.

I doubt any of the frameworks will be able to exceed a global minibatch of 1024, even on 8 GPUs. It is true that in practice we notice degraded optimization performance beyond this minibatch size, but for benchmarking purposes I don't think we need to worry about that.
The following script should be a reasonable generator of random data for this benchmark. The distribution of utterance lengths is fixed and does not depend on a random number generator, and the generation itself should be quite fast, so it will not affect overall benchmark timing. If the chosen minibatch size does not evenly divide the utterance counts, then the last minibatch at a given utterance length will be smaller than usual. This is not exactly the same behavior as a real training system, where we would lump together sequences of different lengths; if people would prefer that behavior, let us know.

```python
import numpy as np


class DataGenerator:
    """Generates DS2 test data for the DeepMark benchmark.

    Each iteration yields:
    - the utterance length in 10 ms slices (e.g. 1000 for a 10 s
      utterance),
    - a spectrogram filled with random input: a 2-D NumPy array of
      shape 161 x (utt_length * mb_size), where mb_size is the
      user-supplied minibatch size,
    - random label data: a 1-D NumPy array whose length corresponds
      to the utterance length.

    If mb_size does not evenly divide the utterance count, the last
    minibatch for a particular utt_length may be smaller than mb_size.
    """

    # Utterance lengths are in numbers of non-overlapping 10 ms slices
    _utt_lengths = [100, 200, 300, 400, 500, 600, 700,
                    800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
    _counts = [3, 10, 11, 13, 14, 13, 9,
               8, 5, 4, 3, 2, 2, 2, 1]
    _label_lengths = [7, 17, 35, 48, 62, 78, 93, 107,
                      120, 134, 148, 163, 178, 193, 209]
    _freq_bins = 161
    # 29 characters in the English dataset - all equally likely to be
    # selected for now
    _prob_chars = [1 / 29.] * 29
    _chars = list(range(29))
    # number of utterances to generate for a count of 1
    _scale_factor = 10 * 128
    # extra space to allow for different minibatch data even though
    # we only generate one pool of random numbers, for speed
    _extra = 1000

    def __init__(self, minibatch_size):
        self._current = 0
        self._mb_size = minibatch_size
        # Generate all the utterance counts that we need
        self._utt_counts = [self._scale_factor * x for x in self._counts]
        # Only generate random data once so that data generation is as
        # fast as possible and doesn't interfere with benchmark timing
        self._randomness = np.random.randn(
            self._freq_bins,
            minibatch_size * self._utt_lengths[-1] + self._extra
        ).astype(np.float32)

    def __iter__(self):
        return self

    def __next__(self):
        if self._current >= len(self._utt_counts):
            raise StopIteration
        # Consume up to a minibatch of utterances at the current length
        if self._utt_counts[self._current] > self._mb_size:
            mb_size = self._mb_size
            self._utt_counts[self._current] -= self._mb_size
            inc = 0
        else:
            mb_size = self._utt_counts[self._current]
            self._utt_counts[self._current] = 0
            inc = 1
        utt_length = self._utt_lengths[self._current]
        label_length = self._label_lengths[self._current]
        # Slice a random window out of the pre-generated pool
        start = np.random.randint(
            0, self._extra + self._mb_size *
            (self._utt_lengths[-1] - self._utt_lengths[self._current])
        )
        end = start + utt_length * mb_size
        self._current += inc
        return (utt_length,
                self._randomness[:, start:end],
                np.random.choice(self._chars, label_length,
                                 p=self._prob_chars))

    next = __next__  # Python 2 compatibility
```
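Plugging the generator's distribution constants into a quick calculation gives the overall size of the synthetic dataset (an illustrative check, not part of the script):

```python
# Distribution constants copied from the generator above
counts = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]
utt_lengths = [100 * (i + 1) for i in range(15)]  # 1 s .. 15 s
scale = 10 * 128  # utterances generated per unit of count

total_utts = scale * sum(counts)
total_steps = scale * sum(c * l for c, l in zip(counts, utt_lengths))
total_hours = total_steps * 0.01 / 3600  # 10 ms per timestep
avg_seconds = total_steps / total_utts * 0.01

print(total_utts, round(total_hours, 1), round(avg_seconds, 2))
# 128000 utterances, about 211.2 hours of audio, mean length 5.94 s
```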
Sounds great, thanks @ekelsen! Not sure what would be a better fit for the Torch benchmark: should I use a library to access the above Python code from Lua, or rewrite the class in Lua? I personally prefer to rewrite, but whatever is more appropriate!
I feel rewriting is fine - the important parts are the distribution. Shubho

On Wednesday, June 22, 2016, Sean Naren wrote:
@shubho thanks, a bit confused as to how the generator is to be used. Do the steps below cover what the benchmark using the generator is supposed to be?
Yeah, and you can choose the appropriate minibatch that gives you the Shubho

On Thursday, June 23, 2016, Sean Naren wrote:
Awesome, so just to summarise: the benchmark is the average forward/backward/forward+backward time taken to run through the entire synthetic dataset using the DS2 architecture (using the fastest batch size). Steps are:

Sorry if this is obvious, just trying to nail down the details :)
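A minimal sketch of the timing loop being discussed, assuming a hypothetical `step_fn` that runs one forward+backward pass on a minibatch (framework-specific details like GPU synchronization are omitted):

```python
import time

def run_epoch(step_fn, data_iter):
    """Time one pass over the dataset iterator, minibatch by minibatch.

    step_fn is a hypothetical callable taking one generated batch;
    returns (total seconds, number of minibatches).
    """
    total = 0.0
    n_batches = 0
    for batch in data_iter:
        start = time.time()
        step_fn(batch)
        total += time.time() - start
        n_batches += 1
    return total, n_batches

# Example with a no-op step over a stand-in iterator:
total_s, n = run_epoch(lambda batch: None, iter(range(5)))
```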
I usually report total time for forward and back prop, number of Shubho

On Thursday, June 23, 2016, Sean Naren wrote:

Consistency is more important than what I suggested. Shubho

On Thursday, June 23, 2016, Sean Naren wrote:
Interested to see how the other DS2 benchmarks are progressing - any news, guys?

I am planning to get to the internal one this weekend. Shubho

On Thu, Jul 7, 2016 at 5:46 AM, Sean Naren wrote:
Shall we start benchmarking numbers? Just to confirm: we are measuring the time it takes to run through the entire dataset iterator?

@SeanNaren yeah, benchmarking through the dataset iterator sounds right. Time to benchmark!
Some preliminary results to get the ball rolling: I benchmarked 1x Titan and 4x Titan setups of the Torch implementation (shoutout to @digitalreasoning for letting me use their servers!) with 5 epochs of the dataset.

I could only manage a batch of 32 per GPU in memory for the epoch.

I suggest adding a column for multi-GPU scaling to the final presentation.
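Such a column is usually derived from the single-GPU and multi-GPU runtimes; a small sketch of the standard speedup/efficiency calculation (the numbers below are made up for illustration):

```python
def scaling(t_single, t_multi, n_gpus):
    """Speedup and parallel efficiency relative to a 1-GPU run."""
    speedup = t_single / t_multi
    return speedup, speedup / n_gpus

s, e = scaling(100.0, 30.0, 4)  # speedup ~3.33, efficiency ~0.83
```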
@SeanNaren, any particular reason you are using nn.BatchNormalization in BatchBRNN.lua instead of cudnn.BatchNormalization? Thanks for your work on this!

@ngimel cudnn.BatchNormalization only supports batch sizes < 1024 in inference mode; I think that is the point.
@SeanNaren, there are a few ways:

@ngimel sounds fair, will modify the benchmark to use cuDNN BatchNorm!
Hey @shubho, can you give some technical details on the DeepSpeech2 benchmark so that the others can implement it to your exact spec?

Some details:

cc: @SeanNaren @delta2323