DeepSpeech2 benchmark technical details #1
Hi Soumith,

intermittently - Erich and David can fill in the details. Thanks, Shubho

On Friday, June 3, 2016, Soumith Chintala wrote:

> awesome thanks.
The network specs are as follows:

```json
{
  "connectivity": [
    "conv2d_1",
    "conv2d_2",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "fc",
    "ctc"
  ],
  "layers": {
    "bd": {
      "batch_norm": true,
      "dim": 1760,
      "type": "RecurrentLinear"
    },
    "conv2d_1": {
      "batch_norm": true,
      "channels": 1,
      "context_h": 5,
      "context_w": 20,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 2,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "conv2d_2": {
      "batch_norm": true,
      "channels": 32,
      "context_h": 5,
      "context_w": 10,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 1,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "ctc": {
      "type": "CTCCostLinear"
    },
    "fc": {
      "batch_norm": true,
      "dim": 1760,
      "type": "FullyConnected"
    }
  }
}
```

The raw input is a spectrogram that is 161 x (minibatch x time). The `bd` layers are bi-directional vanilla RNNs. The `CTCCostLinear` layer includes a linear transform to the alphabet size followed by a softmax; in English the alphabet size is 29. The criterion is a CTC loss computed in log space. All non-linearities are clipped ReLU units (max of 20). I will update this with the dataset information soon.
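To make the activation concrete, here is a minimal NumPy sketch of the clipped ReLU described above (an illustrative snippet, not part of the benchmark code):

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    """Clipped ReLU: min(max(x, 0), 20), applied elementwise."""
    return np.minimum(np.maximum(x, 0.0), clip)

print(clipped_relu(np.array([-5.0, 3.0, 25.0])))  # [ 0.  3. 20.]
```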
The dataset should be drawn from the following distribution:
Each second corresponds to 100 input timesteps, as we use a 10 ms step.
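As a quick sanity check on the unit convention (an illustrative helper, not part of the benchmark):

```python
STEP_MS = 10  # non-overlapping 10 ms spectrogram slices

def seconds_to_timesteps(seconds):
    # 1 s of audio -> 100 input timesteps at a 10 ms step
    return int(seconds * 1000) // STEP_MS

print(seconds_to_timesteps(1))   # 100
print(seconds_to_timesteps(15))  # 1500
```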
@ekelsen thanks for the specs! Could we get some information on how you chose the dataset specification?

It is similar to the distribution of one of our training sets.

What is the proper procedure for this benchmark? Are we to generate benchmarks for different input data lengths (1 s, 2 s, 3 s, ..., 15 s) and then take a weighted average of the runtimes using the distribution above?
One could generate a training sample of different data lengths using that. Shubho

On Mon, Jun 13, 2016 at 2:46 PM, nervetumer wrote:

Just wanted to clarify that the benchmark can't test convergence at all -

On Mon, Jun 13, 2016 at 10:56 PM, Shubho Sengupta wrote:
I agree we could do that, but then everyone would be benchmarking a different data set. It may not matter much for an epoch over a large data set, but it seems like we should try to minimize the differences between all the benchmarks. So if we go this route, maybe we should have a small Python script here, with a fixed random number seed and a platform-independent random number generator, that generates the sequence lengths? Or we should choose a publicly available dataset instead of using statistics from a private dataset.

I think we should have a script with a fixed seed to nail down the dataset. Shubho

On Tuesday, June 14, 2016, nervetumer wrote:
I think that having to use real data for performance benchmarks is somewhat annoying and best avoided if possible. @shubho will provide a Python script to generate the dataset (input spectrograms and labels) so that the expected behavior is clear. I really don't think the exact floating point numbers and labels that are chosen will have any impact on the performance (we can test by changing the seed in Python), so if it is easier to generate the dataset in a different language, that should be fine as long as the distribution is the same.

The minibatch is generally chosen to be the largest possible for the longest sequence length, as we tend to keep the minibatch size constant during optimization. (There is work on variable minibatches, but that isn't common, so I don't think it makes sense for the benchmark.) There are some odd performance cliffs when using cuBLAS, like going from a minibatch of 8 to 9, and a minibatch of 96 significantly underperforms a minibatch of 64, so choosing the minibatch that is fastest overall might require some tuning. The Nervana kernels mostly don't have these problems.

I doubt any of the frameworks will be able to exceed a global minibatch of 1024, even on 8 GPUs. It is true that in practice we notice degraded optimization performance beyond this minibatch size, but for benchmarking purposes I don't think we need to worry about that.
The following script should be a reasonable generator of random data for this benchmark. The distribution of utterance lengths is fixed and does not depend on a random number generator, and the generation itself should be quite fast, so it will not affect overall benchmark timing. If the chosen minibatch size does not evenly divide the utterance counts, then the last minibatch at a given utterance length will be smaller than usual. This is not exactly the same behavior as a real training system, where we would lump together sequences of different lengths; if people would prefer that behavior, let us know.

```python
import numpy as np


class DataGenerator:
    """Generates DS2 test data for the DeepMark benchmark.

    Each iteration yields:
    - the utterance length in 10 ms slices (e.g. 1000 for a 10 s
      utterance),
    - a spectrogram filled with random input: a 2-D NumPy array of
      shape 161 x (utt_length * mb_size), where mb_size is the
      user-supplied minibatch size,
    - random label data: a 1-D NumPy array whose length corresponds
      to the utterance length.

    If mb_size does not evenly divide the utterance count, the last
    minibatch for a particular utt_length may be smaller than mb_size.
    """

    # Utterance lengths are in numbers of non-overlapping 10 ms slices
    _utt_lengths = [100, 200, 300, 400, 500, 600, 700,
                    800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
    _counts = [3, 10, 11, 13, 14, 13, 9,
               8, 5, 4, 3, 2, 2, 2, 1]
    _label_lengths = [7, 17, 35, 48, 62, 78, 93, 107,
                      120, 134, 148, 163, 178, 193, 209]
    _freq_bins = 161
    # 29 characters in the English dataset - all equally likely to be
    # selected for now
    _prob_chars = [1 / 29.] * 29
    _chars = list(range(29))
    # number of utterances to generate for a count of 1
    _scale_factor = 10 * 128
    # extra space to allow for different minibatch data even though
    # we only generate one pool of random numbers, for speed
    _extra = 1000

    def __init__(self, minibatch_size):
        self._current = 0
        self._mb_size = minibatch_size
        # Generate all the utterance counts that we need
        self._utt_counts = [self._scale_factor * x for x in self._counts]
        # Only generate random data once so that data generation is as
        # fast as possible and doesn't interfere with benchmark timing
        self._randomness = np.random.randn(
            self._freq_bins,
            minibatch_size * self._utt_lengths[-1] + self._extra
        ).astype(np.float32)

    def __iter__(self):
        return self

    def __next__(self):
        if self._current >= len(self._utt_counts):
            raise StopIteration
        # Consume up to a minibatch of utterances at the current length
        if self._utt_counts[self._current] > self._mb_size:
            mb_size = self._mb_size
            self._utt_counts[self._current] -= self._mb_size
            inc = 0
        else:
            mb_size = self._utt_counts[self._current]
            self._utt_counts[self._current] = 0
            inc = 1
        utt_length = self._utt_lengths[self._current]
        label_length = self._label_lengths[self._current]
        # Slice a random window out of the pre-generated pool
        start = np.random.randint(
            0, self._extra + self._mb_size *
            (self._utt_lengths[-1] - self._utt_lengths[self._current])
        )
        end = start + utt_length * mb_size
        self._current += inc
        return (utt_length,
                self._randomness[:, start:end],
                np.random.choice(self._chars, label_length,
                                 p=self._prob_chars))

    next = __next__  # Python 2 compatibility
```
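Plugging the generator's distribution constants into a quick calculation gives the overall size of the synthetic dataset (an illustrative check, not part of the script):

```python
# Distribution constants copied from the generator above
counts = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]
utt_lengths = [100 * (i + 1) for i in range(15)]  # 1 s .. 15 s
scale = 10 * 128  # utterances generated per unit of count

total_utts = scale * sum(counts)
total_steps = scale * sum(c * l for c, l in zip(counts, utt_lengths))
total_hours = total_steps * 0.01 / 3600  # 10 ms per timestep
avg_seconds = total_steps / total_utts * 0.01

print(total_utts, round(total_hours, 1), round(avg_seconds, 2))
# 128000 utterances, about 211.2 hours of audio, mean length 5.94 s
```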
Sounds great, thanks @ekelsen! Not sure what would be a better fit for the Torch benchmark: should I use a library to access the above Python code from Lua, or rewrite the class in Lua? I personally prefer to rewrite, but whatever is more appropriate!
I feel rewriting is fine - the important parts are the distribution. Shubho

On Wednesday, June 22, 2016, Sean Naren wrote:
@shubho thanks, a bit confused as to how the generator is to be used. Do the steps below cover what the benchmark using the generator is supposed to be?
Yeah, and you can choose the appropriate minibatch that gives you the Shubho

On Thursday, June 23, 2016, Sean Naren wrote:
Awesome, so just to summarise: the benchmark is the average forward/backward/forward+backward time taken to run through the entire synthetic dataset using the DS2 architecture (using the fastest batch size). Steps are:

Sorry if this is obvious, just trying to nail down the details :)
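A minimal sketch of the timing loop being discussed, assuming a hypothetical `step_fn` that runs one forward+backward pass on a minibatch (framework-specific details like GPU synchronization are omitted):

```python
import time

def run_epoch(step_fn, data_iter):
    """Time one pass over the dataset iterator, minibatch by minibatch.

    step_fn is a hypothetical callable taking one generated batch;
    returns (total seconds, number of minibatches).
    """
    total = 0.0
    n_batches = 0
    for batch in data_iter:
        start = time.time()
        step_fn(batch)
        total += time.time() - start
        n_batches += 1
    return total, n_batches

# Example with a no-op step over a stand-in iterator:
total_s, n = run_epoch(lambda batch: None, iter(range(5)))
```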
I usually report total time for forward and back prop, number of Shubho

On Thursday, June 23, 2016, Sean Naren wrote:

Consistency is more important than what I suggested. Shubho

On Thursday, June 23, 2016, Sean Naren wrote:
Interested to see how the other DS2 benchmarks are progressing - any news, guys?

I am planning to get to the internal one this weekend. Shubho

On Thu, Jul 7, 2016 at 5:46 AM, Sean Naren wrote:
Shall we start benchmarking numbers? Just to confirm: we are measuring the time it takes to run through the entire dataset iterator?

@SeanNaren yeah, benchmarking through the dataset iterator sounds right. Time to benchmark!
Some preliminary results to get the ball rolling: I benchmarked 1x Titan and 4x Titan setups of the Torch implementation (shoutout to @digitalreasoning for letting me use their servers!) with 5 epochs of the dataset.

I could only manage a batch of 32 per GPU in memory for the epoch.

I suggest adding a column for multi-GPU scaling to the final presentation.
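Such a column is usually derived from the single-GPU and multi-GPU runtimes; a small sketch of the standard speedup/efficiency calculation (the numbers below are made up for illustration):

```python
def scaling(t_single, t_multi, n_gpus):
    """Speedup and parallel efficiency relative to a 1-GPU run."""
    speedup = t_single / t_multi
    return speedup, speedup / n_gpus

s, e = scaling(100.0, 30.0, 4)  # speedup ~3.33, efficiency ~0.83
```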
@SeanNaren, any particular reason you are using nn.BatchNormalization in BatchBRNN.lua instead of cudnn.BatchNormalization? Thanks for your work on this!

@ngimel cudnn.BatchNormalization only supports batch sizes < 1024 in inference mode; I think that is the point.
@SeanNaren, there are a few ways:

@ngimel sounds fair, will modify the benchmark to use cuDNN BatchNorm!
Hey @shubho, can you give some technical details on the DeepSpeech2 benchmark so that the others can implement it to your exact spec?

Some details:

cc: @SeanNaren @delta2323