DeepSpeech2 benchmark technical details #1

Open
soumith opened this issue Jun 2, 2016 · 32 comments

Comments

@soumith
Contributor

soumith commented Jun 2, 2016

Hey @shubho, can you give some technical details on the DeepSpeech2 benchmark so that others can implement it to your exact spec?

Some details:

  • Exact architecture
  • Criterion
  • The synthetic dataset: sample length, dimensionality, etc.
  • Any other detail that would be important

cc: @SeanNaren @delta2323

@shubho

shubho commented Jun 2, 2016

Hi Soumith,

I am traveling till June 12th and will be on the internet intermittently - Erich and David can fill in the details.

Thanks

Shubho

@soumith
Contributor Author

soumith commented Jun 3, 2016

awesome thanks.

@ekelsen

ekelsen commented Jun 9, 2016

The network specs are as follows:

{
    "connectivity": [
        "conv2d_1",
        "conv2d_2",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "fc",
        "ctc"
    ],
    "layers": {
        "bd": {
            "batch_norm": true,
            "dim": 1760,
            "type": "RecurrentLinear"
        },
        "conv2d_1": {
            "batch_norm": true,
            "channels": 1,
            "context_h": 5,
            "context_w": 20,
            "filters": 32,
            "is_same_w": true,
            "stride_h": 2,
            "stride_w": 2,
            "type": "Conv2DPackage"
        },
        "conv2d_2": {
            "batch_norm": true,
            "channels": 32,
            "context_h": 5,
            "context_w": 10,
            "filters": 32,
            "is_same_w": true,
            "stride_h": 1,
            "stride_w": 2,
            "type": "Conv2DPackage"
        },
        "ctc": {
            "type": "CTCCostLinear"
        },
        "fc": {
            "batch_norm": true,
            "dim": 1760,
            "type": "FullyConnected"
        }
    }
}

The raw input is a spectrogram that is 161 x (minibatch x time).

bd layers are bi-directional vanilla RNNs

The CTCCostLinear layer includes a linear transform to the alphabet size followed by a softmax. In English the alphabet size is 29. The criterion is a CTC loss done in logspace.
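
As a small illustration of why the CTC loss is computed in log space (a sketch, not the implementation used for this benchmark): the product of per-timestep probabilities over a long utterance underflows in float32, while the sum of their logs stays finite. The probability value 0.5 is made up for the example; `np.logaddexp` is the NumPy routine for adding two probabilities that are stored as logs.

```python
import numpy as np

# Hypothetical per-timestep probability of one alignment path over a
# 1500-step (15 s) utterance; 0.5 is an arbitrary illustrative value.
step_probs = np.full(1500, 0.5, dtype=np.float32)

# Multiplying the probabilities directly underflows to zero in float32.
direct = np.prod(step_probs)

# Summing the logs keeps the same quantity representable.
log_path = np.sum(np.log(step_probs.astype(np.float64)))

# CTC sums over many paths; in log space that sum is a log-add-exp.
log_two_paths = np.logaddexp(log_path, log_path)
```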

All non-linearities are clipped ReLU units (max of 20).
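
A minimal NumPy sketch of the clipped ReLU described above, assuming the usual definition min(max(x, 0), 20). The function name and the `ceiling` parameter are hypothetical, chosen only for this example.

```python
import numpy as np

def clipped_relu(x, ceiling=20.0):
    """Clipped ReLU: zero below 0, identity up to ceiling, flat above it."""
    return np.minimum(np.maximum(x, 0.0), ceiling)

# Values below zero, inside the range, at the ceiling and above it.
out = clipped_relu(np.array([-3.0, 0.5, 20.0, 42.0]))
```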

I will update this with the dataset information soon.

@ekelsen

ekelsen commented Jun 9, 2016

The dataset should be drawn from the following distribution:

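| Length (sec) | Frequency (percent) | Label Length |
|--------------|---------------------|--------------|
| 1            | 3.0                 | 7            |
| 2            | 10.0                | 17           |
| 3            | 11.0                | 35           |
| 4            | 13.0                | 48           |
| 5            | 14.0                | 62           |
| 6            | 13.0                | 78           |
| 7            | 9.0                 | 93           |
| 8            | 8.0                 | 107          |
| 9            | 5.0                 | 120          |
| 10           | 4.0                 | 134          |
| 11           | 3.0                 | 148          |
| 12           | 2.0                 | 163          |
| 13           | 2.0                 | 178          |
| 14           | 2.0                 | 193          |
| 15           | 1.0                 | 209          |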

Each second corresponds to 100 input timesteps as we use a 10ms step.
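
To make the distribution concrete, the sketch below checks that the frequencies sum to 100 percent and derives the mean utterance length in seconds and in 10ms timesteps. The variable names are hypothetical; the numbers are copied from the table above.

```python
# (length in seconds, frequency in percent, label length) from the table above
DISTRIBUTION = [
    (1, 3.0, 7), (2, 10.0, 17), (3, 11.0, 35), (4, 13.0, 48),
    (5, 14.0, 62), (6, 13.0, 78), (7, 9.0, 93), (8, 8.0, 107),
    (9, 5.0, 120), (10, 4.0, 134), (11, 3.0, 148), (12, 2.0, 163),
    (13, 2.0, 178), (14, 2.0, 193), (15, 1.0, 209),
]
STEPS_PER_SECOND = 100  # one input timestep every 10 ms

total_percent = sum(freq for _, freq, _ in DISTRIBUTION)
# Frequency-weighted mean utterance length.
mean_seconds = sum(sec * freq for sec, freq, _ in DISTRIBUTION) / total_percent
mean_timesteps = mean_seconds * STEPS_PER_SECOND
```

With the table's numbers the mean utterance is 5.94 seconds long, i.e. 594 input timesteps.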

@SeanNaren
Contributor

@ekelsen thanks for the specs! Could we get some information on how you chose the dataset specification?

@ekelsen

ekelsen commented Jun 9, 2016

It is similar to the distribution of one of our training sets.

@nervetumer

What is the proper procedure for this benchmark? Are we to generate benchmarks for different input data lengths (1s, 2s, 3s, ..., 15s) and then take a weighted average of the runtimes using the distribution above?

@shubho

shubho commented Jun 14, 2016

One could generate a training sample of different data lengths using that
distribution, form minibatches so that each minibatch has utterances of equal
length (that gets around the zero padding problem), and go from there.
Minibatches should be as large as possible, but anything above 128 per GPU will
either hit GPU memory limits first or be unusable in practice (assuming
multi-GPU training with 8 GPUs) due to convergence issues.

Shubho
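
The grouping described above (every minibatch holds utterances of a single length, so no zero padding is needed) could be sketched as follows. This is an illustration under assumed names, not code from the benchmark; `equal_length_minibatches` and the toy sample are hypothetical.

```python
import random
from collections import defaultdict

def equal_length_minibatches(utt_lengths, mb_size):
    """Group utterance indices so every minibatch holds a single length."""
    buckets = defaultdict(list)
    for idx, length in enumerate(utt_lengths):
        buckets[length].append(idx)
    batches = []
    for length in sorted(buckets):
        ids = buckets[length]
        # The last slice of a bucket may be smaller than mb_size.
        for start in range(0, len(ids), mb_size):
            batches.append((length, ids[start:start + mb_size]))
    return batches

# Hypothetical toy sample: lengths in 10 ms timesteps, fixed seed.
rng = random.Random(0)
lengths = [rng.choice([100, 200, 300]) for _ in range(50)]
batches = equal_length_minibatches(lengths, mb_size=8)
```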

@shubho

shubho commented Jun 14, 2016

Just wanted to clarify that the benchmark can't test convergence at all,
so maybe the minibatch should simply be as large as will fit in GPU memory.

@nervetumer

I agree we could do that, but then everyone will be benchmarking a different data set. It may not matter much for a large data set epoch, but it seems like we should try to minimize the differences between all the benchmarks. So if we go this route, maybe we should have a small Python script here, with a fixed random number seed and a platform-independent random number generator, that generates the sequence lengths? Or should we choose a publicly available dataset instead of using statistics from a private dataset?

@shubho

shubho commented Jun 14, 2016

I think we should have a script with a fixed seed to nail down the dataset.

Shubho

@ekelsen

ekelsen commented Jun 14, 2016

I think that having to use real data for performance benchmarks is somewhat annoying and best avoided if possible.

@shubho will provide a Python script to generate the dataset (input spectrograms and labels) so that the expected behavior is clear. I really don't think the exact floating point numbers and labels that are chosen will have any impact on the performance (we can test by changing the seed in Python), so if it is easier to generate the dataset in a different language, that should be fine as long as the distribution is the same.

The minibatch is generally chosen to be the largest possible for the longest sequence length as we tend to keep the minibatch size constant during optimization. (There is work on variable mini-batches, but that isn't common, so I don't think that makes sense for the benchmark).

There are some odd performance cliffs when using CuBLAS, like going from a minibatch of 8 -> 9, and a minibatch of 96 significantly underperforms a minibatch of 64, so choosing the minibatch that is fastest overall might require some tuning. The nervana kernels mostly don't have these problems.

I doubt any of the frameworks will be able to exceed a global mini-batch of 1024, even on 8 GPUs. But it is true that in practice we notice degraded optimization performance beyond this mini-batch size. For benchmarking purposes I don't think we need to worry about that though.
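
Since the fastest minibatch size may need tuning, one way to find it is a simple sweep. The sketch below is only an illustration: `run_epoch` is a hypothetical callable standing in for a full pass of the benchmark at a given minibatch size, and `fake_epoch` is a placeholder workload.

```python
import time

def fastest_minibatch(run_epoch, candidates):
    """Time one pass per candidate minibatch size and return the fastest."""
    timings = {}
    for mb_size in candidates:
        start = time.perf_counter()
        run_epoch(mb_size)
        timings[mb_size] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings

def fake_epoch(mb_size):
    # Placeholder workload standing in for a real forward/backward pass.
    sum(range(1000))

# Candidate sizes include the cliffs mentioned above (8 -> 9, 64 vs 96).
candidates = [8, 9, 64, 96, 128]
best, timings = fastest_minibatch(fake_epoch, candidates)
```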

@ekelsen

ekelsen commented Jun 21, 2016

The following script should be a reasonable generator for random data for this benchmark. The distribution of utterance lengths is fixed and does not depend on a random number generator and the generation itself should be quite fast and not affect overall benchmark timing.

If the chosen minibatch size does not evenly divide the number of utterances of a given length (each count is a multiple of 10 * 128 = 1280, so powers of two up to 256 are safe), then the last minibatch of that utterance length will be smaller than usual. This is not exactly the same behavior as a real training system, where we would lump together sequences of different lengths. If people would prefer that behavior, let us know.

import numpy as np

class DataGenerator:

    """Generates DS2 test data for DeepMark benchmark.

       Returns utterance length in number of 10ms slices. So utt_length
       is set to 1000 for a 10s utterance.

       Returns spectrogram filled with random input. This is a
       two-dimensional Numpy array with dimensions
       161 x (utt_length * mb_size) where mb_size is the user supplied
       minibatch size.

       If mb_size does not evenly divide the number of utterances
       generated for a particular utt_length, then the last minibatch
       for that utt_length will be smaller than mb_size.

       Returns label data filled with random input. This is a
       one-dimensional Numpy array with dimensions
       label length corresponding to the utterance length.

    """

    ### Set up initial state
    # Utterance lengths are in number of non-overlapping 10ms slices
    _utt_lengths = [100, 200, 300, 400, 500, 600, 700,
                    800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
    _counts = [3, 10, 11, 13, 14, 13, 9,
               8, 5, 4, 3, 2, 2, 2, 1]
    _label_lengths = [7, 17, 35, 48, 62, 78, 93, 107,
                      120, 134, 148, 163, 178, 193, 209]
    _freq_bins = 161

    # 29 characters in english dataset - all equally likely to be
    # selected for now
    _prob_chars = [1 / 29.] * 29
    _chars = range(29)

    # minimum number of utterances to generate for a count of 1
    _scale_factor = 10 * 128

    # extra space to allow for different minibatch data even though
    # we only generate one set of random numbers for speed
    _extra = 1000

    def __init__(self, minibatch_size):
        self._current = 0
        self._mb_size = minibatch_size

        # Generate all the utterance lengths that we need
        self._utt_counts = [self._scale_factor * x for x in self._counts]

        # only generate random data once so that the data generation
        # is as fast as possible and doesn't interfere with benchmark
        # timing
        self._randomness = np.random.randn(self._freq_bins,
                                           minibatch_size *
                                           (self._utt_lengths[-1]) +
                                           self._extra
                                           ).astype(np.float32)

    def __iter__(self):
        return self

    def next(self):
        if self._current >= len(self._utt_counts):
            raise StopIteration
        else:
            # Generate an utterance length
            if (self._utt_counts[self._current] > self._mb_size):
                mb_size = self._mb_size
                self._utt_counts[self._current] -= self._mb_size
                inc = 0
            else:
                mb_size = self._utt_counts[self._current]
                self._utt_counts[self._current] = 0
                inc = 1

            utt_length = self._utt_lengths[self._current]

            # Create random label data
            label_length = self._label_lengths[self._current]

            start = np.random.randint(0, self._extra +
                                         self._mb_size *
                                             (self._utt_lengths[-1] -
                                              self._utt_lengths[self._current])
                                     )
            end = start + utt_length * mb_size

            self._current += inc

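            # Pass the probabilities by keyword: the third positional
            # argument of np.random.choice is `replace`, not `p`.
            return utt_length, \
                   self._randomness[:, start:end], \
                   np.random.choice(self._chars, label_length,
                                    p=self._prob_chars)

    # Python 3 looks up __next__ for iteration; next() is kept for Python 2.
    __next__ = next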
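
The size of one pass over the generator follows directly from its constants. The sketch below recomputes it independently of the class (the function name `epoch_summary` is hypothetical): it returns the total number of utterances, the number of minibatches, and how many utterance lengths end with a short minibatch.

```python
import math

# Same values as DataGenerator._counts and DataGenerator._scale_factor.
COUNTS = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]
SCALE_FACTOR = 10 * 128

def epoch_summary(mb_size):
    """Return (total utterances, minibatches, lengths with a short last batch)."""
    utts = [SCALE_FACTOR * c for c in COUNTS]
    total_utts = sum(utts)
    # The generator emits full minibatches, then one final remainder.
    n_minibatches = sum(int(math.ceil(u / float(mb_size))) for u in utts)
    n_short = sum(1 for u in utts if u % mb_size != 0)
    return total_utts, n_minibatches, n_short

summary_32 = epoch_summary(32)
summary_96 = epoch_summary(96)
```

For a minibatch size of 32 this gives 128000 utterances in 4000 minibatches with no short minibatch, while a size of 96 leaves a short final minibatch for 12 of the 15 utterance lengths.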

@SeanNaren
Contributor

Sounds great, thanks @ekelsen! Not sure what would be more fit for the torch benchmark; should I use a library to access the above python code in lua, or rewrite the class in lua? I personally prefer to rewrite, but whatever is more appropriate!

@shubho

shubho commented Jun 22, 2016

I feel rewriting is fine - the important parts are the distribution, the total
number of samples and the way they are divided into minibatches.

Shubho

@SeanNaren
Contributor

SeanNaren commented Jun 23, 2016

@shubho thanks, bit confused as to how the generator is to be used. Do the below steps cover what the benchmark using the generator is supposed to be?

  1. generator:next()
  2. Forward pass, record forward time
  3. Backward pass, record backward time
  4. loop from 1. until iterator finished
  5. Average for each loop, Sum times

@shubho

shubho commented Jun 23, 2016

Yeah and you can choose the appropriate minibatch that gives you the
fastest time.

Shubho

@SeanNaren
Contributor

Awesome so just to summarise:

Benchmark is the average forward/backward/forward+backward time taken to run through the entire synthetic dataset using the DS2 architecture (using fastest batch size).

Steps are:

  1. generator:next()
  2. Forward pass, record forward time
  3. Backward pass, record backward time
  4. loop from 1. until iterator finished
  5. Report average forward time/backward time/ forward+backward time

Sorry if this is obvious, just trying to nail the details :)
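
The steps above can be sketched as a timing loop. This is a framework-agnostic illustration, not the Torch benchmark itself: `benchmark`, the toy batches and the lambda stand-ins for the model are all hypothetical. It keeps both totals and averages so either reporting style is available.

```python
import time

def benchmark(batches, forward, backward):
    """Time forward and backward separately over every minibatch."""
    fwd, bwd = [], []
    for batch in batches:
        # On a GPU, synchronize the device before each clock read,
        # since kernel launches are asynchronous.
        t0 = time.perf_counter()
        out = forward(batch)
        t1 = time.perf_counter()
        backward(out)
        t2 = time.perf_counter()
        fwd.append(t1 - t0)
        bwd.append(t2 - t1)
    n = len(fwd)
    if n == 0:
        raise ValueError("no minibatches to time")
    return {
        "minibatches": n,
        "forward_total_s": sum(fwd),
        "backward_total_s": sum(bwd),
        "forward_avg_s": sum(fwd) / n,
        "backward_avg_s": sum(bwd) / n,
        "forward_backward_avg_s": (sum(fwd) + sum(bwd)) / n,
    }

# Hypothetical stand-ins for a real model and data generator.
toy_batches = [[1.0] * 100, [2.0] * 200, [3.0] * 300]
report = benchmark(toy_batches,
                   forward=lambda b: [2 * v for v in b],
                   backward=lambda out: sum(out))
```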

@shubho

shubho commented Jun 23, 2016

I usually report total time for forward and back prop, number of
minibatches and also average but I haven't checked DeepMark's requirements.
What is reported should be consistent across all networks.

Shubho

@SeanNaren
Contributor

@shubho, I was going off the convnet benchmark structure for forward/backward/forward+backward average, but if you think total is a better measurement, that could be an alternative. @soumith, what would you suggest?

@shubho

shubho commented Jun 23, 2016

Consistency is more important than what I suggested.

Shubho

@SeanNaren
Contributor

Interested to see how the other DS2 benchmarks are progressing, any news guys?
cc @shubho @ekelsen @soumith @nervetumer

@shubho

shubho commented Jul 7, 2016

I am planning to get to the internal one this weekend.

Shubho

@SeanNaren
Contributor

Shall we start benchmarking numbers? Just to confirm we are measuring the time it takes to run through the entire dataset iterator?

@soumith
Contributor Author

soumith commented Jul 27, 2016

@SeanNaren yea benchmarking through the dataset iterator sounds right. Time to benchmark!

@SeanNaren
Contributor

Some preliminary results to get the ball rolling, I benchmarked a 1xTitan and a 4xTitan setup of the Torch implementation (shoutout to @digitalreasoning for letting me use their servers!) with 5 epochs of the dataset:

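| Hardware  | Time (ms) | forward (ms) | backward (ms) | Samples processed | Samples processed per second | Seconds of audio processed per second | Epoch time (s) |
|-----------|-----------|--------------|---------------|-------------------|------------------------------|---------------------------------------|----------------|
| 1x Titans | 154       | 83           | 72            | 128000            | 32                           | 189                                   | 4013           |
| 4x Titans | 180       | 99           | 81            | 128000            | 106                          | 632                                   | 1204           |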

I could only fit a batch size of 32 per GPU in memory for the epoch.

@pooyadavoodi
Contributor

I suggest adding a column for multi-GPU scaling in the final presentation.
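
A scaling column could be derived from the epoch times already reported. The sketch below assumes the usual definitions (speedup = baseline time / time, efficiency = speedup / GPU ratio); the function name is hypothetical and the two epoch times are taken from the preliminary results above.

```python
# Epoch times in seconds from the preliminary results, keyed by GPU count.
epoch_time_s = {1: 4013.0, 4: 1204.0}

def scaling(times, baseline_gpus=1):
    """Return {gpus: (speedup, efficiency)} relative to the baseline run."""
    base = times[baseline_gpus]
    rows = {}
    for gpus, t in sorted(times.items()):
        speedup = base / t
        efficiency = speedup / (gpus / float(baseline_gpus))
        rows[gpus] = (speedup, efficiency)
    return rows

rows = scaling(epoch_time_s)
```

For the numbers above this gives a speedup of about 3.3x on 4 GPUs, or roughly 83% scaling efficiency.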

@ngimel

ngimel commented Aug 23, 2016

@SeanNaren, any particular reason you are not using cudnn.BatchNormalization in BatchBRNN.lua and use nn.BatchNormalization instead? Thanks for your work on this!

@seed93

seed93 commented Aug 23, 2016

@ngimel cudnn.BatchNormalization only supports batchsize < 1024 in inference mode. I think this is the point.

@SeanNaren
Contributor

SeanNaren commented Aug 23, 2016

@ngimel What @seed93 said :) Thanks for the changes on the benchmark, I'll find time to re-run these!

EDIT: though if you manage to find a way we could use cuDNN batch norm, please let me know (we could add cudnn.BatchNorm since we are only training; I would be up for opinions on this)!

@ngimel

ngimel commented Aug 24, 2016

@SeanNaren, there are a few ways:

  1. you can use cudnn batchnorm for training, and switch to nn for inference, since the modules are compatible.
  2. cudnn bindings can be modified for inference to call cudnn batchnorm a few times with smaller batch sizes - bn at inference time is a pointwise operation, so results should not be affected.
  3. I'll also check if this limitation can be removed for future cudnn versions.

We've seen speedups using cudnn.BatchNorm instead of nn.BatchNorm, so I see no reason not to use it for the training benchmark.
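
Option 2 relies on inference-mode batch norm being a pointwise transform with fixed statistics, so splitting the batch does not change the result. The NumPy sketch below illustrates that property only; it is not the cudnn binding, and the function names and the chunk size of 1000 are hypothetical.

```python
import numpy as np

def batchnorm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch norm using fixed running statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def batchnorm_inference_chunked(x, mean, var, gamma, beta,
                                max_batch=1000, eps=1e-5):
    """Apply the same transform in slices of at most max_batch rows."""
    pieces = [batchnorm_inference(x[i:i + max_batch], mean, var,
                                  gamma, beta, eps)
              for i in range(0, x.shape[0], max_batch)]
    return np.concatenate(pieces, axis=0)

rng = np.random.RandomState(0)
x = rng.randn(2500, 8)                       # batch of 2500, 8 features
mean, var = rng.randn(8), rng.rand(8) + 0.5  # made-up running statistics
gamma, beta = rng.randn(8), rng.randn(8)     # made-up scale and shift

full = batchnorm_inference(x, mean, var, gamma, beta)
chunked = batchnorm_inference_chunked(x, mean, var, gamma, beta)
```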

@SeanNaren
Contributor

@ngimel sounds fair, I will modify the benchmark to use cuDNN BatchNorm!
I don't want to spam this issue, but I want to address a few things with the Torch implementation, so I'll open a separate issue.
