
Conformer frontend should fix dimensions, be more standard #219

Open
albertz opened this issue Oct 12, 2022 · 9 comments

albertz commented Oct 12, 2022

The defaults we use for ConformerConvSubsample are wrong.
Maybe the structure of the layer is also wrong.
We should follow more standard code, e.g.:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet2/asr/encoder/conformer_encoder.py#L164

Also see relative positional encoding, #132.
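
For reference, a rough PyTorch sketch of what such a standard frontend looks like (a simplified illustration of the structure, not the actual ESPnet code; class name and dimensions are placeholders): two stride-2 convolutions over the (time, feature) plane, i.e. time downsampling by factor 4, followed by a linear projection to the model dimension.

    import torch
    from torch import nn


    class Conv2dSubsamplingSketch(nn.Module):
        """Sketch of an ESPnet-style conv frontend (not the real ESPnet class):
        two 3x3 convs with stride 2 over (time, features), then a linear layer."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, out_dim, kernel_size=3, stride=2),
                nn.ReLU(),
                nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2),
                nn.ReLU(),
            )
            # remaining feature dim after two stride-2 convs without padding
            feat = ((in_dim - 1) // 2 - 1) // 2
            self.out = nn.Linear(out_dim * feat, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, features)
            x = self.conv(x.unsqueeze(1))  # (batch, channels, time', features')
            b, c, t, f = x.shape
            return self.out(x.transpose(1, 2).reshape(b, t, c * f))  # (batch, time', out_dim)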

@albertz albertz added this to the first-release milestone Oct 12, 2022

albertz commented Oct 12, 2022

Or maybe something in between. I think our pooling is fine?

Note that we should change the dropout as well. The current default dropout (0.1) is probably way too high for such small dimensions. Once we go to 256 dimensions, it is maybe ok and the dropout can stay the same. Not sure.

Maybe we need some experiments first?


albertz commented Oct 17, 2022

I noticed, in ESPnet, when RelPositionalEncoding is used (default), it still scales the conv-prenet output by a factor, see here, specifically:

        self.xscale = math.sqrt(self.d_model)
...
        x = x * self.xscale


albertz commented Oct 20, 2022

I noticed, in ESPnet, when RelPositionalEncoding is used (default), it still scales the conv-prenet output by a factor, see here, specifically:

        self.xscale = math.sqrt(self.d_model)
...
        x = x * self.xscale

This is probably because such a scale is also applied to the word embeddings in the original Transformer.

I tried to find out the motivation for this. I found this CV question. One answer states that it is to get the embeddings into a similar range as the positional encoding, or actually into a larger range than the pos enc, which matters when you add them together. However, in our case we do not actually add them, so I wonder whether the scale is really necessary or helpful, or maybe even hurtful. I guess we need to do an experiment.
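
To make the difference concrete, here is a minimal sketch (my own illustration, not the ESPnet code) of where the scale enters in the absolute case, where the pos enc is added to the embeddings, versus the relative case, where the pos emb is returned separately and only used inside the attention:

    import math
    import torch


    def embed_with_abs_pos_enc(x: torch.Tensor, pe: torch.Tensor, d_model: int) -> torch.Tensor:
        # Absolute pos enc: the sqrt(d_model) scale pushes the embeddings into a
        # larger range than the sinusoidal pos enc before the two are summed,
        # so the pos enc does not dominate the sum.
        return x * math.sqrt(d_model) + pe[:, : x.size(1)]


    def embed_with_rel_pos_enc(x: torch.Tensor, pe: torch.Tensor, d_model: int):
        # Relative pos enc: x and the pos emb are returned separately (the pos emb
        # only enters inside the attention computation), so it is not obvious that
        # scaling x still serves its original purpose.
        return x * math.sqrt(d_model), pe[:, : x.size(1)]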


albertz commented Oct 20, 2022

As we discussed, it should be configurable to support both the RETURNN standard cases and also the ESPnet case (at least mostly, maybe except for the xscale).

For now, we would not set any defaults, as it is not yet clear which ones are best.


albertz commented Oct 20, 2022

So what else is missing after #219?

Striding is one thing. What else?


albertz commented Oct 20, 2022

Ok I added striding. But I'm thinking about refactoring it a bit more. Specifically, current problems:

  • Dropout is missing at the end (after the final linear transformation). The current dropout option is not really useful, as it matches neither the way we at i6 do it nor the way ESPnet does it.
  • In ESPnet, the equivalent of ConformerConvSubsample (e.g. Conv2dSubsampling6) also contains the final linear transformation, but in our case we have that separately. I'm not sure which is better.
  • The ConformerConvSubsample naming is inconsistent with the subsample_conv_... options (the word order is swapped).
  • I don't like all the subsample_conv_... options. I think we should follow more the style of our Transformer and also how ESPnet does it, i.e. allow passing the whole module as an argument (conv_subsample_module or so); see the sketch after this list. In ESPnet, it is a string which then gets translated to the module class. In our Transformer (and the PyTorch standard one) it is the module directly. Then you could easily pass different subsampling prenets. We also should stop extending ConformerConvSubsample more and more, and instead have separate subsampling modules, like ESPnet. I think this is much easier to follow.
  • Why do we have num_heads=8 as default?
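
A rough sketch of what that interface could look like (all names here, e.g. ConformerEncoderSketch and the constructor arguments, are placeholders, not a final API):

    import torch
    from torch import nn


    class ConformerEncoderSketch(nn.Module):
        """Hypothetical interface sketch: the subsampling prenet is passed in as a
        module (like in our Transformer and in PyTorch nn.Transformer), instead of
        being configured via many subsample_conv_... options."""

        def __init__(self, conv_subsample_module: nn.Module, subsample_out_dim: int,
                     model_dim: int, dropout: float):
            super().__init__()
            self.conv_subsample = conv_subsample_module  # any prenet: conv, pooling, ...
            self.input_projection = nn.Linear(subsample_out_dim, model_dim)
            self.input_dropout = nn.Dropout(dropout)  # dropout after the final linear projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.conv_subsample(x)
            x = self.input_dropout(self.input_projection(x))
            # ... Conformer blocks would follow here ...
            return x

With such an interface, passing a different subsampling prenet is just a matter of constructing a different module, and ConformerConvSubsample stays one module among several instead of accumulating more and more options.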


albertz commented Oct 20, 2022

I changed num_heads to 4 by default.


albertz commented Oct 20, 2022

I renamed things as discussed, and merged the options into a single single_layer option.


albertz commented Oct 20, 2022

In ESPnet, the default is cnn_module_kernel: int = 31, but we have conv_kernel_size = 32 as default. Which is more reasonable?
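
One practical difference between the two: an odd kernel size allows exactly symmetric "same" padding, an even one does not. A small standalone illustration (the dimensions here are arbitrary):

    import torch
    from torch import nn

    x = torch.randn(1, 256, 100)  # (batch, channels, time)

    # Odd kernel (31): padding (31 - 1) // 2 = 15 on both sides keeps the time length.
    conv_odd = nn.Conv1d(256, 256, kernel_size=31, padding=15)
    print(conv_odd(x).shape)   # torch.Size([1, 256, 100])

    # Even kernel (32): symmetric padding cannot preserve the length exactly;
    # padding 16 on both sides gives length 101, i.e. one frame too many.
    conv_even = nn.Conv1d(256, 256, kernel_size=32, padding=16)
    print(conv_even(x).shape)  # torch.Size([1, 256, 101])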
