
Conformer frontend should fix dimensions, be more standard #219

Open
albertz opened this issue Oct 12, 2022 · 9 comments

albertz commented Oct 12, 2022

The defaults we use for ConformerConvSubsample are wrong.
Maybe the structure of the layer is also wrong.
We should follow more standard code, e.g.:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet2/asr/encoder/conformer_encoder.py#L164

Also see relative positional encoding, #132.
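
For reference, a rough PyTorch sketch of what such a standard frontend looks like (a simplified illustration of the structure, not the actual ESPnet code; class name and dimensions are placeholders): two stride-2 convolutions over the (time, feature) plane, i.e. time downsampling by factor 4, followed by a linear projection to the model dimension.

    import torch
    from torch import nn


    class Conv2dSubsamplingSketch(nn.Module):
        """Sketch of an ESPnet-style conv frontend (not the real ESPnet class):
        two 3x3 convs with stride 2 over (time, features), then a linear layer."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, out_dim, kernel_size=3, stride=2),
                nn.ReLU(),
                nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2),
                nn.ReLU(),
            )
            # remaining feature dim after two stride-2 convs without padding
            feat = ((in_dim - 1) // 2 - 1) // 2
            self.out = nn.Linear(out_dim * feat, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, features)
            x = self.conv(x.unsqueeze(1))  # (batch, channels, time', features')
            b, c, t, f = x.shape
            return self.out(x.transpose(1, 2).reshape(b, t, c * f))  # (batch, time', out_dim)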

@albertz albertz added this to the first-release milestone Oct 12, 2022

albertz commented Oct 12, 2022

Or maybe something in between. I think our pooling is fine?

Note that we should change the dropout as well. The current default dropout (0.1) is probably way too high for such small dimensions. Once we go to 256 dimensions, it is maybe ok and the dropout can stay the same. Not sure.

Maybe we need some experiments first?


albertz commented Oct 17, 2022

I noticed, in ESPnet, when RelPositionalEncoding is used (default), it still scales the conv-prenet output by a factor, see here, specifically:

        self.xscale = math.sqrt(self.d_model)
...
        x = x * self.xscale


albertz commented Oct 20, 2022

I noticed, in ESPnet, when RelPositionalEncoding is used (default), it still scales the conv-prenet output by a factor, see here, specifically:

        self.xscale = math.sqrt(self.d_model)
...
        x = x * self.xscale

This is probably because such a scale is also applied to the word embeddings in the original Transformer.

I tried to find out the motivation for this. I found this CV question. One answer states that it is to get the embeddings into a similar range as the positional encoding, or actually into a larger range than the pos enc, which matters when you add them together. However, in our case we do not actually add them, so I wonder whether the scale is really necessary or helpful, or maybe even hurtful. I guess we need to do an experiment.
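
To make the difference concrete, here is a minimal sketch (my own illustration, not the ESPnet code) of where the scale enters in the absolute case, where the pos enc is added to the embeddings, versus the relative case, where the pos emb is returned separately and only used inside the attention:

    import math
    import torch


    def embed_with_abs_pos_enc(x: torch.Tensor, pe: torch.Tensor, d_model: int) -> torch.Tensor:
        # Absolute pos enc: the sqrt(d_model) scale pushes the embeddings into a
        # larger range than the sinusoidal pos enc before the two are summed,
        # so the pos enc does not dominate the sum.
        return x * math.sqrt(d_model) + pe[:, : x.size(1)]


    def embed_with_rel_pos_enc(x: torch.Tensor, pe: torch.Tensor, d_model: int):
        # Relative pos enc: x and the pos emb are returned separately (the pos emb
        # only enters inside the attention computation), so it is not obvious that
        # scaling x still serves its original purpose.
        return x * math.sqrt(d_model), pe[:, : x.size(1)]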


albertz commented Oct 20, 2022

As we discussed, it should be configurable to support both the RETURNN standard cases and also the ESPnet case (at least mostly, maybe except for the xscale).

For now, we would not set any defaults, as it is not yet clear which ones are best.


albertz commented Oct 20, 2022

So what else is missing after #219?

Striding is one thing. What else?


albertz commented Oct 20, 2022

Ok I added striding. But I'm thinking about refactoring it a bit more. Specifically, current problems:

  • Dropout is missing at the end (after the final linear transformation). The current dropout option is not really useful, as it matches neither the way we at i6 do it nor the way ESPnet does it.
  • In ESPnet, the equivalent of ConformerConvSubsample (e.g. Conv2dSubsampling6) also contains the final linear transformation, but in our case we have that separately. I'm not sure which is better.
  • The ConformerConvSubsample naming is inconsistent with the subsample_conv_... options (the word order is swapped).
  • I don't like all the subsample_conv_... options. I think we should follow more the style of our Transformer and also how ESPnet does it, i.e. allow passing the whole module as an argument (conv_subsample_module or so); see the sketch after this list. In ESPnet, it is a string which then gets translated to the module class. In our Transformer (and the PyTorch standard one) it is the module directly. Then you could easily pass different subsampling prenets. We also should stop extending ConformerConvSubsample more and more, and instead have separate subsampling modules, like ESPnet. I think this is much easier to follow.
  • Why do we have num_heads=8 as default?
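
A rough sketch of what that interface could look like (all names here, e.g. ConformerEncoderSketch and the constructor arguments, are placeholders, not a final API):

    import torch
    from torch import nn


    class ConformerEncoderSketch(nn.Module):
        """Hypothetical interface sketch: the subsampling prenet is passed in as a
        module (like in our Transformer and in PyTorch nn.Transformer), instead of
        being configured via many subsample_conv_... options."""

        def __init__(self, conv_subsample_module: nn.Module, subsample_out_dim: int,
                     model_dim: int, dropout: float):
            super().__init__()
            self.conv_subsample = conv_subsample_module  # any prenet: conv, pooling, ...
            self.input_projection = nn.Linear(subsample_out_dim, model_dim)
            self.input_dropout = nn.Dropout(dropout)  # dropout after the final linear projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.conv_subsample(x)
            x = self.input_dropout(self.input_projection(x))
            # ... Conformer blocks would follow here ...
            return x

With such an interface, passing a different subsampling prenet is just a matter of constructing a different module, and ConformerConvSubsample stays one module among several instead of accumulating more and more options.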


albertz commented Oct 20, 2022

I changed num_heads to 4 by default.


albertz commented Oct 20, 2022

I renamed things as discussed, and merged the options into a single single_layer option.


albertz commented Oct 20, 2022

In ESPnet, the default is cnn_module_kernel: int = 31, but we have conv_kernel_size = 32 as default. Which is more reasonable?
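
One practical difference between the two: an odd kernel size allows exactly symmetric "same" padding, an even one does not. A small standalone illustration (the dimensions here are arbitrary):

    import torch
    from torch import nn

    x = torch.randn(1, 256, 100)  # (batch, channels, time)

    # Odd kernel (31): padding (31 - 1) // 2 = 15 on both sides keeps the time length.
    conv_odd = nn.Conv1d(256, 256, kernel_size=31, padding=15)
    print(conv_odd(x).shape)   # torch.Size([1, 256, 100])

    # Even kernel (32): symmetric padding cannot preserve the length exactly;
    # padding 16 on both sides gives length 101, i.e. one frame too many.
    conv_even = nn.Conv1d(256, 256, kernel_size=32, padding=16)
    print(conv_even(x).shape)  # torch.Size([1, 256, 101])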
