
Attention Layer Bottleneck #24

Open
ValterFallenius opened this issue Mar 15, 2022 · 11 comments
Labels
bug Something isn't working

Comments

ValterFallenius commented Mar 15, 2022

Found a bottleneck: the attention layer
I have found a potential cause of bug #22: the axial attention layer appears to be acting as a bottleneck. I ran the network for 1000 epochs to try to overfit a small subset of 4 samples (see the run on WandB). The network is barely able to reduce the loss and does not overfit the data; it yields a very poor result, something like a mean prediction. See image below:

[Image: yhat against y]

After removing the axial attention layer, the model behaves as expected and overfits the training data; see below after 100 epochs:

[Image: yhat against y without attention]
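For reference, the kind of overfit sanity check described above might look roughly like the sketch below. It assumes a generic PyTorch model and a small fixed batch; the function name, loss, and hyperparameters are hypothetical, not the actual training code.

```python
import torch
import torch.nn.functional as F

def overfit_check(model, x, y, epochs=1000, lr=1e-3):
    """Try to overfit a tiny fixed batch; a healthy model should drive the loss towards zero."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y)  # MSE is just a placeholder loss here
        loss.backward()
        optimizer.step()
        if epoch % 100 == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")
```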

The message from the author quoted in #19 does mention that our implementation of axial attention seems to be very different from theirs. He says: "Our (Google's) heads were small MLPs as far as I remember (I'm not at google anymore so do not have access to the source code)." I am not experienced enough to look into the source code of our axial attention library to see how it differs from theirs.

  1. What are heads in axial attention, and what difference does the number of heads make?
  2. Are we doing both vertical and horizontal attention passes in our implementation?
ValterFallenius added the bug (Something isn't working) label on Mar 15, 2022
jacobbieker (Member) commented:

So, this is the actual implementation being used for axial attention: https://github.com/lucidrains/axial-attention/blob/eff2c10c2e76c735a70a6b995b571213adffbbb7/axial_attention/axial_attention.py#L153-L178 which does seem to do both vertical and horizontal passes. But I just realized that we don't actually do any position embeddings, other than the lat/lon inputs, before passing to the axial attention, so we might need to add that and see what happens. The number of heads is the number of heads for multi-headed attention, so we can probably just set it to one and be fine, I think.
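A minimal sketch of how that class is wired up, following the library's README; the 4-sample, 8-channel, 28×28 tensor below is just the example shape from this thread, not the actual model configuration:

```python
import torch
from axial_attention import AxialAttention

x = torch.randn(4, 8, 28, 28)  # hypothetical RNN output: (batch, channels, height, width)

attn = AxialAttention(
    dim = 8,              # embedding (channel) dimension
    dim_index = 1,        # which tensor axis holds the channels
    heads = 1,            # number of heads for multi-head attention; 1 while debugging
    num_dimensions = 2,   # two spatial axes -> one attention pass along height, one along width
    sum_axial_out = True  # sum the height-wise and width-wise attention outputs
)

out = attn(x)  # shape is preserved: (4, 8, 28, 28)
```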

ValterFallenius (Author) commented:

Okay, how do we do this?

If we have 8 channels in the RNN output with 28×28 height and width, is this embedding information about which pixel we are in? I am struggling a bit to wrap my head around attention and axial attention...

Also, when you say to set the number of heads to 1, you mean for debugging, right? We still want multi-head attention to replicate their model in the end.

jacobbieker (Member) commented:

Yeah, set it to 1 for debugging to get it to overfit first. And yeah, the position embedding encodes which pixel we are in and where that pixel sits relative to the other pixels in the input. The library has a function for it, so we can probably just add it where we use the axial attention: #25
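A minimal sketch of what that could look like with the library's AxialPositionalEmbedding applied just before the attention call; the shapes and hyperparameters are the same hypothetical example as above, not the actual change in #25:

```python
import torch
from axial_attention import AxialAttention, AxialPositionalEmbedding

x = torch.randn(4, 8, 28, 28)  # hypothetical RNN output: (batch, channels, height, width)

# Learned positional embedding per spatial location, so attention knows
# which pixel each feature vector came from.
pos_emb = AxialPositionalEmbedding(dim = 8, shape = (28, 28))

attn = AxialAttention(dim = 8, dim_index = 1, heads = 1, num_dimensions = 2)

x = pos_emb(x)  # adds position information; shape unchanged
out = attn(x)   # (4, 8, 28, 28)
```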

JackKelly (Member) commented Mar 15, 2022

> I just realized that we don't actually do any position embeddings, other than the lat/lon inputs, before passing to the axial attention. So we might need to add that and see what happens?

I don't know if it's relevant, but it recently occurred to me that MetNet version 1 is quite similar to the Temporal Fusion Transformer (TFT) (also from Google!), except MetNet has 2 spatial dimensions, whilst TFT is for time series without any (explicit) spatial dimensions. In particular, both TFT and MetNet use an RNN followed by multi-head attention. In the TFT paper, they claim that the RNN generates a kind of learnt position encoding, so they don't bother with a "hand-crafted" position encoding.

The TFT paper says:

> [The LSTM] also serves as a replacement for standard positional encoding, providing an appropriate inductive bias for the time ordering of the inputs.

ValterFallenius (Author) commented:

I can confirm that initial tests show promising results; the network seems to learn something now :) I'll be back with more results in a few days.

peterdudfield (Contributor) commented:

@all-contributors please add @jacobbieker for code

allcontributors (Contributor) commented:

@peterdudfield

I've put up a pull request to add @jacobbieker! 🎉

peterdudfield (Contributor) commented:

@all-contributors please add @JackKelly for code

allcontributors (Contributor) commented:

@peterdudfield

I've put up a pull request to add @JackKelly! 🎉

peterdudfield (Contributor) commented:

@all-contributors please add @ValterFallenius for userTesting

allcontributors (Contributor) commented:

@peterdudfield

I've put up a pull request to add @ValterFallenius! 🎉
