
About the Duplex attention #10

Closed
AndrewChiyz opened this issue Apr 13, 2021 · 7 comments

Comments


AndrewChiyz commented Apr 13, 2021

Hi, thanks for sharing the code!

I have a few questions about Section 3.1.2. Duplex attention.

  1. I am confused by the notation in this section. For example, it says "Y = (K^{P\times d}, V^{P\times d}), where the values store the content of the Y variables (e.g. the randomly sampled latents for the case of GAN)". Does this mean that V^{P\times d} is sampled from the original variable Y? And how is P set in your code?

  2. "keys track the centroids of the attention-based assignments from X to Y, which can be computed as K=a_b(Y, X)", does it mean K is calculated by using the self-attention module but with (Y, X) as input? If so, how to understand “the keys track the centroid of the attention-based assignments from X to Y”? BTW, how to get the centroids?

  3. For the update rule in duplex attention, what does the a() function mean? Does it denote a self-attention module like a_b() in Section 3.1.1, with X as queries, K as keys, and V as values? If so, since K is itself calculated from another self-attention module as mentioned in question 2, the output of a_b(Y, X) would be treated as the keys, so the update rule would contain two self-attention operations. Is that right? Is that why it is called 'duplex' attention?

  4. However, I suspect I may be wrong after reading the last paragraph of this section: "to support bidirectional interaction between elements, we can chain two reciprocal simplex attentions from X to Y and from Y to X, obtaining the duplex attention". So does it mean that we first compute Y with a simplex attention module u^a(Y, X), and then use this Y as input to u^d(X, Y) to update X? Would the duplex attention module then contain three self-attention operations?

Thanks a lot! :)


dorarad commented Apr 21, 2021

Hi! :) So sorry for the long delay in my response. I hope to get back to you within about two days at most, and will then go over all open issues!


dorarad commented Apr 26, 2021

Hi, so sorry for the delay in my response!

Several people indeed indicated that the notation regarding the key-value description in the paper is a bit confusing, and I plan to upload a new version of the paper with that aspect fixed by tomorrow.

  1. P is a typo; it should be m, the number of latent variables in the transformer. Y = (K, V) simply means that instead of associating just one vector with each latent variable, we now associate two separate vectors with it, a key and a value, where the key tracks the mean image assignments and the value is the standard latent variable (a normally sampled vector that is then passed through g_mapping).
    To give an example: intuitively we want different latent variables to be associated with semantic regions or objects in the image, so if a latent variable y = (k, v) is responsible for generating a blue sphere, then k will track the location of the sphere segment within the image, while v will contain information about how to "paint/generate" this segment, e.g. the "blueness" of the sphere to be created, its material, or any other property of it.

  2. We get the centroids by computing the bipartite attention between the set of keys K (initialized to some trainable parameters) and the elements in X (the HxW image features). Think about how the standard k-means algorithm works: you have a set of elements X and you want to assign them to centroids:

  • You initialize k centroids to some vectors, then compute a distance/similarity between each element and each centroid; passing it through a softmax gives the extent to which each element should be assigned to each centroid.
  • Then you need to update each centroid to be the mean of the X elements assigned to it. This can be done by computing a weighted mean of the assigned X elements and updating the K vectors accordingly.
    If you think about it carefully, what I described is exactly the computation of the bipartite transformer between the sets K and X! (X is used to update the Ks; to see that, read through the 3.1.1 equations.) A minimal code sketch of this two-step update follows this reply.

  3. Yes, a is the bipartite attention (not self-attention) module as defined in Section 3.1.1. The output of a_b(Y, X) is used as the keys, which are in turn used to update X in the opposite direction. That gives us two attention modules in opposite directions overall, which is why we call it duplex attention!

  4. No, duplex attention contains two attention modules: from X to K (which is part of Y), and then from Y back to X. I think what might cause the confusion is that I referred to computing attention from X to Y and from X to K interchangeably; I will fix that in the new version of the paper. If the answer to this question is still unclear, please let me know.

Hope it helps, and let me know if you have any further questions! :)
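
To make the two-step update described in the reply above concrete, here is a minimal, self-contained NumPy sketch of duplex attention as explained in this thread: a soft k-means step that updates the centroids K from the image features X, followed by an attention pass from X to (K, V) that modulates X. The function names, the plain scaled dot-product attention, and the gamma/beta projections are illustrative assumptions for this sketch, not the repository's exact implementation.

```python
# Hypothetical sketch of the duplex-attention update; not the repo's code.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention: each query becomes a
    # softmax-weighted mean of the values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def duplex_attention(X, K, V, gamma, beta):
    # X: (H*W, d) image features; K, V: (m, d) keys (centroids) and values (latents).
    # Step 1 (X -> Y): update the centroids K as attention-weighted means of X,
    # i.e. one soft k-means step. Note that V is left untouched.
    K_new = attention(K, X, X)
    # Step 2 (Y -> X): attend from X to (K_new, V) and use the result to
    # modulate X, in the spirit of the paper's u^d update rule.
    A = attention(X, K_new, V)
    X_new = (A @ gamma) * X + (A @ beta)
    return X_new, K_new

# Toy shapes: a 16x16 feature grid, 8 latents, d = 32.
rng = np.random.default_rng(0)
d = 32
X = rng.standard_normal((256, d))
K = rng.standard_normal((8, d))
V = rng.standard_normal((8, d))
gamma = rng.standard_normal((d, d)) * 0.01
beta = rng.standard_normal((d, d)) * 0.01
X, K = duplex_attention(X, K, V, gamma, beta)
```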

dorarad pinned this issue Apr 26, 2021
@AndrewChiyz (Author)

Thanks a lot for your detailed reply! Now I understand the core idea of the duplex attention part.

Thank you! :)


07hyx06 commented May 11, 2021

So the explicit form of duplex attention is:

K = Attention(K, X, X)  # or LayerNorm(K + Attention(K, X, X))
X = gamma(Attention(X, K, Q)) * w(X) + beta(Attention(X, K, Q))

where Y = (K, Q).

Am I right?
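
For reference, the explicit form above can be written out as a short runnable sketch. This uses PyTorch's standard nn.MultiheadAttention in place of the paper's bipartite attention, and treats gamma, beta, and w as plain linear layers; all of these are assumptions made for illustration, not the repository's actual modules.

```python
# Hypothetical sketch of the explicit duplex-attention form quoted above.
import torch
import torch.nn as nn

d, m, n = 64, 16, 256  # latent dim, number of latents (m), number of image features (n)
attn_xy = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # X -> Y (updates K)
attn_yx = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # Y -> X (updates X)
norm = nn.LayerNorm(d)
gamma, beta, w = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

X = torch.randn(1, n, d)  # image features
K = torch.randn(1, m, d)  # keys / centroids
Q = torch.randn(1, m, d)  # values ("Q" in the comment above, V in the paper); never updated

# K = LayerNorm(K + Attention(K, X, X))
K = norm(K + attn_xy(K, X, X)[0])
# X = gamma(Attention(X, K, Q)) * w(X) + beta(Attention(X, K, Q))
A = attn_yx(X, K, Q)[0]
X = gamma(A) * w(X) + beta(A)
```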

@nicolas-dufour

Hey, I wanted to ask: in the paper you say that you compute K = a(Y, X) and then X = u^d(X, Y), so if I understand correctly, V is never updated? Thanks


subminu commented Dec 29, 2021

Thanks for this issue and the answer; it helped me understand how duplex attention works. However, according to your answer there is still a typo in the new paper version (v3): Y = (K^{n \times d}, V^{n \times d}) should use m, not n. This made me hesitate over the notation n used in the simplex section (3.1.1) when I first read the paper.


dorarad commented Feb 3, 2022

Hi all!
@07hyx06 Yep, that's correct! We first find the centroids by casting attention over the image features (X) and then update the features based on the centroids (K).
@nicolas-dufour That's right, the values are not iteratively updated, only the centroids and the image features!
@subminu Thanks so much for pointing that out! I'll update the paper with that fix!
