
About the Duplex attention #10

Closed
AndrewChiyz opened this issue Apr 13, 2021 · 7 comments

Comments


AndrewChiyz commented Apr 13, 2021

Hi, thanks for sharing the code!

I have a few questions about Section 3.1.2. Duplex attention.

  1. I am confused by the notation in this section. For example, it says "Y = (K^{P\times d}, V^{P\times d}), where the values store the content of the Y variables (e.g. the randomly sampled latents for the case of GAN)". Does this mean that V^{P\times d} is sampled from the original variable Y? And how is P set in your code?

  2. "keys track the centroids of the attention-based assignments from X to Y, which can be computed as K=a_b(Y, X)", does it mean K is calculated by using the self-attention module but with (Y, X) as input? If so, how to understand “the keys track the centroid of the attention-based assignments from X to Y”? BTW, how to get the centroids?

  3. For the update rule in duplex attention, what does the a() function mean? Does it denote a self-attention module like a_b() in Section 3.1.1, with X as queries, K as keys, and V as values? If so, since K is itself calculated from another self-attention module as mentioned in question 2, the output of a_b(Y, X) would be treated as the keys, so the update rule would contain two self-attention operations. Is that right? Is that why it is called 'duplex' attention?

  4. However, I suspect I may be wrong after reading the last paragraph of this section: "to support bidirectional interaction between elements, we can chain two reciprocal simplex attentions from X to Y and from Y to X, obtaining the duplex attention". So does it mean that we first compute Y with a simplex attention module u^a(Y, X), and then use this Y as input to u^d(X, Y) to update X? Would the duplex attention module then contain three self-attention operations?

Thanks a lot! :)


dorarad commented Apr 21, 2021

Hi! :) So sorry for the long delay in my response. I hope to get back to you within about two days at most, and will then go over all open issues!


dorarad commented Apr 26, 2021

Hi, so sorry for the delay in my response!

Several people indeed indicated that the notation regarding the key-value description in the paper is a bit confusing, and I plan to upload a new version of the paper with that aspect fixed by tomorrow.

  1. P is a typo; it should be m, the number of latent variables in the transformer. Y = (K, V) simply means that instead of associating just one vector with each latent variable, we now associate two separate vectors with it, a key and a value, where the key tracks the mean image assignments and the value is the standard latent variable (a normally sampled vector that is then passed through g_mapping).
    To give an example: intuitively we want different latent variables to be associated with semantic regions or objects in the image, so if a latent variable y = (k, v) is responsible for generating a blue sphere, then k will track the location of the sphere segment within the image, while v will contain information about how to "paint/generate" this segment, e.g. the "blueness" of the sphere to be created, its material, or any other property of it.

  2. We get the centroids by computing the bipartite attention between the set of keys K (initialized to some trainable parameters) and the elements in X (the HxW image features). Think about how the standard k-means algorithm works: you have a set of elements X and you want to assign them to centroids:

  • You initialize k centroids to some vectors, then compute a distance/similarity between each element and each centroid; passing it through a softmax gives the extent to which each element should be assigned to each centroid.
  • Then you need to update each centroid to be the mean of the X elements assigned to it. This can be done by computing a weighted mean of the assigned X elements and updating the K vectors accordingly.
    If you think about it carefully, what I described is exactly the computation of the bipartite transformer between the sets K and X! (X is used to update the Ks; to see that, read through the 3.1.1 equations.) A minimal code sketch of this two-step update follows this reply.

  3. Yes, a is the bipartite attention (not self-attention) module as defined in Section 3.1.1. The output of a_b(Y, X) is used as the keys, which are in turn used to update X in the opposite direction. That gives us two attention modules in opposite directions overall, which is why we call it duplex attention!

  4. No, duplex attention contains two attention modules: from X to K (which is part of Y), and then from Y back to X. I think what might cause the confusion is that I referred to computing attention from X to Y and from X to K interchangeably; I will fix that in the new version of the paper. If the answer to this question is still unclear, please let me know.

Hope it helps, and let me know if you have any further questions! :)
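
To make the two-step update described in the reply above concrete, here is a minimal, self-contained NumPy sketch of duplex attention as explained in this thread: a soft k-means step that updates the centroids K from the image features X, followed by an attention pass from X to (K, V) that modulates X. The function names, the plain scaled dot-product attention, and the gamma/beta projections are illustrative assumptions for this sketch, not the repository's exact implementation.

```python
# Hypothetical sketch of the duplex-attention update; not the repo's code.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention: each query becomes a
    # softmax-weighted mean of the values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def duplex_attention(X, K, V, gamma, beta):
    # X: (H*W, d) image features; K, V: (m, d) keys (centroids) and values (latents).
    # Step 1 (X -> Y): update the centroids K as attention-weighted means of X,
    # i.e. one soft k-means step. Note that V is left untouched.
    K_new = attention(K, X, X)
    # Step 2 (Y -> X): attend from X to (K_new, V) and use the result to
    # modulate X, in the spirit of the paper's u^d update rule.
    A = attention(X, K_new, V)
    X_new = (A @ gamma) * X + (A @ beta)
    return X_new, K_new

# Toy shapes: a 16x16 feature grid, 8 latents, d = 32.
rng = np.random.default_rng(0)
d = 32
X = rng.standard_normal((256, d))
K = rng.standard_normal((8, d))
V = rng.standard_normal((8, d))
gamma = rng.standard_normal((d, d)) * 0.01
beta = rng.standard_normal((d, d)) * 0.01
X, K = duplex_attention(X, K, V, gamma, beta)
```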

dorarad pinned this issue Apr 26, 2021
@AndrewChiyz (Author)

Thanks a lot for your detailed reply! Now I understand the core idea of the duplex attention part.

Thank you! :)


07hyx06 commented May 11, 2021

So the explicit form of duplex attention is:

K = Attention(K, X, X)  # or LayerNorm(K + Attention(K, X, X))
X = gamma(Attention(X, K, Q)) * w(X) + beta(Attention(X, K, Q))

where Y = (K, Q).

Am I right?
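
For reference, the explicit form above can be written out as a short runnable sketch. This uses PyTorch's standard nn.MultiheadAttention in place of the paper's bipartite attention, and treats gamma, beta, and w as plain linear layers; all of these are assumptions made for illustration, not the repository's actual modules.

```python
# Hypothetical sketch of the explicit duplex-attention form quoted above.
import torch
import torch.nn as nn

d, m, n = 64, 16, 256  # latent dim, number of latents (m), number of image features (n)
attn_xy = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # X -> Y (updates K)
attn_yx = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # Y -> X (updates X)
norm = nn.LayerNorm(d)
gamma, beta, w = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

X = torch.randn(1, n, d)  # image features
K = torch.randn(1, m, d)  # keys / centroids
Q = torch.randn(1, m, d)  # values ("Q" in the comment above, V in the paper); never updated

# K = LayerNorm(K + Attention(K, X, X))
K = norm(K + attn_xy(K, X, X)[0])
# X = gamma(Attention(X, K, Q)) * w(X) + beta(Attention(X, K, Q))
A = attn_yx(X, K, Q)[0]
X = gamma(A) * w(X) + beta(A)
```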

@nicolas-dufour

Hey, I wanted to ask: in the paper you say that you compute K = a(Y, X) and then X = u^d(X, Y), so if I understand correctly, V is never updated? Thanks


subminu commented Dec 29, 2021

Thanks for this issue and the answer; it helped me understand how duplex attention works. However, according to your answer there is still a typo in the new paper version (v3): Y = (K^{n \times d}, V^{n \times d}) should use m, not n. This made me hesitate over the notation n used in the simplex section (3.1.1) when I first read the paper.


dorarad commented Feb 3, 2022

Hi all!
@07hyx06 Yep, that's correct! We first find the centroids by casting attention over the image features (X) and then update the features based on the centroids (K).
@nicolas-dufour That's right, the values are not iteratively updated, only the centroids and the image features!
@subminu Thanks so much for pointing that out! I'll update the paper with that fix!
