Thank you for your idea and the repo. Since the box embedding and w_g stay the same across the stacked multi-head attention layers and do not depend on k, q, v, would it be proper to move the box-embedding step to the beginning of the encoder, so that the boxes are not re-embedded in every EncoderLayer? I have tried this and found it reduces XE training time from 22h to 18h (on a GTX 1080Ti) without obvious performance degradation (CIDEr 1.1495 → 1.1485).
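For concreteness, here is a minimal sketch of the refactor I mean, using hypothetical module names (`Encoder`, `EncoderLayer`, `box_embed`), not the repository's actual code:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch: the relative-geometry embedding is computed once, up front,
    and reused by every layer instead of being recomputed per EncoderLayer."""
    def __init__(self, layers, box_embed):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # stack of EncoderLayer modules
        self.box_embed = box_embed            # Emb(lambda): boxes -> (B, N, N, d_g)

    def forward(self, x, boxes, mask=None):
        box_emb = self.box_embed(boxes)       # computed once; constant across layers
        for layer in self.layers:
            x = layer(x, box_emb, mask=mask)  # each layer reuses the same box_emb
        return x
```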
Equations (6) and (7) in the paper show that the box embedding Emb(\lambda) is indeed just a function of the bounding-box displacements, and is therefore constant across all the self-attention layers of the transformer encoder.
Therefore, as you say, the computation of Emb(\lambda) can be moved out of the self-attention layer.
However, as you can see in equation (7), the geometric weights w_g are a function of a learnable weight matrix W_G.
These learnable matrices are allowed to be different for different self-attention layers.
Therefore, the computation of w_g cannot be moved out of the self-attention layer.
Here is the computation of w_g in our code (notice the linear layer l()):
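A minimal sketch of that per-layer computation, with hypothetical names rather than the verbatim repository code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricAttentionWeights(nn.Module):
    """Sketch of the per-layer geometric weights: w_g = ReLU(l(Emb(lambda))),
    where l() is a learnable linear projection (the W_G of Eq. (7)) that is
    instantiated separately inside every self-attention layer."""
    def __init__(self, d_g, num_heads):
        super().__init__()
        # one learnable projection per head; these weights differ across layers,
        # which is why w_g cannot be precomputed outside the layers
        self.l = nn.ModuleList([nn.Linear(d_g, 1) for _ in range(num_heads)])

    def forward(self, box_emb):
        # box_emb: (B, N, N, d_g) -- the precomputed, layer-independent Emb(lambda)
        w_g = torch.cat([proj(box_emb) for proj in self.l], dim=-1)  # (B, N, N, heads)
        return F.relu(w_g).permute(0, 3, 1, 2)                       # (B, heads, N, N)
```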