Why is the representation at each position on the first layer of the query stream initialized with the same vector w?
I think this can cause the same problem that makes the standard LM parameterization fail.
For example, with the orders 1->2->3->4 and 1->2->4->3, when predicting the third position from the first two positions, we get the same representation on the second layer in both cases, because g and h are the same on the first layer.
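To make the concern concrete, here is the two-stream update as I read it from the paper (I omit the relative positional encoding terms in this sketch):

```latex
% Initialization (layer 0):
%   content stream: h_i^{(0)} = e(x_i)   (token embedding of x_i)
%   query stream:   g_i^{(0)} = w        (the same trainable vector at every position)
%
% Layer-m updates for the t-th target position z_t in order z:
g_{z_t}^{(m)} \leftarrow \operatorname{Attention}\!\left(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)}\right)
h_{z_t}^{(m)} \leftarrow \operatorname{Attention}\!\left(Q = h_{z_t}^{(m-1)},\; KV = h_{z_{\le t}}^{(m-1)}\right)
%
% So at the first layer the query is Q = w for every target position:
g_{z_3}^{(1)} \leftarrow \operatorname{Attention}\!\left(Q = w,\; KV = \{e(x_{z_1}),\, e(x_{z_2})\}\right)
% and for both 1->2->3->4 and 1->2->4->3 the keys/values {e(x_1), e(x_2)}
% are identical, which is where my concern comes from.
```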