Why is the representation at each position on the first layer of the query stream initialized with the same vector w?
I think this can cause the same problem that makes the standard LM parameterization fail.
For example, with the orders 1->2->3->4 and 1->2->4->3, when predicting the third position from the first two positions, we get the same representation on the second layer in both cases, because g and h are the same on the first layer.
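To make the concern concrete, here is the two-stream update as I read it from the paper (I omit the relative positional encoding terms in this sketch):

```latex
% Initialization (layer 0):
%   content stream: h_i^{(0)} = e(x_i)   (token embedding of x_i)
%   query stream:   g_i^{(0)} = w        (the same trainable vector at every position)
%
% Layer-m updates for the t-th target position z_t in order z:
g_{z_t}^{(m)} \leftarrow \operatorname{Attention}\!\left(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)}\right)
h_{z_t}^{(m)} \leftarrow \operatorname{Attention}\!\left(Q = h_{z_t}^{(m-1)},\; KV = h_{z_{\le t}}^{(m-1)}\right)
%
% So at the first layer the query is Q = w for every target position:
g_{z_3}^{(1)} \leftarrow \operatorname{Attention}\!\left(Q = w,\; KV = \{e(x_{z_1}),\, e(x_{z_2})\}\right)
% and for both 1->2->3->4 and 1->2->4->3 the keys/values {e(x_1), e(x_2)}
% are identical, which is where my concern comes from.
```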