In the `forward()` method of the `MultiHeadAttention` class in `assignment3/cs231n/transformer_layers.py`, the argument list provided by the setup code reads:

`attn_mask: Array of shape (T, S) where mask[i,j] == 0`

It should instead read:

`attn_mask: Array of shape (S, T) where mask[i,j] == 0`

If `attn_mask` really had shape (T, S), it would need to be transposed before masking, because the product of the query and key matrices has shape `(batch_size, num_heads, S, T)`. The masking code would then have to be

`query_key_product.masked_fill(torch.transpose(attn_mask, 0, 1) == 0, -np.inf)`

which does not reproduce the value given by `expected_masked_self_attn_output`. The output only matches the provided value if `attn_mask` is applied without the transpose, which contradicts the documented shape.
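To make the shape argument concrete, here is a minimal, self-contained sketch (illustrative tensor sizes, not the assignment code) of the two mask orientations against a `(batch_size, num_heads, S, T)` score tensor:

```python
import torch

N, H, S, T, D = 2, 3, 4, 5, 8   # batch, heads, source len, target len, head dim

q = torch.randn(N, H, S, D)     # one query per source position
k = torch.randn(N, H, T, D)     # one key per target position

# The query-key product has shape (N, H, S, T).
query_key_product = torch.matmul(q, k.transpose(-2, -1)) / D ** 0.5
assert query_key_product.shape == (N, H, S, T)

# A mask of shape (S, T) -- the shape this issue proposes -- broadcasts
# directly against the scores:
mask_st = torch.randint(0, 2, (S, T))
masked = query_key_product.masked_fill(mask_st == 0, float("-inf"))

# A mask of shape (T, S) -- the shape the docstring states -- must be
# transposed first; applying it as-is fails whenever S != T:
mask_ts = torch.randint(0, 2, (T, S))
masked = query_key_product.masked_fill(mask_ts.t() == 0, float("-inf"))
```

Note that `expected_masked_self_attn_output` comes from a self-attention check, where S and T are presumably equal, so both orientations have the same shape there and only the expected values reveal which orientation the setup code actually assumes.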