```python
rand = torch.randn(inp.shape, device = x.device)  # random array of normally distributed numbers, N(0, 1)
rand[:, 0] = -torch.finfo(rand.dtype).max  # first <bos> token is made unmaskable; it will have the smallest value for topk
num_mask = min(int(seq * self.mask_prob), seq - 1)  # masking each token with probability mask_prob == just randomly choosing a mask_prob fraction of the positions, AND it should never exceed (seq - 1) tokens
indices = rand.topk(num_mask, dim = -1).indices  # the top-k of the random numbers are chosen to be masked (so, shouldn't this be a uniform distribution according to the paper?)
mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool()  # boolean mask: True where the token is kept
```
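For anyone following along, here is a self-contained version of that snippet (the function name and wrapper are mine, not from the repo) that can be run and inspected directly:

```python
import torch

def keep_mask(inp, mask_prob):
    """Sketch of the masking logic above: returns a boolean mask with
    True = token kept, False = token masked out. `inp` is (batch, seq)
    token ids; assumes 0 <= mask_prob < 1."""
    batch, seq = inp.shape
    rand = torch.randn((batch, seq), device = inp.device)  # iid N(0, 1) scores
    rand[:, 0] = -torch.finfo(rand.dtype).max              # <bos> can never land in the topk
    num_mask = min(int(seq * mask_prob), seq - 1)          # fixed count per row, capped
    indices = rand.topk(num_mask, dim = -1).indices        # positions selected for masking
    mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool()
    return mask

inp = torch.zeros(2, 10, dtype = torch.long)
mask = keep_mask(inp, 0.3)
# every row keeps <bos> and masks exactly int(10 * 0.3) = 3 positions
```

Note that because the count is `num_mask = int(seq * mask_prob)` per row, every row masks exactly the same number of tokens; only *which* positions are masked is random.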
I have 2 questions:
(1) Will there ever be a case where `int(seq * self.mask_prob)` is bigger than `seq - 1` (i.e. is the `min` ever needed), if we have already asserted that `mask_prob` is always `< 1.` earlier in the code?
(2) If we are masking with a probability value, doesn't that mean the model might sometimes get to see more than a `(1 - mask_prob)` fraction of the tokens? But here we enforce the exact ratio every time. And does using a normal vs. a uniform distribution make any big difference?
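Both questions can be sanity-checked empirically. The snippet below (my own check, not from the repo) verifies that with `mask_prob < 1` the count never exceeds `seq - 1`, and that `topk` over iid continuous noise selects a uniformly random subset of positions, so normal vs. uniform noise should make no statistical difference. (For simplicity this ignores the `<bos>` exclusion from the real code.)

```python
import torch

# (1) seq * mask_prob < seq whenever mask_prob < 1,
#     so int(seq * mask_prob) <= seq - 1 always holds.
for seq in range(1, 200):
    for mask_prob in (0.1, 0.5, 0.9, 0.999):
        assert int(seq * mask_prob) <= seq - 1

# (2) topk over iid continuous noise picks each size-k subset with
#     equal probability, regardless of the noise distribution.
torch.manual_seed(0)
seq, num_mask, trials = 8, 3, 20000
counts = {"normal": torch.zeros(seq), "uniform": torch.zeros(seq)}
for _ in range(trials):
    counts["normal"][torch.randn(seq).topk(num_mask).indices] += 1
    counts["uniform"][torch.rand(seq).topk(num_mask).indices] += 1

for name, c in counts.items():
    # each position is selected roughly num_mask / seq = 0.375 of the time
    print(name, (c / trials).tolist())
```

So the `min(..., seq - 1)` looks like a defensive cap rather than a necessary one, and the choice of normal vs. uniform noise only matters in that the scores must be continuous and iid across positions.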
Thanks!
From the paper
(screenshot of the relevant masking passage from the paper)
`x-transformers/x_transformers/autoregressive_wrapper.py`, lines 274 to 280 at commit `90cef69`
I am still trying to understand the code.