The code I'm using is in the file "one_file_ref".
I was trying to apply the Mistral Transformer to other, non-text tabular data. I initialised "positions" as torch.arange(1, num_of_most_instances), where "num_of_most_instances" is the number of tokens in the longest sequence.
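Roughly, the setup looks like this (a minimal sketch only; the value of num_of_most_instances and the commented-out forward call are placeholders for my own data and training loop, not the exact code):

```python
import torch

# Sketch of how "positions" is built (illustrative value, not the real data).
num_of_most_instances = 128  # number of tokens in the longest sequence

# The same positions tensor is reused for every batch.
# Note: torch.arange(1, n) yields n - 1 values (1, 2, ..., n - 1).
positions = torch.arange(1, num_of_most_instances)

# logits = model(batch_tokens, positions)  # forward call as in one_file_ref
```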
However, I observed that each time I call loss.backward() and move on to the next batch, about 30 MB of GPU memory is not released, so after 1000 steps it takes up roughly 30 GB of GPU memory.
I also found that, with my initialised "positions", execution always went through line 131 and never entered the "else" branch.
Is there any mistake in my usage of "positions"? The issue goes away after I comment out all the code related to self.cache, but I'm wondering whether that affects the attention mechanism.
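For reference, the workaround I'm considering (instead of commenting out the cache entirely) is to detach the cached tensors after each optimisation step so they stop holding references to the autograd graph. This is only a sketch under the assumption that the caches are tensor attributes named cache_k / cache_v on the attention modules; the actual attribute names in one_file_ref may differ:

```python
import torch

def detach_kv_caches(model: torch.nn.Module) -> None:
    """Detach cached key/value tensors so they no longer reference the autograd graph.

    Assumption: the caches are stored as tensor attributes named "cache_k" / "cache_v"
    on the attention modules; adjust the names if one_file_ref uses different ones.
    """
    for module in model.modules():
        for name in ("cache_k", "cache_v"):
            cache = getattr(module, name, None)
            if isinstance(cache, torch.Tensor):
                setattr(module, name, cache.detach())

# Per training step:
# loss.backward()
# optimizer.step()
# optimizer.zero_grad()
# detach_kv_caches(model)  # drop graph references held by the cache
```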