unsupervised learning - TSDAE #894
Comments
Looks like some issue with CUDA. I don't know how to fix it.
Hi ReySadeghi, could you please run on CPU and see whether there is still a problem?
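For reference, a minimal sketch of how a run can be forced onto the CPU for this kind of check. The model construction below mirrors the script discussed later in this thread; the device argument of SentenceTransformer and the CUDA_VISIBLE_DEVICES variable are standard, everything else is illustrative:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide all GPUs before torch is imported
from sentence_transformers import SentenceTransformer, models
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
# Passing device='cpu' also works even when GPUs are visible.
model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device='cpu')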
Hi, in one case I tried I got this error: and in the other cases that I tried: RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED
Could you please paste the whole training script here and also the whole log?
training script:

from sentence_transformers import SentenceTransformer, LoggingHandler
import nltk

vocab=[]
vocab=vocab[:10000]

model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model.tokenizer.add_tokens(vocab)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)

train_sentences=[]
train_sentences=train_sentences[:2000000]

train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
model.fit(
..................................................

the error:
lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
Does it work when you use bert-base-uncased? Also check that you have a recent version of PyTorch and transformers.
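A quick way to print the installed versions (a minimal sketch; these are the standard version attributes of the three packages):

import torch
import transformers
import sentence_transformers
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('sentence-transformers:', sentence_transformers.__version__)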
I edited it, actually the model name is 'HooshvareLab/bert-fa-base-uncased'.
Thanks for reporting this issue!
Thanks. Please inform me when the bug is fixed.
Hi, ReySadeghi. The bug has been fixed since commit 022b2dd, so please git clone the latest version and install it from source.
@kwang2049
Traceback (most recent call last):
...
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [126,0,0] Assertion ...
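For what it's worth, that indexSelectLargeIndex assertion usually fires when an embedding lookup receives an index outside the table, e.g. IDs of newly added tokens while the embedding matrix was never resized. A minimal sketch of such a check, assuming word_embedding_model is built as in the scripts in this thread:

vocab_size = len(word_embedding_model.tokenizer)
embedding_rows = word_embedding_model.auto_model.get_input_embeddings().weight.shape[0]
print('tokenizer size:', vocab_size, 'embedding rows:', embedding_rows)
if vocab_size > embedding_rows:
    # Token IDs >= embedding_rows would trigger the device-side assert on the GPU.
    word_embedding_model.auto_model.resize_token_embeddings(vocab_size)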
Are you using the same script? Please try the code below:

from sentence_transformers import SentenceTransformer
from sentence_transformers import models, datasets, losses
from torch.utils.data import DataLoader
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word]
print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
train_sentences=[
'A sentence containing <new_word_1> and <new_word_2>.',
'A sentence containing only <new_word_2>.',
'A sentence containing <سلامسلام>',
f'A sentence containing {existing_word}'
]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
weight_decay=0,
scheduler='constantlr',
optimizer_params={'lr': 3e-5},
show_progress_bar=True
)

This works fine on my server. If this does not work on your side, then I think it is either because you have the wrong version of the SBERT repo (the test above passes with sentence-transformers==1.1.1) or because of a CUDA problem. And if this also works on your side, then I think it is related to one of the new words/tokens, and you can do this to locate it: iterate over all the new words, create a sentence containing each of them, and fit the TSDAE model on each one. The run may throw an exception at a certain point; if that happens, please tell us which word it is.
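A rough sketch of that per-token isolation loop, assuming vocab holds the new words and model, model_name, datasets, losses and DataLoader are set up as in the snippet above (the sentence template is only illustrative):

for word in vocab:
    probe_sentences = ['A sentence containing ' + word]
    probe_dataset = datasets.DenoisingAutoEncoderDataset(probe_sentences)
    probe_dataloader = DataLoader(probe_dataset, batch_size=1, shuffle=False)
    probe_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
    try:
        model.fit(train_objectives=[(probe_dataloader, probe_loss)], epochs=1, show_progress_bar=False)
    except Exception as exc:
        # The word that triggers the failure is the one to inspect.
        print('Failed on token:', repr(word), '->', exc)
        break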
Yes, I used the latest version of SBERT and used the same script, but I still got the error! I got this warning too, could this cause the problem? /lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
Could you please run the code snippet mentioned above?
Yeah, it's solved.
@nreimers |
@kwang2049 @nreimers
AssertionError: Non-consecutive added token '#سلام' found. Should have index 100005 but has index 100006 in saved vocabulary.
@nreimers |
Train loss is not computed & plotted during training |
Hi @ReySadeghi, I cannot reproduce it: I found it can successfully load the SBERT checkpoint with added tokens. Before a more detailed conversation, could you please run this check (to see whether the assertion error still appears without TSDAE):

from sentence_transformers import SentenceTransformer
from sentence_transformers import models
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word, '<new_subword111>', '<new_subword222>']
print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
train_sentences=[
'A sentence containing <new_word_1> and <new_word_2>.',
'A sentence containing only <new_word_2>.',
'A sentence containing <سلامسلام>',
f'A sentence containing {existing_word}',
'A sentence containing <new_subword111>xxx, my<new_subword222>yyyu'
]
model.save('sbert_tokens_added')
model = SentenceTransformer('sbert_tokens_added')
print([model[0].tokenizer.tokenize(sentence) for sentence in train_sentences])

If running this new snippet also reports the error, I think it might be related to your transformers version. And if this works well, you can change the vocab above to your own added tokens and check again.
I tried this and it was OK, but actually I think the problem was due to some tokens that weren't valid UTF-8; when I removed them, the problem was solved.
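For completeness, a small sketch of that kind of filtering, assuming the new tokens come from a plain-text file (the file name vocab.txt is hypothetical); lines that do not decode as UTF-8 are skipped before the tokens are added:

clean_vocab = []
with open('vocab.txt', 'rb') as f:  # hypothetical vocabulary file, one token per line
    for raw_line in f:
        try:
            token = raw_line.decode('utf-8').strip()
        except UnicodeDecodeError:
            continue  # drop tokens that are not valid UTF-8
        if token:
            clean_vocab.append(token)
word_embedding_model.tokenizer.add_tokens(clean_vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))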
Hi,
I used the TSDAE method to pretrain a BERT model on a corpus of sentences and I got this error:
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
I then ran CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM] to trace the error and got this:
RuntimeError: CUDA error: device-side assert triggered
Any help?
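As an aside, the same CUDA_LAUNCH_BLOCKING flag can also be set from inside the script instead of on the command line (a minimal sketch; it has to run before anything touches CUDA):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous kernel launches, so the failing op is reported where it happens
import torch  # import torch only after the variable is set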