TrainingWrapper does not support line breaks #23
Notebook
When training RETRO with the standard methods, TrainingWrapper does not preserve line breaks in the dataset. This can hurt many NLP tasks.
Input *.txt:
Model output after training:
Comments
@0x7o ohh interesting, this must be some issue with the default BERT tokenizer, i'll take a look next week
Judging by the error below, the script handles the file as a whole, not in batches
@lucidrains, the tokenizer distorts the text. I think the problem is the difference between bert-cased and bert-uncased.
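As a hedged illustration of that distortion (assuming the Hugging Face `transformers` package; this snippet is not part of RETRO-pytorch), both the cased and uncased BERT tokenizers treat `\n` as ordinary whitespace, so the line break never survives a round trip:

```python
# Minimal repro sketch: both BERT tokenizers normalize "\n" to plain whitespace,
# so the line break is lost; the uncased variant additionally lowercases the text.
from transformers import BertTokenizerFast

text = "First line\nSecond line"

for name in ("bert-base-cased", "bert-base-uncased"):
    tok = BertTokenizerFast.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(name, "->", repr(tok.decode(ids)))
    # bert-base-cased   -> 'First line Second line'
    # bert-base-uncased -> 'first line second line'
```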
@0x7o i think the solution may be to add the newline token https://discuss.huggingface.co/t/feat-tokenizers-how-to-make-models-aware-of-structuring-linebreaks/3711 , however, without training BERT from scratch with the newline token included, results may suffer. i can't think of a solution besides finding a model out there that doesn't do away with newlines (treating them as whitespace)
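To make that suggestion concrete, here is a hedged sketch of registering an explicit newline token with a Hugging Face tokenizer and resizing the embedding table. The placeholder token name `[NL]` and the preprocessing step are assumptions for illustration, not part of RETRO-pytorch, and as noted above the new embedding is untrained, so the model would still need fine-tuning to use it well.

```python
# Sketch of the suggested workaround, assuming Hugging Face `transformers`.
# "[NL]" is a made-up placeholder token; BERT has no pretrained embedding for it.
from transformers import BertModel, BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

# register a dedicated token to stand in for "\n" and grow the embedding matrix
tok.add_special_tokens({"additional_special_tokens": ["[NL]"]})
model.resize_token_embeddings(len(tok))

# replace raw newlines with the new token before encoding
text = "First line\nSecond line".replace("\n", " [NL] ")
ids = tok.encode(text, add_special_tokens=False)
print(tok.decode(ids))  # 'First line [NL] Second line'
```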
the code will also have to be modularized to accept different models and their encoders, as a lot of the logic is specific to BERT-base
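Purely as an illustration of that modularization idea (none of these class or method names exist in RETRO-pytorch; they are hypothetical), the BERT-specific logic could sit behind a small encoder interface that the training wrapper depends on:

```python
# Hypothetical sketch of an encoder abstraction; names invented for illustration.
from abc import ABC, abstractmethod
from typing import List

import torch


class RetrievalEncoder(ABC):
    """Wraps a frozen model that tokenizes text and embeds chunks for retrieval."""

    @abstractmethod
    def tokenize(self, texts: List[str]) -> torch.Tensor:
        """Return a padded tensor of token ids."""

    @abstractmethod
    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Return one embedding vector per input sequence."""


class BertRetrievalEncoder(RetrievalEncoder):
    """Example implementation backed by Hugging Face BERT."""

    def __init__(self, name: str = "bert-base-cased"):
        from transformers import BertModel, BertTokenizerFast
        self.tok = BertTokenizerFast.from_pretrained(name)
        self.model = BertModel.from_pretrained(name).eval()

    def tokenize(self, texts: List[str]) -> torch.Tensor:
        return self.tok(texts, padding=True, return_tensors="pt")["input_ids"]

    @torch.no_grad()
    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # mean-pool the last hidden state as a simple chunk embedding
        return self.model(token_ids).last_hidden_state.mean(dim=1)
```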
@0x7o The notebook is not accessible (it says the link doesn't exist). Could you please share a working link? It's very important for me. Thanks
@0x7o Would you share your notebook from above? If not, that's cool, or if it's long gone, I get that. Thank you.
I appreciate your interest immensely, but I no longer have access to this notebook as it has been irretrievably deleted.