restrict sentence length #10
See UKPLab/sentence-transformers#364 (comment). Answer: max 128 word pieces (WordPiece; see Ch. 4.1, p. 7, of https://arxiv.org/pdf/1609.08144v2.pdf). WordPiece splits one word into N >= 1 wordpiece tokens. A German word has approx. 6.3 letters. (Any thoughts?)
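To see the word-piece limit in practice, here is a minimal sketch (not from the thread) that counts WordPiece tokens with the Hugging Face `transformers` tokenizer. The `bert-base-german-cased` checkpoint is an assumption for illustration, not necessarily the model used in this project:

```python
# Minimal sketch: count WordPiece tokens for a German sentence.
# Assumption: the public "bert-base-german-cased" checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

sentence = "Die Donaudampfschifffahrtsgesellschaft wurde 1829 gegründet."
tokens = tokenizer.tokenize(sentence)
print(tokens)       # long words become several '##'-prefixed pieces
print(len(tokens))  # this count (plus [CLS]/[SEP]) is what the 128-piece limit applies to
```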
Hi, you can look up the subword fertility (the average number of subwords a word is split into) of the respective tokenizer model. Using the fertility score you can then estimate how many words you can process (on average): `max_words ≈ 128 / fertility`.
From this you can use the approximate average number of characters per word to estimate the total number of characters that fit: `max_chars ≈ max_words × avg_chars_per_word`.
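A hedged sketch of that estimate, reusing the assumed `bert-base-german-cased` tokenizer from above; `sample_sentences` is a made-up stand-in for representative data:

```python
# Sketch: estimate fertility and the resulting word/character budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
sample_sentences = ["Das ist ein Beispielsatz.", "Noch ein kurzer Satz."]  # hypothetical sample

words = [w for s in sample_sentences for w in s.split()]
pieces = sum(len(tokenizer.tokenize(w)) for w in words)
fertility = pieces / len(words)                      # avg subwords per word

max_pieces = 128                                     # model limit from the comment above
max_words = max_pieces / fertility                   # avg words that fit
avg_chars_per_word = sum(len(w) for w in words) / len(words)
max_chars = max_words * avg_chars_per_word           # avg characters that fit
print(f"fertility={fertility:.2f}, ~{max_words:.0f} words, ~{max_chars:.0f} chars")
```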
Here is a larger study for mBERT: … If you want to compute sentence similarity scores, have you considered just using the …
Best,
Hi Ji-Ung, I did a rough calculation for 150 sentences, and on average the number of characters per word piece (for German, for this sample) seems to be around 3.84. So the limit could be around 128 × 3.84 = 491.52 characters. Also, we could consider removing punctuation before encoding, since each punctuation character takes two word pieces:
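For example (a sketch under the same tokenizer assumption as above; the exact splits depend on the vocabulary):

```python
# Sketch: a trailing punctuation mark becomes its own token,
# so "Wort." costs two pieces where "Wort" costs one.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
print(tokenizer.tokenize("Haus"))   # e.g. ['Haus']      -> 1 piece
print(tokenizer.tokenize("Haus.")) # e.g. ['Haus', '.'] -> 2 pieces
```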
(And maybe also removing stop words?)
Best,
Luise
Hi Luise, in general I'd say keep the preprocessing for prediction as similar as possible to the preprocessing of the data the model was initially trained on (usually there is only minor or no preprocessing). Otherwise the model may exhibit random behavior during prediction. Especially with contextualized models such as BERT, it may be better to keep the stopwords in the text, as they seem to receive as much attention as non-stopwords (cf. [1]). Although removing stopwords does not really change the performance in that study, its task is information retrieval, where stopwords may not matter that much. In sentiment classification, for instance, you'd definitely want to keep stopwords such as 'nicht': dropping it turns 'Das Essen war nicht gut.' into 'Das Essen war gut.' and flips the sentiment.
Then again, for sentence similarity, stopwords may actually not be that important :D. I'd suggest sampling some more or less representative test data and just trying it out -- with neural networks, the data domain and task often have a rather high impact on the outcome.
Best,
Most BERT models only allow a restricted sentence length and will truncate longer sentences. Should the API check whether sentences are too long and inform the user when sentences are shortened?
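One possible shape for such a check (a sketch, not this project's actual API; the tokenizer, the limit, and the `check_lengths` helper are all assumptions for illustration):

```python
# Sketch: warn the caller when a sentence exceeds the model's
# token limit and would be silently truncated.
import warnings
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
MAX_PIECES = 128  # limit discussed above

def check_lengths(sentences):
    for s in sentences:
        n = len(tokenizer.tokenize(s)) + 2  # + [CLS] and [SEP]
        if n > MAX_PIECES:
            warnings.warn(
                f"Sentence with {n} word pieces exceeds {MAX_PIECES} "
                f"and will be truncated: {s[:50]}..."
            )

check_lengths(["Ein sehr langer Satz ...", "Ein kurzer Satz."])
```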