
restrict sentence length #10

Closed
knit-bee opened this issue Oct 29, 2021 · 4 comments · Fixed by #11
Labels
enhancement New feature or request

Comments

@knit-bee
Collaborator

Most BERT models only allow a restricted sentence length and will truncate longer sentences. Should the API check whether sentences are too long and inform the user when sentences are shortened?

@ulf1
Owner

ulf1 commented Oct 29, 2021

UKPLab/sentence-transformers#364 (comment)

Answer: max 128 word pieces

WordPiece. See Ch. 4.1 (p. 7) https://arxiv.org/pdf/1609.08144v2.pdf

WordPiece splits 1 word into N>=1 wordpiece tokens.
Example: An 8-letter word "Baumhaus" might become something like ["__Baum", "haus"], which still has 8 letters but N=2 word pieces (excluding the special characters added by WordPiece).

A German word has approx. 6.3 letters.
Assume 33% of words are split into N=2 word pieces (???).
Then 64 of the 128 word pieces are whole words, and the other 64 pieces come from 32 split words, i.e. about 96 words in total.
96 * 6.3 ≈ 605 characters could be the limit.
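
Instead of guessing, we could also count the word pieces per sentence with the tokenizer itself. A minimal sketch (assuming the Hugging Face transformers package; bert-base-german-cased is only a stand-in for whatever model we actually use):

```python
from transformers import AutoTokenizer

# model name is just an example; use the tokenizer of the actual model
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def exceeds_limit(sentence: str, max_pieces: int = 128) -> bool:
    """Return True if the sentence would be truncated at max_pieces word pieces."""
    pieces = tokenizer.tokenize(sentence)
    # +2 accounts for the special tokens ([CLS]/[SEP]) added around the sentence
    return len(pieces) + 2 > max_pieces

print(exceeds_limit("Ich sehe, wie viel Widerstand es in den entsprechenden Regionen gibt."))
```

The API could emit a warning whenever this returns True.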

(Any thoughts?)

@Wuhn

Wuhn commented Oct 29, 2021

Hi,

You can look up the subword fertility (the average number of subwords a word is split into) of the respective tokenizer model. Using the fertility score you can then estimate how many words you can process (on average):

maximum number of word pieces (128) / subword fertility = maximum number of words (on average)

From this, using the average number of characters per word, you can estimate the maximum number of total characters:

maximum number of words (on average) * average number of characters per word = maximum number of characters (on average)

Here is a larger study for mBERT:
https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html
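
As a rough sketch of that calculation (the model name and the sample sentences below are placeholders, not recommendations):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sample = [
    "Ich sehe, wie viel Widerstand es in den entsprechenden Regionen gibt.",
    "Der Film war nicht schlecht.",
]

words = [w for s in sample for w in s.split()]
pieces = [p for s in sample for p in tokenizer.tokenize(s)]

fertility = len(pieces) / len(words)                  # subwords per word
max_words = 128 / fertility                           # words that fit into 128 pieces
avg_chars = sum(len(w) for w in words) / len(words)   # characters per word
print(f"fertility ~ {fertility:.2f}, roughly {max_words * avg_chars:.0f} characters fit")
```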

If you want to compute sentence similarity scores, have you considered just using the [CLS] vector? It should hold the representation for the whole sentence.

Best,
Ji-Ung

@knit-bee
Collaborator Author

knit-bee commented Oct 29, 2021

Hi Ji-Ung,
thanks for your suggestion! I will check it out on Monday.

I did a rough calculation for 150 sentences, and on average the number of characters per word piece (for German, for this sample) seems to be around 3.84. So the limit could be around 128 * 3.84 = 491.52 characters.
Though I think whitespace is discarded by the tokenizer, so we can add some more characters to the maximum.

Also, we could consider removing punctuation before encoding, since each punctuation character takes two word pieces:

['<s>', '▁Ich', '▁sehe', '▁', ',', '▁wie', '▁viel', '▁Widerstand', '▁es', '▁in', '▁den', '▁entsprechenden', '▁Region', 'en', '▁gibt', '▁', ',', '▁vor', '▁allem', '▁von', '▁Partei', 'en', '▁', ',', '▁die', '▁hier', '▁im', '▁Deutschen', '▁Bundestag', '▁si', 'tzen', '▁', '.', '</s>']

(and maybe also removing stop words?)
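
For reference, the extra pieces spent on punctuation can be counted directly with the tokenizer (a sketch; xlm-roberta-base is only a stand-in for the SentencePiece-based model that produced the token list above):

```python
import string
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentence = "Ich sehe , wie viel Widerstand es in den entsprechenden Regionen gibt ."
stripped = "".join(c for c in sentence if c not in string.punctuation)

with_punct = tokenizer.tokenize(sentence)
without_punct = tokenizer.tokenize(stripped)

print(with_punct)
print(f"{len(with_punct) - len(without_punct)} word pieces spent on punctuation")
```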

Best, Luise

@Wuhn

Wuhn commented Oct 29, 2021

Hi Luise,

In general, I'd say keep the preprocessing for prediction as similar as possible to the preprocessing of the data the model was initially trained on (usually there is only minor or no preprocessing). Otherwise the model may behave erratically during prediction. Especially with contextualized models such as BERT, it may be better to keep the stopwords in the text, as they seem to receive as much attention as non-stopwords (cf. [1]). Although removing stopwords does not really change the performance in that study, its task is information retrieval, so stopwords may not matter that much there. For instance, in sentiment classification, you'd definitely want to keep stopwords such as 'nicht':

Der Film war nicht schlecht. -> neutral/positive
Der Film war schlecht. -> negative

Then again, for sentence similarity, stopwords may actually not be that important :D. I'd suggest sampling some more or less representative test data and just trying it out -- with neural networks, the data domain and task often have a rather high impact on the outcome.

Best,
Ji-Ung

[1] https://dl.acm.org/doi/10.1145/3397271.3401325

@ulf1 ulf1 added the enhancement New feature or request label Nov 9, 2021
@ulf1 ulf1 linked a pull request Nov 9, 2021 that will close this issue
@ulf1 ulf1 closed this as completed in #11 Nov 9, 2021