
restrict sentence length #10

Closed
knit-bee opened this issue Oct 29, 2021 · 4 comments · Fixed by #11
Labels
enhancement New feature or request

Comments

@knit-bee
Collaborator

Most BERT models only allow a restricted sentence length and will truncate longer sentences. Should the API check whether sentences are too long and inform the user when sentences are shortened?

@ulf1
Owner

ulf1 commented Oct 29, 2021

UKPLab/sentence-transformers#364 (comment)

Answer: max 128 word pieces

WordPiece. See Ch. 4.1 (p. 7) https://arxiv.org/pdf/1609.08144v2.pdf

WordPiece splits 1 word into N>=1 wordpiece tokens.
Example: An 8-letter word "Baumhaus" might become something like ["__Baum", "haus"], which still has 8 letters but N=2 word pieces (excluding the special characters added by WordPiece).

A German word has approx. 6.3 letters.
Assume 33% of words are split into N=2 word pieces (???).
Then 64 of the 128 word pieces are whole words, and the other 64 pieces come from 32 split words, i.e. about 96 words in total.
96 * 6.3 ≈ 605 characters could be the limit.
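
Instead of guessing, we could also count the word pieces per sentence with the tokenizer itself. A minimal sketch (assuming the Hugging Face transformers package; bert-base-german-cased is only a stand-in for whatever model we actually use):

```python
from transformers import AutoTokenizer

# model name is just an example; use the tokenizer of the actual model
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def exceeds_limit(sentence: str, max_pieces: int = 128) -> bool:
    """Return True if the sentence would be truncated at max_pieces word pieces."""
    pieces = tokenizer.tokenize(sentence)
    # +2 accounts for the special tokens ([CLS]/[SEP]) added around the sentence
    return len(pieces) + 2 > max_pieces

print(exceeds_limit("Ich sehe, wie viel Widerstand es in den entsprechenden Regionen gibt."))
```

The API could emit a warning whenever this returns True.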

(Any thoughts?)

@Wuhn

Wuhn commented Oct 29, 2021

Hi,

You can look up the subword fertility (the average number of subwords a word is split into) of the respective tokenizer model. Using the fertility score you can then estimate how many words you can process (on average):

maximum number of word pieces (128) / subword fertility = maximum number of words (on average)

From this, using the average number of characters per word, you can estimate the maximum number of total characters:

maximum number of words (on average) * average number of characters per word = maximum number of characters (on average)

Here is a larger study for mBERT:
https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html
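
As a rough sketch of that calculation (the model name and the sample sentences below are placeholders, not recommendations):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sample = [
    "Ich sehe, wie viel Widerstand es in den entsprechenden Regionen gibt.",
    "Der Film war nicht schlecht.",
]

words = [w for s in sample for w in s.split()]
pieces = [p for s in sample for p in tokenizer.tokenize(s)]

fertility = len(pieces) / len(words)                  # subwords per word
max_words = 128 / fertility                           # words that fit into 128 pieces
avg_chars = sum(len(w) for w in words) / len(words)   # characters per word
print(f"fertility ~ {fertility:.2f}, roughly {max_words * avg_chars:.0f} characters fit")
```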

If you want to compute sentence similarity scores, have you considered just using the [CLS] vector? It should hold the representation for the whole sentence.

Best,
Ji-Ung

@knit-bee
Collaborator Author

knit-bee commented Oct 29, 2021

Hi Ji-Ung,
thanks for your suggestion! I will check it out on Monday.

I did a rough calculation for 150 sentences, and on average the number of characters per word piece (for German, for this sample) seems to be around 3.84. So the limit could be around 128 * 3.84 = 491.52 characters.
Though I think whitespace is discarded by the tokenizer, so we can add some more characters to the maximum.

Also, we could consider removing punctuation before encoding, since each punctuation character takes two word pieces:

['<s>', '▁Ich', '▁sehe', '▁', ',', '▁wie', '▁viel', '▁Widerstand', '▁es', '▁in', '▁den', '▁entsprechenden', '▁Region', 'en', '▁gibt', '▁', ',', '▁vor', '▁allem', '▁von', '▁Partei', 'en', '▁', ',', '▁die', '▁hier', '▁im', '▁Deutschen', '▁Bundestag', '▁si', 'tzen', '▁', '.', '</s>']

(and maybe also removing stop words?)
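
For reference, the extra pieces spent on punctuation can be counted directly with the tokenizer (a sketch; xlm-roberta-base is only a stand-in for the SentencePiece-based model that produced the token list above):

```python
import string
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentence = "Ich sehe , wie viel Widerstand es in den entsprechenden Regionen gibt ."
stripped = "".join(c for c in sentence if c not in string.punctuation)

with_punct = tokenizer.tokenize(sentence)
without_punct = tokenizer.tokenize(stripped)

print(with_punct)
print(f"{len(with_punct) - len(without_punct)} word pieces spent on punctuation")
```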

Best, Luise

@Wuhn

Wuhn commented Oct 29, 2021

Hi Luise,

In general, I'd say keep the preprocessing for prediction as similar as possible to the preprocessing of the data the model was initially trained on (usually there is only minor or no preprocessing). Otherwise the model may behave erratically during prediction. Especially with contextualized models such as BERT, it may be better to keep the stopwords in the text, as they seem to receive as much attention as non-stopwords (cf. [1]). Although removing stopwords does not really change the performance in that study, its task is information retrieval, so stopwords may not matter that much there. For instance, in sentiment classification, you'd definitely want to keep stopwords such as 'nicht':

Der Film war nicht schlecht. -> neutral/positive
Der Film war schlecht. -> negative

Then again, for sentence similarity, stopwords may actually not be that important :D. I'd suggest sampling some more or less representative test data and just trying it out -- with neural networks, the data domain and task often have a rather high impact on the outcome.

Best,
Ji-Ung

[1] https://dl.acm.org/doi/10.1145/3397271.3401325

@ulf1 ulf1 added the enhancement New feature or request label Nov 9, 2021
@ulf1 ulf1 linked a pull request Nov 9, 2021 that will close this issue
@ulf1 ulf1 closed this as completed in #11 Nov 9, 2021