Do not truncate when tagging #97
Conversation
For BERT you need to truncate the input to a maximum length, which is 512 tokens (or less); there is no other choice, it will fail otherwise. In general, I would say for tagging we normally need to truncate up to the…
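The hard 512-token limit mentioned above can be enforced with a simple slice before tagging. This is a minimal sketch (the function name and the reservation of two positions for the `[CLS]` and `[SEP]` special tokens are assumptions, not DeLFT's actual code):

```python
MAX_SEQ_LENGTH = 512  # hard upper bound for standard BERT models

def truncate_tokens(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Truncate a token sequence so it fits in BERT's input window.

    Two positions are reserved for the special [CLS] and [SEP] tokens
    that BERT adds around the sequence.
    """
    budget = max_seq_length - 2
    return tokens[:budget]
```

Anything beyond the budget is silently dropped, which is exactly the labelling loss discussed later in this thread.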
OK, so then I understood it the opposite way :-) there is nothing to do in DeLFT then... we need to somehow read the max-length size and apply it in GROBID then?
I'm closing this pull request then.
Yes, I think there is actually no need for GROBID to know the… The error is at line 128 in DeLFTModel.java; we need a test on the size of…
OK, this is easily fixable. However, I'm wondering 🤔 What do you think about this workaround? Instead of ignoring what's beyond the max-length (it won't be annotated), what about splitting the sequence into chunks of…
You would need a sliding window (see #90) with enough redundant context to get a proper labelling. Otherwise the tagger will consider that you have two distinct sequences, and this can lead to terrible things in the second chunk :D Apart from what is mentioned in #90, there are quite a few papers on the subject, for instance on applying BERT to sequences of arbitrary size using RNNs, but I have not found the time to study them.
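The sliding-window idea above can be sketched as follows. This is a hypothetical helper, not code from DeLFT or #90; the `overlap` parameter is the "redundant context" shared between consecutive chunks so the tagger does not see each chunk as an unrelated sequence:

```python
def sliding_chunks(tokens, max_len=512, overlap=64):
    """Split a long token sequence into overlapping windows.

    Each chunk is at most max_len tokens; consecutive chunks share
    `overlap` tokens of redundant context. Reconciling the labels
    predicted in the overlapping regions is left to the caller.
    """
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return chunks
```

Note this only solves the splitting side; the harder part, as the comment says, is making the labelling consistent across chunk boundaries.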
Here is an example of output in GROBID (from the reference-segmenter):
Versus not truncating:
Maybe this example is not relevant or correct, I'm not sure, but in this case we lose part of the tagging...
Yes, the reference-segmenter is typically a model with very long input sequences to be labelled. So it is currently simply not a good idea at all to use a sequence labelling DL model for this task; there will always be unlabelled content. That is the reason I introduced in GROBID the possibility to mix CRF and DL sequence labelling "engines" by model, supporting a scenario where short-sequence tasks are handled by DL and the rest by CRF.
This PR avoids IndexOutOfBounds when running from Grobid.
UPDATE: for BERT the input is truncated, see `convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer)` in `delft/sequenceLabelling/models.py`, lines 574 and 576 at commit 6c4a718.
I'm not sure what the best way to fix it is... We could set `max_sequence_length` to a very high number, or to the length of the longest input, but that means we would need to pre-process all the inputs... I'm not sure that would be efficient enough.
My solution here would be to use `sys.maxsize`:
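To illustrate the proposed fix (this is a hedged sketch, not the actual patch; the `maybe_truncate` helper is hypothetical, though `max_sequence_length` is the real DeLFT parameter name): using `sys.maxsize` as the limit makes the truncation slice a no-op for any realistic input, so non-BERT models never lose tokens.

```python
import sys

def maybe_truncate(tokens, max_sequence_length=sys.maxsize):
    """Truncate only if the sequence exceeds the limit.

    Passing sys.maxsize (the default here) effectively disables
    truncation, since no real sequence approaches that length.
    """
    return tokens[:max_sequence_length]
```

The advantage over computing the longest input length is that no pre-pass over the data is needed.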