max_sequence_length not used(?) #44
Hi @kermitt2,
I was just considering whether we need sliding windows to avoid having to use a really large max_sequence_length. But then I realised that max_sequence_length doesn't actually seem to be used: it is passed to the DataGenerator, which doesn't seem to use it; instead it simply pads the batch to whatever the maximum length is within the batch.
Hello! Yes this is correct, thanks for pointing out this issue! At some early point I used this as a fixed maximum length; I kept it to have the possibility to force a maximum length to truncate the input sequence, as a safety limit. However, I have forgotten to add it!
As you can see, it's not related to a sliding window as we have in CRF. With CRF, we always have a contextual window centered on the current "target" token to be labelled, and the CRF template is used to determine the size of the window. With the DL approach, we have a prediction on the complete sequence, without a sliding window, and the whole sequence is involved when training the weights or producing an output. For very large input sequences like in the header model, this is of course an issue (the input could be more than 1000 tokens in the worst cases), but it is potentially also where it gets interesting, because the recurrent aspect of the RNN means backpropagation can potentially impact the complete sequence. It would indeed be interesting to compare the "traditional" global full-sequence network approach with a local sliding-window network, though I am not sure how to do it. It would require some review to see how it was approached by other people.
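To make the contrast concrete, here is a minimal sketch of how a local sliding-window tagger could be emulated on top of a model that labels short sequences. This is not DeLFT code; the tag_with_sliding_window helper, the window size, the stride, and the merge strategy (keep the prediction from the window where the token is most central) are all made up for illustration.

```python
from typing import Callable, List


def tag_with_sliding_window(
    tokens: List[str],
    tag_sequence: Callable[[List[str]], List[str]],
    window_size: int = 300,
    stride: int = 150,
) -> List[str]:
    """Tag a long token sequence by running the model on overlapping windows.

    `tag_sequence` is assumed to label a (short) token sequence; overlapping
    predictions are merged by keeping, for each token, the label from the
    window in which the token is most central.
    """
    if len(tokens) <= window_size:
        return tag_sequence(tokens)

    labels: List[str] = [None] * len(tokens)
    best_distance = [float("inf")] * len(tokens)

    for start in range(0, len(tokens), stride):
        window = tokens[start:start + window_size]
        window_labels = tag_sequence(window)
        center = start + len(window) / 2
        for offset, label in enumerate(window_labels):
            position = start + offset
            distance = abs(position - center)
            # keep the prediction from the window where this token is most central
            if distance < best_distance[position]:
                best_distance[position] = distance
                labels[position] = label
        if start + window_size >= len(tokens):
            break

    return labels
```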
[...]
I'm trying to understand the principle when the max_length is dynamically calculated. This is in the DataGenerator.
Another detail I note so as not to forget: the text classification part still uses a different name for this parameter.
Ah yes, sorry, I overlooked that. Indeed, it will be updated in the bert branch.
OK. No problem. 🙂 Is there an issue / a list of things to be done in the bert branch?
Fixed with the merging of the bert branch.
Just looking at it. I can see there is a change in delft.sequenceLabelling.DataGenerator for when tokenize is enabled, and there is also something specifically for the BERT_Sequence model. It would appear that otherwise it is not truncating already tokenized text, but it is possible that I am missing something?
In my copy of the DataGenerator I have now also applied the max_sequence_length to already tokenized text (see PR). This could probably be applied earlier on, e.g. when reading the input, but then one needs to be careful that it doesn't affect validation.
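As a rough sketch of the "truncate earlier, when reading the input" idea mentioned above (the truncate_sequences helper and the toy data are hypothetical, not part of DeLFT), truncation would only be applied to the training split so that validation keeps full-length sequences:

```python
def truncate_sequences(sequences, labels, max_sequence_length):
    """Truncate token and label sequences to max_sequence_length, keeping them aligned."""
    truncated_x = [tokens[:max_sequence_length] for tokens in sequences]
    truncated_y = [tags[:max_sequence_length] for tags in labels]
    return truncated_x, truncated_y


# toy example: only the training split would be truncated at load time,
# so validation / evaluation still sees the full sequences
x_train = [["A", "very", "long", "header", "sequence", "..."]]
y_train = [["B-title", "I-title", "I-title", "I-title", "I-title", "O"]]
x_train, y_train = truncate_sequences(x_train, y_train, max_sequence_length=4)
```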
I just reopened it since we are discussing it; otherwise it's hard to find back. What about rewriting it like this?

```python
# tokenize texts in self.x if not already done
max_length_x = 0
x_tokenized = []
for i in range(0, max_iter):
    if self.tokenize:
        tokens = tokenizeAndFilterSimple(sub_x[i])
    else:
        tokens = sub_x[i]
    if len(tokens) > self.max_sequence_length:
        max_length_x = self.max_sequence_length
        # truncation of sequence at max_sequence_length
        tokens = tokens[:self.max_sequence_length]
    elif len(tokens) > max_length_x:
        max_length_x = len(tokens)
    x_tokenized.append(tokens)
```
You will need to apply the same truncation to the labels as well. This is what I ended up with so far: https://github.com/elifesciences/sciencebeam-trainer-delft/blob/c31f97433243a2b0a66671c0dd3e652dcd306362/sciencebeam_trainer_delft/sequence_labelling/data_generator.py#L102-L118 Training is now significantly faster and there are no memory issues anymore, although I haven't evaluated it yet.
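For context on why the truncation helps, the padding step that consumes the computed maximum length looks roughly like this (a simplified, hypothetical sketch with placeholder values instead of real word embeddings, not the actual DataGenerator code):

```python
import numpy as np


def pad_batch(x_tokenized, max_length_x, embedding_size=300):
    """Pad a batch of tokenized (and possibly truncated) sequences to a common length.

    Returns a dense array of shape (batch_size, max_length_x, embedding_size);
    zero vectors stand in for the real word embeddings here.
    """
    batch_x = np.zeros((len(x_tokenized), max_length_x, embedding_size), dtype="float32")
    for i, tokens in enumerate(x_tokenized):
        # in the real generator each token would be replaced by its embedding vector
        batch_x[i, :min(len(tokens), max_length_x), :] = 1.0  # placeholder values
    return batch_x
```

With truncation in place, the allocated shape is bounded by max_sequence_length rather than by the longest sequence in the batch, which is where the speed and memory gains come from.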
Oh... I forgot to use my brain, I'm afraid... 😓 Let's discuss it in PR #89.
Ah yes, thank you very much @de-code, I forgot to apply it.
No worries. Actually I had to ensure that the maximum length is only applied for training.
What do you mean? That you truncate only when training?
Possibly the header sequence input to be tagged is too large, and when truncated it results in some alignment problems in GROBID (due to the truncated end being missing from the labelled data, but expected in the tokenization on the GROBID side)?
Yes, for now. Due to an issue in the segmentation model, I am getting up to 10k tokens in the header training data, although I think I also had memory issues with GROBID's default training data. I will be looking into sliding windows. For prediction I didn't experience memory issues, which may be because it doesn't need to allocate all of the shapes. For prediction via GROBID I am using the CPU only.
Yes, it was expecting "more" and throwing an IndexOutOfBoundsException (see elifesciences/sciencebeam-trainer-delft#154). I will experiment with sliding windows as one of my next steps.
If I'm not wrong this can be closed, unless there are some more tests to do, IMHO.
Maybe the one issue that could be considered part of this is that the maximum length used to train should not necessarily be used for tagging. (I believe that is how it is currently implemented.)
Yes, I just had this problem: if the sequences are truncated during tagging, then GROBID will crash with an IndexOutOfBoundsException. The fix is in #97.
We need to fix it in Grobid then, I think. It's normal to truncate when tagging at max_sequence_length, so Grobid has to avoid looking beyond that limit.
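A sketch of what "not looking beyond max_sequence_length" could mean on the client side, in Python for illustration only (the actual fix lives in GROBID's Java code, see #97; the align_labels helper and the default label are assumptions):

```python
def align_labels(tokens, predicted_labels, default_label="O"):
    """Align model predictions back onto the full token sequence.

    If the model truncated the input at max_sequence_length, the tail of the
    sequence has no predictions; padding with a default label avoids an
    index-out-of-bounds failure while keeping token/label alignment.
    """
    aligned = list(predicted_labels[:len(tokens)])
    aligned += [default_label] * (len(tokens) - len(aligned))
    return aligned
```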
Personally I would think that the maximum length used for training should not necessarily be applied when tagging. Additionally, I would be careful that this affects the evaluation in a "positive" way, i.e. a model truncating to a smaller maximum sequence length could appear to score better. There are a few options:
@de-code I would say yes and no :)

For something like Grobid, where we really abstract the process of sequence labelling, I think we indeed don't want to deal with the issue of the max sequence length (which does not exist for graphical models like CRF). So we expect to have labels, and it's up to the underlying sequence labelling "engine" to ensure that. However, as in practice with DL we cannot currently ensure that, from the client/Grobid point of view it's also normal to benchmark the algorithms as they are for the different tasks, and if a task uses long sequences and results are crap, it's a fair evaluation as compared to other methods which have no flaw with long sequences. I think that is what you are saying, not to influence the benchmark in a positive way by only considering small-sequence tasks?

For DeLFT itself, I think the problem of max sequence length is so important and constraining with Deep Learning that it has to be made visible to the users. The tasks have to be adapted according to the limit of the method, given that we cannot easily do the contrary right now. Even with sliding windows, usage of overlapping sequences, or some sort of truncated backpropagation, accuracy and performance will be impacted.

About the options, I have doubts about the first option. Except for BERT models, for prediction we can for sure leave the sequence larger than the max_sequence_length used for training. The second option is what is done with Grobid now; I think it's better to make the problem visible to the user in DeLFT, as mentioned above. The third option is going to be very interesting to explore, but I am a bit worried about the impact on accuracy and performance. Some of it might be introduced in DeLFT as distinct architectures, especially for BERT for instance.
Just some early results which are related to the discussion, comparing Wapiti and the DL model: page numbers got worse, but better results on the others. In the end-to-end evaluation, the author surnames got a bit worse, which seems to be due to line numbers. I actually started with 1800 training documents but seem to have lost a few during the conversion to DeLFT (maybe due to the way I had annotated the line numbers). But that's another topic, just mentioning it to explain the drop.

This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training. I also didn't experience any memory issues evaluating or tagging on CPU. (The longest validation sequence is 169257 tokens whereas the median is around 1000.)

EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section for one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.
I think that the issue comes from fields that contain more tokens than the maximum sequence length.
This is another model that must take as input a sequence longer than the limit, resulting in possibly bad or incomplete results. From the DeLFT point of view, IMHO this issue can be closed. As a short-term plan we had better migrate to Wapiti all the models that exceed such a limit (we might have really bad results otherwise) (kermitt2/grobid#559), and investigate the sliding window more in depth (#90).
Which fields are you thinking of? I have a hunch that it is coming mainly from spacing issues (kermitt2/grobid#564) but I haven't investigated it yet.
The issue with the bad reference extraction seems to be mainly due to the line numbers (it went away after I removed them in GROBID), and it is probably not helped by the current segmentation DeLFT model only using the first token (#99). In my opinion the maximum training sequence length is not the same as what should be used for prediction, because documents are of variable lengths (and we are using an RNN). Using a maximum length equal to the median length means the model will still have seen a whole document about half of the time; I chose something that was three times the median. I guess evaluation will be useful. To summarise, my preference would be to leave the option open to use a different maximum length, if any. But I guess the issue can be closed as the maximum sequence length is now used.
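Choosing the limit from the length distribution of the training data, as described above, only takes a few lines; the suggest_max_sequence_length helper and the multiplier of 3 are just illustrations of the choice mentioned in this thread, not a recommendation baked into DeLFT:

```python
import statistics


def suggest_max_sequence_length(token_sequences, multiplier=3):
    """Suggest a max_sequence_length as a multiple of the median training sequence length."""
    lengths = [len(tokens) for tokens in token_sequences]
    median_length = statistics.median(lengths)
    return int(median_length * multiplier)

# e.g. with a median around 1000 tokens this would suggest ~3000,
# the value used for the early header-model results above
```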
Sorry, I wrote
I think this is obsolete and can be closed.
Yes.