NER with bert #21

Open
byzhang opened this issue Jan 11, 2019 · 12 comments
Labels
question Further information is requested

Comments

@byzhang

byzhang commented Jan 11, 2019

Do you have a plan to reproduce the BERT NER model? I tried, but with BERT-base, the best micro-avg test F1 I get on CoNLL-2003 is 91.37, while the score reported in the paper is 92.4.

@kermitt2
Owner

Hello @byzhang ! Yes, I plan to reproduce BERT NER when I find the time (also FLAIR).

Did you use the fine-tuning approach or the "ELMo-like" feature-based approach they describe in section 5.4 of their paper?

In the NER evaluation of their paper, it was unclear whether or not they used the CoNLL 2003 dev section for training, which can make quite a big difference in the final f-score (but not as big as what you mention).

@byzhang
Author

byzhang commented Jan 11, 2019

I used the fine-tuning approach, and the dev set is used for hyperparameter tuning and early stopping only.

@kermitt2
Owner

see ongoing work on PR #78

@kermitt2
Owner

The best run I could get with BERT-base-en (cased) is 91.68 on the CoNLL 2003 NER test set, tuning with the dev set and training only with the train set, but I added a CRF activation layer for fine-tuning instead of the default softmax (CRF brings around +0.3 to the f-score). So this is in line with what you got.

Averaged over 10 training+eval runs, this gives 91.20, so very far from the reported 92.4.

As discussed there, the results reported in the paper for NER are likely token-level scores, not entity-level - very misleading of course.
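
To illustrate why the two kinds of score are not interchangeable, here is a minimal sketch comparing an entity-level F1 (exact span match, as in the usual CoNLL evaluation) with one possible token-level F1. The helper names and the token-level definition are assumptions made for this example; it is not the scoring code of the BERT paper or of this project.

```python
# Minimal sketch: token-level vs entity-level F1 on BIO tags.
# Helper names are illustrative and not taken from any library.

def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def f1(tp, n_pred, n_gold):
    """F1 from true positives, number predicted and number expected."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold, pred):
    """An entity counts only if its type and both boundaries match."""
    g, p = extract_entities(gold), extract_entities(pred)
    return f1(len(g & p), len(p), len(g))


def token_f1(gold, pred):
    """Each non-O token is scored on its own (one possible definition)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(tp, n_pred, n_gold)


gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-PER"]  # ORG span is one token short

print(f"token-level F1:  {token_f1(gold, pred):.3f}")
print(f"entity-level F1: {entity_f1(gold, pred):.3f}")
```

A single boundary error costs a whole entity at entity level but only one token at token level, so token-level numbers tend to come out higher on the same predictions.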

@kermitt2 added the "question" (Further information is requested) label on Dec 30, 2019
@ghaddarAbs

In order to reproduce the CoNLL scores reported in the BERT paper (92.4 for bert-base and 92.8 for bert-large), one trick is to apply a truecaser to article titles (all-upper-case sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done simply with the following method.

# https://github.com/daltonfury42/truecase
# pip install truecase
import re

import truecase


# original tokens
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    # collect the purely alphabetic tokens together with their positions
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    # among those, keep the tokens written entirely in upper case
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase sentences whose alphabetic tokens are all upper case
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization ...
        # skip if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw

    # return the tokens in every case (modified in place when truecased)
    return tokens

# truecased tokens
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']

Also, I found it useful to use a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40).

With this configuration and preprocessing, I was able to reach 92.8 with bert-large.

@kermitt2
Owner

kermitt2 commented Aug 16, 2020

Hello @ghaddarAbs !

Thank you for your message and for taking the time to share your experiments on reproducing the reported BERT results.

Sorry it took me some time to come back to this.

I've tried to measure the impact of the truecase pre-processing with bert-base-en (cased), so with the reported 92.4 f-score in mind (I am using bert-base because I don't have easy access to a GPU for bert-large). Below, I didn't touch the hyper-parameters:

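|               | bert-base           | bert-base+CRF       | BidLSTM-CRF (glove) |
|---------------|---------------------|---------------------|---------------------|
| no truecase   | 90.77 (90.43-91.15) | 91.20 (90.78-91.68) | 90.75 (90.39-91.35) |
| with truecase | -                   | 91.42 (91.22-91.74) | 90.77 (90.43-91.15) |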

The scores are averaged over 10 training runs, with the worst and best scores in parentheses. So the gain from the pre-processing alone is significant (+0.22) but not big. Apparently truecasing has no impact on BidLSTM-CRF, but it does have an impact with BERT. I guess this is because BERT's vocabulary is case-sensitive and does not cover extra casing variants among its 30522 sub-tokens, while BidLSTM has a dedicated character input channel that deals very well with generalization over casing (which also explains why adding "casing" features to BidLSTM-CRF has zero effect).
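
As a small illustration of this reporting convention, the sketch below formats a list of per-run scores as "mean (worst-best)" and shows how far the best run sits above the average. The scores in `runs` are made-up placeholders, not results from these experiments, and `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean


def summarize_runs(scores):
    """Format scores as 'mean (worst-best)', like the cells of the table above."""
    return f"{mean(scores):.2f} ({min(scores):.2f}-{max(scores):.2f})"


# Hypothetical F1 scores from five runs, NOT actual results from this thread.
runs = [91.0, 91.5, 90.5, 91.25, 90.75]

print(summarize_runs(runs))
print(f"best run minus average: {max(runs) - mean(runs):.2f}")
```

Reporting only the best run hides this gap, which is why a single best score and an average over runs should not be compared directly.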

In terms of evaluation, I think we are no longer comparing just NER algorithms here; we are also evaluating the truecase tool, which is what people usually call "using external knowledge".

I've started to experiment with the hyper-parameters you indicated, but it takes a lot of time (> 40 epochs with such a low learning rate is really different from the 3-6 epochs usually selected with BERT; it takes days and days with 10 runs :/).

Regarding your results, may I ask you the following questions:

  • Are you using a final CRF layer instead of the default softmax? It improves the score by +0.3 for me, but it was not used in the BERT paper.
  • Are you using the best run or a random run for the score you report? Or are you using an average (over 5 or 10 runs... the BERT paper reports an average over 5 runs)? Choosing the best run typically adds +0.6 to the average, but is not comparable with the usual evaluation method.
  • Are you using the dev set for training? It adds +0.3 to +0.4, but we should not do that if we want to compare with other works.

On my side, when I add all these "tricks", I am not very far from the reported score (but 0.3-0.4 is still missing). However, from the reproducibility point of view, according to the original BERT paper they are not using any of them (thus the 91.20 f-score versus the reported 92.4). From the evaluation point of view, I must say that using these tricks makes the evaluation no longer comparable with other reported numbers, unless we add them to the other algorithms too.

@ghaddarAbs

@kermitt2 ...

I used GPU with 32 GB for these experiments.

To answer your 3 questions:

  • Yes, I am using a CRF; it gives slightly better results.
  • All experiments are for 5 runs.
  • I train on the train set only, and pick the model that performs best on dev to report results on the test set.

My own intuition is that the authors of BERT applied truecasing to CoNLL-2003 when fine-tuning for NER. It was the only way for me to reproduce their results, but I don't actually know whether they did it or not. Of course, if truecasing is applied, then the results are not comparable with previous works.

@wangxinyu0922

I trained the NER model with bert-base-cased and truecase as well, and found that it can get a 91.72 F1 score on average, but this is still far from the score reported in the BERT paper.

@pinesnow72

@ghaddarAbs, @kermitt2

I tried truecase with bert-base-cased and it gave a small improvement, but the test F1 still stayed below 92.0. The BERT paper says that they used maximal document context for NER. I think that means they used the left/right sentence context when predicting the target sentence. I tried this document context and could get around 92.4 test F1.

@BCWang93

BCWang93 commented Aug 9, 2022

@pinesnow72
Hi, how do you use the document context in CoNLL-2003? Can you share your method? Thanks!

@pinesnow72

pinesnow72 commented Aug 9, 2022

@BCWang93
For each sentence, I added the tokens (sub-tokens for BERT) of the previous and next sentences before and after the target sentence, respectively, to fill the max-len of each sample as much as possible. I put the target sentence in the middle, so roughly the same number of left and right sentence tokens were added as context. Of course, CLS and SEP were inserted at the beginning and at sentence boundaries, respectively. These context-added samples are passed to the BERT encoder, but the output labels should be predicted only for the target sentence of each sample. To do this, I implemented a TargetSelection layer and added it before the classification layer. It takes the BERT output and the target sentence token indices as inputs and selects only the target sentence encodings from the context-added BERT output, using the tf.gather() method.
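
The windowing step described above could be sketched roughly as follows. This is an illustration under assumptions, not the code actually used for these results: `build_context_sample` is a hypothetical name, sentences are assumed to be already split into sub-tokens, and the exact placement of `[SEP]` and the way the budget is shared between left and right context are one possible reading of the description.

```python
def build_context_sample(doc, target_idx, max_len=512):
    """Surround one sentence with as much document context as fits.

    doc: list of sentences, each a list of sub-tokens.
    Returns (tokens, target_positions), where target_positions are the
    indices of the target sentence's sub-tokens inside `tokens`.
    """
    target = doc[target_idx]
    # budget left for context after [CLS], the target and its closing [SEP]
    budget = max_len - len(target) - 2
    if budget < 0:
        raise ValueError("target sentence does not fit into max_len")

    # flatten neighbouring sentences, marking each sentence boundary with [SEP]
    left, right = [], []
    for sent in doc[:target_idx]:
        left += sent + ["[SEP]"]
    for sent in doc[target_idx + 1:]:
        right += sent + ["[SEP]"]

    # split the budget roughly in half, then hand unused room to the other side
    n_left = min(len(left), budget // 2)
    n_right = min(len(right), budget - n_left)
    n_left = min(len(left), budget - n_right)

    left_ctx = left[len(left) - n_left:]  # tokens closest to the target
    tokens = ["[CLS]"] + left_ctx + target + ["[SEP]"] + right[:n_right]
    start = 1 + n_left
    return tokens, list(range(start, start + len(target)))


# Toy document of three "sentences" (placeholder sub-tokens).
doc = [["a", "b"], ["c", "d", "e"], ["f"]]
tokens, positions = build_context_sample(doc, target_idx=1, max_len=32)
print(tokens)
print([tokens[i] for i in positions])  # only the target sentence is labelled
```

The returned positions play the role of the target sentence token indices mentioned above: they are what a selection step such as tf.gather() would use to keep only the target sentence encodings before classification.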

@BCWang93

@pinesnow72, hi, can you share some of the code you use to process the data with this method? Thanks!
