Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: fix entity aggregation bug for NER detection
It looks like the bug occurs because we’re using the “FIRST” aggregation strategy with a tokenizer that is not word-aware: the pipeline falls back to a heuristic (the presence of spaces before or after the word) that fails here. Indeed, the XLM-RoBERTa model does not use the same tokenizer as RoBERTa: it uses a Unigram model (instead of BPE), which is not word-aware.

Another issue with the “FIRST” aggregation strategy is that the final period after the ingredient list is predicted as part of the ingredient list, even though it is not in the non-aggregated prediction. Switching to the “SIMPLE” strategy (a strategy without an error correction mechanism) removes this issue, but two subwords belonging to the same word are then sometimes predicted as belonging to two different entities.

A more in-depth analysis of the TokenClassificationPipeline reveals that the issue comes from the Punctuation() pre-tokenizer we added: it was not included in the original tokenizer, and the heuristic doesn’t take it into account, leading to incorrect detection. I updated the heuristic to use the `word_ids` provided by the tokenizer to determine whether a token is a subword (with respect to the pre-tokenization output).
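To illustrate the idea of deciding subword status from `word_ids` rather than from surrounding spaces, here is a minimal sketch in plain Python. It is not the code from this commit: the helper names `is_subword_flags` and `aggregate_first`, the label `ING`, and the hand-written tokens and word ids are all assumptions made for the example. With a Hugging Face fast tokenizer, a comparable list of ids would come from `BatchEncoding.word_ids()`.

```python
from typing import List, Optional, Tuple


def is_subword_flags(word_ids: List[Optional[int]]) -> List[bool]:
    """Flag each token that continues the word started by the previous token."""
    flags = []
    previous = None
    for word_id in word_ids:
        # Special tokens carry a word id of None and are never subwords.
        flags.append(word_id is not None and word_id == previous)
        previous = word_id
    return flags


def aggregate_first(
    tokens: List[str],
    labels: List[str],
    word_ids: List[Optional[int]],
) -> List[Tuple[str, str]]:
    """Merge subwords into words; each word keeps the label of its first token."""
    words: List[Tuple[str, str]] = []
    flags = is_subword_flags(word_ids)
    for token, label, word_id, is_subword in zip(tokens, labels, word_ids, flags):
        if word_id is None:
            continue  # skip special tokens such as <s> and </s>
        if is_subword:
            text, first_label = words[-1]
            words[-1] = (text + token, first_label)
        else:
            words.append((token, label))
    return words


# Hypothetical, hand-written example: punctuation receives its own word id,
# as it would when a punctuation pre-tokenizer splits it off.
tokens = ["<s>", "▁sucre", ",", "▁far", "ine", ".", "</s>"]
word_ids = [None, 0, 1, 2, 2, 3, None]
labels = ["O", "ING", "ING", "ING", "O", "O", "O"]

result = aggregate_first(tokens, labels, word_ids)
```

In this sketch the two pieces of the third word are merged under the first piece's label, while the final period stays a separate word with its own label, because its word id differs from that of the preceding token.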
1 parent 6eae9d5, commit 5f2b94c
Showing 2 changed files with 60 additions and 109 deletions.