-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor grobid sections #281
Conversation
src/mmda/utils/tools.py
Outdated
pad_x: bool = False, | ||
center: bool = False, | ||
unallocated_tokens_dict: Optional[Dict[int, SpanGroup]] = None | ||
) -> Union[List[SpanGroup], Tuple[List[SpanGroup], Dict[int, SpanGroup]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the union-type return. You can always just return List[SpanGroup]
. Since you're mutating unallocated_tokens_dict
when it's passed in, the caller will always have the latest.
src/mmda/utils/tools.py
Outdated
# span_groups=derived_span_groups, field_name=field_name | ||
# ) | ||
return derived_span_groups | ||
return (derived_span_groups, unallocated_tokens) if return_unallocated_tokens else derived_span_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: above, this can stay
return derived_span_groups
Chris reminds me that tt verify on figure-tables passes ✅
tt verify on bibpredictor cause
the test:
as for the spangroup overlap errors that arose in VILA (and actually came from LayoutParser, but don't anymore as of this change: https://github.com/allenai/mmda/pull/236/files#) -- you can see we used to Might be better to just have boxes, which I think is what our dream "Entity" allows (these make more sense as just boxes) however someone annotating boxgroups onto a doc expecting SpanGroups might not want this result.
|
Solution to https://github.com/allenai/scholar/issues/38452
Refactored the way Grobid sections/paragraphs/sentences are annotated onto the doc to reduce SpanGroup overlap errors
This involved refactoring the "sections" section of the code to only generate spangroups from sentences and headings (since those are the boxgroups Grobid provides) and using tuples of [optional[heading], [list of paragraphs[list of sentences]]] to make the hierarchical section/paragraph spangroups instead of trying to make huge box lists for each piece as originally written. This made it easier to pinpoint the source of SpanGroup overlap errors.
A necessary update was to make _box_groups_to_span_groups keep track of which tokens were already allocated in previous sentences, since we loop through by Grobid paragraph tags, and we were sometimes overlapping between paragraphs.
Another update to _box_groups_to_span_groups that prevents SpanGroup overlaps of a different type (when MergeSpans merges token spans encompassing already allocated tokens) was also added in.
Also added a fix for the "missing attribute: 'coords'" that was also contributing to the overall failure rate.
Ran this on a list of past test PDFs, as well as recently failed PDFs from spp prod logs and they all pass (snippet from jupyter notebook includes my personal debugging comments...)
TODO: