Refactor grobid sections #281

Merged
merged 14 commits into from
Nov 9, 2023


@geli-gel geli-gel commented Nov 7, 2023

Solution to https://github.com/allenai/scholar/issues/38452

Refactored the way Grobid sections/paragraphs/sentences are annotated onto the doc, to reduce SpanGroup overlap errors.

This involved refactoring the "sections" part of the code to generate SpanGroups only from sentences and headings (since those are the BoxGroups Grobid provides), and using tuples of [Optional[heading], [list of paragraphs [list of sentences]]] to build the hierarchical section/paragraph SpanGroups, instead of trying to assemble huge box lists for each piece as originally written. This made it easier to pinpoint the source of SpanGroup overlap errors.
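A rough sketch of the tuple structure described above, to show how paragraph and section spans can be derived from sentence/heading spans alone. This is not the mmda code: `Span`, `enclosing_span`, and `build_hierarchy` are hypothetical names, and spans are simplified to plain (start, end) token ranges.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-ins: every sentence/heading is a (start, end) token range.
Span = Tuple[int, int]
Paragraph = List[Span]                            # list of sentence spans
Section = Tuple[Optional[Span], List[Paragraph]]  # (optional heading, paragraphs)


def enclosing_span(spans: List[Span]) -> Optional[Span]:
    """Smallest (start, end) range covering all given spans, or None if empty."""
    if not spans:
        return None
    return (min(s for s, _ in spans), max(e for _, e in spans))


def build_hierarchy(sections: List[Section]) -> Tuple[List[Span], List[Span]]:
    """Derive paragraph and section spans from sentence/heading spans only."""
    paragraph_spans: List[Span] = []
    section_spans: List[Span] = []
    for heading, paragraphs in sections:
        # A section covers its heading (if any) plus all of its paragraphs.
        section_parts: List[Span] = [heading] if heading is not None else []
        for sentences in paragraphs:
            para = enclosing_span(sentences)
            if para is not None:  # skip paragraphs with no sentences
                paragraph_spans.append(para)
                section_parts.append(para)
        sec = enclosing_span(section_parts)
        if sec is not None:
            section_spans.append(sec)
    return paragraph_spans, section_spans
```

Because every higher-level span is computed from the sentence spans beneath it, an overlap at the paragraph or section level can be traced back to specific sentences.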

A necessary update was to make _box_groups_to_span_groups keep track of which tokens were already allocated to previous sentences: we loop through by Grobid paragraph tags, and tokens were sometimes being claimed by sentences in two different paragraphs.
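A minimal sketch of the allocation-tracking idea, assuming tokens are identified by integer ids and the shared pool is a plain set (the real function takes a Dict[int, SpanGroup]; `allocate_tokens` and its parameters are hypothetical names for illustration):

```python
from typing import List, Set


def allocate_tokens(
    candidates_per_box: List[List[int]], unallocated: Set[int]
) -> List[List[int]]:
    """Give each box only the candidate tokens nobody has claimed yet.

    `unallocated` is mutated in place, so the same set can be passed to
    successive calls (one call per Grobid paragraph, for example).
    """
    result: List[List[int]] = []
    for candidates in candidates_per_box:
        taken = [t for t in candidates if t in unallocated]
        unallocated.difference_update(taken)  # mark these tokens as claimed
        result.append(taken)
    return result
```

For example, if the last sentence of one paragraph claims tokens 0 to 3 and the first sentence of the next paragraph overlaps token 3, the second call only receives the tokens that are still free.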

Example of a previously failing (now passing) PDF:

The Grobid XML has a ".":

[image]

The actual PDF does not:

[image]

We end up with overlapping SpanGroups, A:

[image]

and B:

[image]

So it seems that both Grobid and PDFPlumber may be mistaking the dot of the "i" on the line below for a "." on the line in question.

Grobid splits into a new "paragraph" there, but both sentence boxes grab that ".".


Also added another update to _box_groups_to_span_groups that prevents a different type of SpanGroup overlap: the case where MergeSpans merges token spans into a span that encompasses already-allocated tokens.
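One way to picture this second kind of overlap, as a simplified sketch rather than what MergeSpans actually does: if a box holds tokens 4, 5, 7, 8 and token 6 already belongs to another SpanGroup, merging everything into a single span 4..9 would swallow token 6. Merging only strictly consecutive token ids avoids that. `contiguous_runs` is a hypothetical helper, and tokens are assumed to be integer ids.

```python
from typing import List, Tuple


def contiguous_runs(token_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge token ids into half-open (start, end) runs of consecutive ids.

    A gap in the ids (for example a token allocated elsewhere) ends the
    current run, so no run ever covers a token that was not passed in.
    """
    runs: List[Tuple[int, int]] = []
    for t in sorted(set(token_ids)):
        if runs and t == runs[-1][1]:
            runs[-1] = (runs[-1][0], t + 1)  # extend the current run
        else:
            runs.append((t, t + 1))          # start a new run
    return runs
```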


Also added a fix for the "missing attribute: 'coords'" error, which was also contributing to the overall failure rate.


Ran this on a list of past test PDFs, as well as recently failed PDFs from the spp prod logs, and they all pass (the snippet below is from a Jupyter notebook and includes my personal debugging comments...):

sha = '74da5d99e7d951f0dc9c3111186b22544a18bff5' # spangroups overlapping at paragraphs -- passes!
sha = '43659b55f75e3b2ea626bfc8eeea80afa3798c97' # spangroups overlapping at sections -- passes!
sha = 'ade545fda5015a8aac957a69a126da55451ff016' # spangroups overlapping at sections -- passes!
sha = '59e4c0ecfdcbaa651ca2c40625817bb83a9af4c3' # spangroups overlapping at sections -- passes!
sha = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54' # overlap at (27919, 28096, 9), (27952, 27953, 10) -- passes!
sha = '1d1d7702cc4aaa3f66c29d4eb5ac023091d601e0' # this one's effed up no paragraphs -- successful but no sentences YET does have mentions ??? -- passes!!!!

sha = '121e30c48546e671dc5e16c694c5e69b392cf8fb' # OG experimental paper (3 pager), wondering if takase et al ref is part of sentence... should be! -- yep, it is. Passes!!
sha = 'e5910c027af0ee9c1901c57f6579d903aedee7f4' # test paper for test_grobid_augment_existing...

# sha = '32ff296b592d9cb69c88e239c8e80c7cc5cb3207' # this one has weird stuff for in-between text deciphering -- passes though!
# sha = '2423065e82ffbeb15353517cd8ceed9b168f039d' # a successful one -- has nothing of the sort (GOOD) -- still works after sections refactor! -- great
# sha = 'd55e9255deeb98ca2db55cd2e9bfac22774a2c32' # messed up weird mention not found in section -- 
# sha = '7535981e48c5cccd4d101895b2a350f114d25f5f' # ok maybe better same as above
# sha = 'b936dc63ad9a1380537b0bcc889c92b6af00431e' # sentence doesn't have coords? let's see... -- passes!
# sha = '6ae0afceaaa55ac6d4ec9b5b321f9aa1334b0429' # coords.. -- passes
# sha = '0706021a12b2d74eb5f9fd2f5dc187581a8c66a5' # i jus wanna see "after discussing the risks and benefits" section if 30 is there
# sha = '63929df1d44cec7b407d063f222fcc64e3de2ad3' # dikken et al 2012 section 0 -- only 2012 is there  --- is it the mention? or the section sentence? -- NOICE. added pad_x on sentences
# sha = '51c96902345101a9f2108749ad96d869e595548d' # is it the same in grobid xml, missing numbers/years? of refs?
# sha = '6bb4b89a1dd3bb03a3a2523a2e7867c1bb73a52a' # i jus wana picture -- no i also want grobid boxes drawn (BAD)
# sha = '304e2a42e897aa728d394e2d1e60ea26f4f1c101' # is abstract just ","?? -- no.
# sha = '8dd9ac4f26bee54cf1ee85c50fd63a1f44555fd1' # let's see a recently failed one -- WORKS!! IRL it's a straight up UGLY PDF.
# sha = '2c7f2e6f481873f72c9477e6d5447d1715668da6' 
# sha = 'dfbd16a81af6763d77696a620263295c2ea230f4'
# sha = 'b1e7e7df5aa502a2922ede6325e9aae2f14f6b71'
# sha = '383cfcef25477da08c86f96f5abcd7f796a1b51e'
# sha = '8a408271a1a2226163a579499905a0f4752b5085'
# sha = 'd5be1893584b41f6b567a6ac1a4d7676d80b0b98' # works! previously failed? i think?

TODO:

  • update mmda version in spp-grobid-api

@geli-gel geli-gel requested a review from cmwilhelm November 7, 2023 23:29
pad_x: bool = False,
center: bool = False,
unallocated_tokens_dict: Optional[Dict[int, SpanGroup]] = None
) -> Union[List[SpanGroup], Tuple[List[SpanGroup], Dict[int, SpanGroup]]]:

I don't think you need the union-type return. You can always just return List[SpanGroup]. Since you're mutating unallocated_tokens_dict when it's passed in, the caller will always have the latest.
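To illustrate the point with a toy example (not the mmda function; `take_tokens` and `pool` are made-up names): a dict passed as an argument is mutated in place, so the caller sees the updates without the function having to return it.

```python
from typing import Dict, List


def take_tokens(wanted: List[int], pool: Dict[int, str]) -> List[str]:
    """Return the wanted tokens still in `pool`, removing them from `pool`."""
    return [pool.pop(i) for i in wanted if i in pool]


pool = {0: "a", 1: "b", 2: "c"}
first = take_tokens([0, 1], pool)   # claims tokens 0 and 1
second = take_tokens([1, 2], pool)  # token 1 is already gone from pool
```

The return type stays a single list in both cases, and `pool` reflects every removal after each call.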

# span_groups=derived_span_groups, field_name=field_name
# )
return derived_span_groups
return (derived_span_groups, unallocated_tokens) if return_unallocated_tokens else derived_span_groups

re: above, this can stay

return derived_span_groups


geli-gel commented Nov 8, 2023

Chris reminds me that box_groups_to_span_groups is used by anything that uses .annotate with boxgroups.

tt verify on figure-tables passes ✅
the test:

predictions = container.predict_batch(instances)

(mmda) angelez@ip-10-0-0-231 mmda % tt verify
Usage: tt verify [OPTIONS]
Try 'tt verify --help' for help.

Error: Missing option '--config-file' / '-c'.
(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml    

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 8
Using selected option: figure_table_predictors
...
 => => naming to docker.io/library/figure_table_predictors__timo-server                                                                                                                                                       0.0s
============================= test session starts ==============================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: anyio-3.7.1
collected 12 items

test_entrypoint.py .....                                                 [ 41%]
integration_tests/test_runner.py ..                                      [ 58%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 12 passed, 1 warning in 15.76s ========================

tt verify on bibentry_detection_predictor, which is affected because it calls

bib_entry_span_groups = box_groups_to_span_groups(processed_bib_entry_box_groups, doc, pad_x=True, center=True)

passes ✅
the test:
# A bib box overlapped others, causing it to end up with no spans

(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 7
Using selected option: bibentry_detection_predictor
...
 => => naming to docker.io/library/bibentry_detection_predictor__timo-server                                                                                                                                                  0.0s
x ./
x ./archive/
x ./archive/config.yaml
x ./archive/model_final.pth
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: hydra-core-1.3.2, anyio-3.7.1
collected 13 items

test_entrypoint.py .....                                                 [ 38%]
integration_tests/test_runner.py ...                                     [ 61%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46
  /usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46: DeprecationWarning: LINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use BILINEAR or Resampling.BILINEAR instead.
    def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):

../../../usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 13 passed, 2 warnings in 797.48s (0:13:17) ==================

As for the SpanGroup overlap errors that arose in VILA (they actually came from LayoutParser, but no longer do as of this change: https://github.com/allenai/mmda/pull/236/files#): you can see we used to call .annotate(blocks=[layoutparser BoxGroups]), which would invoke _box_groups_to_span_groups.
This comment shows what LayoutParser BoxGroups look like (and explains why they cause SpanGroup overlaps):
https://github.com/allenai/scholar/issues/36351#issuecomment-1584986899
This change would mask those errors, and we'd end up with SpanGroups that have BoxGroups but no spans. No errors, but probably not the best result anyway. ❌

It might be better to just have boxes, which I think is what our dream "Entity" allows (these make more sense as just boxes). However, someone annotating BoxGroups onto a doc and expecting SpanGroups might not want this result.

  • Going to update this PR so that omitting from the derived spangroups isn't the default.

@geli-gel geli-gel merged commit dd65039 into main Nov 9, 2023
5 checks passed
@geli-gel geli-gel deleted the refactor-grobid-sections branch November 9, 2023 23:23