Refactor grobid sections #281

geli-gel · 2023-11-07T23:18:11Z

Solution to https://github.com/allenai/scholar/issues/38452

Refactored the way Grobid sections/paragraphs/sentences are annotated onto the doc to reduce SpanGroup overlap errors

This involved refactoring the "sections" part of the code to generate SpanGroups only from sentences and headings (since those are the BoxGroups Grobid provides), and using tuples of [Optional[heading], [list of paragraphs [list of sentences]]] to build the hierarchical section/paragraph SpanGroups, instead of trying to assemble huge box lists for each piece as originally written. This made it easier to pinpoint the source of SpanGroup overlap errors.
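A rough sketch of that tuple shape, for readers who haven't looked at the diff. `Span` here is a simplified stand-in and the helper names are made up for illustration; this is not the code in the PR, just the idea of deriving paragraph and section spans from sentence and heading spans.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Simplified stand-in for a character span (half-open range)."""
    start: int
    end: int


Sentence = List[Span]
Paragraph = List[Sentence]
# (heading or None, paragraphs) -- the hierarchical tuple described above
Section = Tuple[Optional[Sentence], List[Paragraph]]


def paragraph_spans(paragraph: Paragraph) -> List[Span]:
    """A paragraph's spans are just the spans of its sentences, in order."""
    return [span for sentence in paragraph for span in sentence]


def section_spans(section: Section) -> List[Span]:
    """A section's spans are its heading spans (if any) plus its paragraphs' spans."""
    heading, paragraphs = section
    spans = list(heading) if heading is not None else []
    for paragraph in paragraphs:
        spans.extend(paragraph_spans(paragraph))
    return spans


example_section: Section = (
    [Span(0, 12)],                          # heading
    [
        [[Span(13, 40)], [Span(41, 70)]],   # paragraph with two sentences
        [[Span(71, 95)]],                   # paragraph with one sentence
    ],
)
```

Because every larger unit is built from the same sentence-level spans, an overlap can only come from the sentences or headings themselves.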

A necessary update was to make _box_groups_to_span_groups keep track of which tokens were already allocated to previous sentences. We loop through the document by Grobid paragraph tags, and sentences in adjacent paragraphs were sometimes claiming the same tokens.
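A minimal sketch of the allocation idea, assuming tokens and sentence boxes are plain (left, top, right, bottom) tuples on one page. The function names and the `allocated` set are illustrative only; the real function works on mmda BoxGroups and SpanGroups.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), illustrative


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assign_tokens(
    token_boxes: Dict[int, Box],
    sentence_boxes: List[Box],
    allocated: Set[int],
) -> List[List[int]]:
    """Give each sentence the tokens its box covers, skipping tokens that an
    earlier sentence (possibly from an earlier paragraph) already claimed."""
    result = []
    for sentence_box in sentence_boxes:
        claimed = [
            token_id
            for token_id, token_box in sorted(token_boxes.items())
            if token_id not in allocated and boxes_overlap(token_box, sentence_box)
        ]
        allocated.update(claimed)
        result.append(claimed)
    return result


# Token 1 is a stray "." sitting between two lines, so both sentence boxes cover it.
tokens = {0: (0, 0, 9, 9), 1: (10, 8, 12, 12), 2: (0, 11, 9, 19)}
allocated: Set[int] = set()
paragraph_one = assign_tokens(tokens, [(0, 0, 12, 10)], allocated)
paragraph_two = assign_tokens(tokens, [(0, 10, 12, 20)], allocated)
```

The same `allocated` set is passed to both calls, which is what stops the second paragraph from grabbing token 1 again.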

Example of a previously failing (now passing) PDF.
The Grobid XML has a ".":

the actual PDF does not:

we end up with overlapping spangroups
A:

and B:

So it seems that both Grobid and PDFPlumber may be mistaking the dot of the "i" on the line below for a "." on the line in question.

Grobid starts a new "paragraph" there, but both sentence boxes grab that ".".

Also added another update to _box_groups_to_span_groups that prevents a different type of SpanGroup overlap (when MergeSpans merges token spans that encompass already-allocated tokens).
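To illustrate the failure mode: if a sentence claims tokens 3, 4, 6 and 7, and token 5 already belongs to another sentence, a merge that bridges small gaps would produce one span covering 3 through 7 and swallow token 5. The sketch below is a stand-in for that kind of merge, not the MergeSpans implementation, and `merge_claimed_tokens` and `max_gap` are hypothetical names.

```python
from typing import List, Set, Tuple


def merge_claimed_tokens(
    claimed: List[int],
    allocated: Set[int],
    max_gap: int = 1,
) -> List[Tuple[int, int]]:
    """Merge token indices into inclusive (first, last) runs.

    Gaps of up to `max_gap` tokens are bridged, unless the gap contains a
    token that was already allocated elsewhere.
    """
    runs: List[Tuple[int, int]] = []
    for token_id in sorted(claimed):
        if runs:
            first, last = runs[-1]
            gap = range(last + 1, token_id)
            gap_is_free = not any(g in allocated for g in gap)
            if len(gap) <= max_gap and gap_is_free:
                runs[-1] = (first, token_id)
                continue
        runs.append((token_id, token_id))
    return runs


safe = merge_claimed_tokens([3, 4, 6, 7], allocated={5})     # [(3, 4), (6, 7)]
naive = merge_claimed_tokens([3, 4, 6, 7], allocated=set())  # [(3, 7)]
```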

Also added a fix for the "missing attribute: 'coords'" error, which was also contributing to the overall failure rate.
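The description doesn't show the fix itself, so the following is only one plausible guard: treat an element without `coords` as having no boxes instead of failing. It assumes the attribute is a ";"-separated list of comma-separated numbers (page, x, y, width, height), and it ignores the TEI namespace for brevity; `sentence_boxes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def sentence_boxes(elem: ET.Element) -> List[Tuple[float, ...]]:
    """Parse an element's coords attribute; return [] when it is missing or empty."""
    coords = elem.attrib.get("coords")
    if not coords:
        return []
    return [
        tuple(float(value) for value in box.split(","))
        for box in coords.split(";")
        if box
    ]


xml_text = (
    "<p>"
    '<s coords="1,10.0,20.0,100.0,8.0;1,10.0,30.0,50.0,8.0">First.</s>'
    "<s>No coords here.</s>"
    "</p>"
)
root = ET.fromstring(xml_text)
boxes_per_sentence = [sentence_boxes(s) for s in root.iter("s")]
```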

Ran this on a list of past test PDFs, as well as recently failed PDFs from the spp prod logs, and they all pass (the snippet below is from a Jupyter notebook and includes my personal debugging comments...)

sha = '74da5d99e7d951f0dc9c3111186b22544a18bff5' # spangroups overlapping at paragraphs -- passes!
sha = '43659b55f75e3b2ea626bfc8eeea80afa3798c97' # spangroups overlapping at sections -- passes!
sha = 'ade545fda5015a8aac957a69a126da55451ff016' # spangroups overlapping at sections -- passes!
sha = '59e4c0ecfdcbaa651ca2c40625817bb83a9af4c3' # spangroups overlapping at sections -- passes!
sha = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54' # overlap at (27919, 28096, 9), (27952, 27953, 10) -- passes!
sha = '1d1d7702cc4aaa3f66c29d4eb5ac023091d601e0' # this one's effed up no paragraphs -- successful but no sentences YET does have mentions ??? -- passes!!!!

sha = '121e30c48546e671dc5e16c694c5e69b392cf8fb' # OG experimental paper (3 pager), wondering if takase et al ref is part of sentence... should be! -- yep, it is. Passes!!
sha = 'e5910c027af0ee9c1901c57f6579d903aedee7f4' # test paper for test_grobid_augment_existing...

# sha = '32ff296b592d9cb69c88e239c8e80c7cc5cb3207' # this one has weird stuff for in-between text deciphering -- passes though!
# sha = '2423065e82ffbeb15353517cd8ceed9b168f039d' # a successful one -- has nothing of the sort (GOOD) -- still works after sections refactor! -- great
# sha = 'd55e9255deeb98ca2db55cd2e9bfac22774a2c32' # messed up weird mention not found in section -- 
# sha = '7535981e48c5cccd4d101895b2a350f114d25f5f' # ok maybe better same as above
# sha = 'b936dc63ad9a1380537b0bcc889c92b6af00431e' # sentence doesn't have coords? let's see... -- passes!
# sha = '6ae0afceaaa55ac6d4ec9b5b321f9aa1334b0429' # coords.. -- passes
# sha = '0706021a12b2d74eb5f9fd2f5dc187581a8c66a5' # i jus wanna see "after discussing the risks and benefits" section if 30 is there
# sha = '63929df1d44cec7b407d063f222fcc64e3de2ad3' # dikken et al 2012 section 0 -- only 2012 is there  --- is it the mention? or the section sentence? -- NOICE. added pad_x on sentences
# sha = '51c96902345101a9f2108749ad96d869e595548d' # is it the same in grobid xml, missing numbers/years? of refs?
# sha = '6bb4b89a1dd3bb03a3a2523a2e7867c1bb73a52a' # i jus wana picture -- no i also want grobid boxes drawn (BAD)
# sha = '304e2a42e897aa728d394e2d1e60ea26f4f1c101' # is abstract just ","?? -- no.
# sha = '8dd9ac4f26bee54cf1ee85c50fd63a1f44555fd1' # let's see a recently failed one -- WORKS!! IRL it's a straight up UGLY PDF.
# sha = '2c7f2e6f481873f72c9477e6d5447d1715668da6' 
# sha = 'dfbd16a81af6763d77696a620263295c2ea230f4'
# sha = 'b1e7e7df5aa502a2922ede6325e9aae2f14f6b71'
# sha = '383cfcef25477da08c86f96f5abcd7f796a1b51e'
# sha = '8a408271a1a2226163a579499905a0f4752b5085'
# sha = 'd5be1893584b41f6b567a6ac1a4d7676d80b0b98' # works! previously failed? i think?

TODO:

update mmda version in spp-grobid-api


cmwilhelm · 2023-11-07T23:54:16Z

src/mmda/utils/tools.py

+        pad_x: bool = False,
+        center: bool = False,
+        unallocated_tokens_dict:  Optional[Dict[int, SpanGroup]] = None
+) -> Union[List[SpanGroup], Tuple[List[SpanGroup], Dict[int, SpanGroup]]]:


I don't think you need the union-type return. You can always just return List[SpanGroup]. Since you're mutating unallocated_tokens_dict when it's passed in, the caller will always have the latest.
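A tiny illustration of the point, with a made-up `claim` helper and strings standing in for SpanGroups: because the dict is mutated in place, the caller's reference already reflects what was taken, so returning it again adds nothing.

```python
from typing import Dict, List


def claim(wanted: List[int], unallocated: Dict[int, str]) -> List[str]:
    """Pop each wanted token out of the shared pool; skip ones already taken."""
    claimed = []
    for token_id in wanted:
        if token_id in unallocated:
            claimed.append(unallocated.pop(token_id))
    return claimed


pool = {0: "The", 1: "end", 2: ".", 3: "Next"}
first = claim([0, 1, 2], pool)   # takes tokens 0, 1, 2
second = claim([2, 3], pool)     # token 2 is gone, so only token 3 is taken
```

After both calls `pool` is empty, even though `claim` only ever returned a list.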

cmwilhelm · 2023-11-07T23:57:12Z

src/mmda/utils/tools.py

-    #     span_groups=derived_span_groups, field_name=field_name
-    # )
-    return derived_span_groups
+    return (derived_span_groups, unallocated_tokens) if return_unallocated_tokens else derived_span_groups


re: above, this can stay

return derived_span_groups

geli-gel · 2023-11-08T00:23:14Z

Chris reminds me that box_groups_to_span_groups is used by anything that uses .annotate with boxgroups.

tt verify on figure-tables passes ✅
the test:

mmda/src/ai2_internal/figure_table_predictors/integration_test.py

Line 88 in c28a17f

predictions = container.predict_batch(instances)

(mmda) angelez@ip-10-0-0-231 mmda % tt verify
Usage: tt verify [OPTIONS]
Try 'tt verify --help' for help.

Error: Missing option '--config-file' / '-c'.
(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml    

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 8
Using selected option: figure_table_predictors
...
 => => naming to docker.io/library/figure_table_predictors__timo-server                                                                                                                                                       0.0s
============================= test session starts ==============================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: anyio-3.7.1
collected 12 items

test_entrypoint.py .....                                                 [ 41%]
integration_tests/test_runner.py ..                                      [ 58%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 12 passed, 1 warning in 15.76s ========================

tt verify on bibentry_detection_predictor also passes ✅. I checked it because of this call:

mmda/src/ai2_internal/bibentry_detection_predictor/interface.py

Line 105 in cab36b6

    bib_entry_span_groups = box_groups_to_span_groups(processed_bib_entry_box_groups, doc, pad_x=True, center=True)

The test:

mmda/src/ai2_internal/bibentry_detection_predictor/integration_test.py

Line 147 in 5d853d7

# A bib box overlapped others, causing it to end up with no spans

(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 7
Using selected option: bibentry_detection_predictor
...
 => => naming to docker.io/library/bibentry_detection_predictor__timo-server                                                                                                                                                  0.0s
x ./
x ./archive/
x ./archive/config.yaml
x ./archive/model_final.pth
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: hydra-core-1.3.2, anyio-3.7.1
collected 13 items

test_entrypoint.py .....                                                 [ 38%]
integration_tests/test_runner.py ...                                     [ 61%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46
  /usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46: DeprecationWarning: LINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use BILINEAR or Resampling.BILINEAR instead.
    def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):

../../../usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 13 passed, 2 warnings in 797.48s (0:13:17) ==================

As for the SpanGroup overlap errors that arose in VILA (they actually came from LayoutParser, but no longer do as of this change: https://github.com/allenai/mmda/pull/236/files#): you can see we used to call .annotate(blocks=[layoutparser BoxGroups]), which would trigger _box_groups_to_span_groups.
This comment shows what LayoutParser BoxGroups look like (and explains why they cause SpanGroup overlaps):
https://github.com/allenai/scholar/issues/36351#issuecomment-1584986899
This change would mask those errors, and we'd end up with spanless SpanGroups that still have BoxGroups. No errors, but probably not the best result anyway. ❌

It might be better to just have boxes, which I think is what our dream "Entity" allows (these make more sense as just boxes). However, someone annotating BoxGroups onto a doc and expecting SpanGroups might not want this result.

Going to update this PR so that omitting overlapping spans from the derived SpanGroups isn't the default.
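Roughly what an opt-in switch could look like. The parameter name `fix_overlaps` matches the later commit title, but the function, its signature, and the error raised are assumptions for illustration, with token ids standing in for spans.

```python
from typing import List, Set


def claim_token_ids(groups: List[List[int]], fix_overlaps: bool = False) -> List[List[int]]:
    """Assign token ids to groups in order.

    By default an overlap between groups is an error, so callers that expect
    full SpanGroups find out. With fix_overlaps=True, tokens already claimed
    by an earlier group are silently dropped from later ones.
    """
    seen: Set[int] = set()
    result = []
    for group in groups:
        clashes = [token_id for token_id in group if token_id in seen]
        if clashes and not fix_overlaps:
            raise ValueError(f"overlapping tokens: {clashes}")
        kept = [token_id for token_id in group if token_id not in seen]
        seen.update(kept)
        result.append(kept)
    return result
```

With the default, `claim_token_ids([[0, 1], [1, 2]])` raises; with `fix_overlaps=True` the second group keeps only token 2.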

geli-gel added 10 commits October 26, 2023 13:04

some kind of progress, still need to address overlap in sentences cro…

3464a02

…ssing paragraphs

ok cool this seems to be working!

a39d0c1

make heading spans part of section

df9849f

make sentences have unique ids, give paragraphs and sections ids

1cf469a

fix 'coords' error

95a6e5b

pad_x for sentences

172f073

IT WORKS we get nice spans for sentences for this one specific sha now

952ac9b

remove spanless results (useless)

002a8a2

lil rename

8ab0d61

mmda version bump

e16bc84

geli-gel requested a review from cmwilhelm November 7, 2023 23:29

cmwilhelm reviewed Nov 7, 2023


just return list

6c534cc

geli-gel added 3 commits November 7, 2023 16:31

oops delete my thoughts

5fb4c0f

oops fix my error made when switching to just list being returned

8336d06

new fix_overlaps param

7af1e93

cmwilhelm approved these changes Nov 9, 2023


geli-gel merged commit dd65039 into main Nov 9, 2023
5 checks passed

geli-gel deleted the refactor-grobid-sections branch November 9, 2023 23:23
