Text rects overlap with tables and images that should be excluded #171

Meaveryway · 2024-10-19T00:53:41Z

Originally opened this as a discussion, but after getting into the code, it appears to be an issue that impacts the extraction of not only tables but also images with text on them.

The problem is that bboxes that are supposed to be avoided during text box detection (images and tables) still end up inside the final joined text bboxes. As a result, the table's text is extracted in place as raw text, and the formatted table is pushed to the bottom of the merged text bbox.

Here are a PDF file presenting a simple mock case, the markdown that PyMuPDF4LLM outputs, and the expected output.
table_sample.pdf
table_sample.md
table_expected.md

The issue is happening in column_boxes(): The rects passed in the avoid param can get re-included because we're not checking the intersection of the new block (temp) with them at these calls:
check = can_extend(temp, nbb, nblocks, vert_bboxes) # Lines [417, 427] multi_column.py
Including the img_bboxes in the checks does seem to fix the issue at this point.
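
A minimal sketch of the check described above, using plain (x0, y0, x1, y1) tuples instead of PyMuPDF Rect objects so that it runs standalone. The names overlaps, union and extension_allowed are hypothetical and do not exist in multi_column.py; the only point is that the candidate union (temp) is rejected whenever it overlaps a rect from the avoid list.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share a non-empty area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def union(a, b):
    """Smallest box containing both a and b (the tuple analogue of Rect | Rect)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def extension_allowed(block, candidate, avoid):
    """Allow joining candidate into block only if the union stays clear of all avoid rects."""
    temp = union(block, candidate)
    return not any(overlaps(temp, r) for r in avoid)


# Hypothetical layout: a table sitting between two text blocks.
text_above = (50, 50, 300, 100)
text_below = (50, 260, 300, 300)
table = (50, 120, 300, 240)

print(extension_allowed(text_above, text_below, [table]))           # False
print(extension_allowed(text_above, (50, 100, 300, 115), [table]))  # True
```

In the first call the union of the two text blocks would swallow the table, so the extension is refused; in the second the union ends above the table and is allowed.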

Afterwards, the call to join_rects_phase3() (line 440 of multi_column.py) re-includes the excluded rects once again, because it merges rects without checking whether the result intersects an avoided rect:

# Lines [245 - 250] multi_column.py:
                    temp = prect0 | prect1
                    test = set(
                        [tuple(b) for b in prects + new_rects if b.intersects(temp)]
                    )
                    if test == set((tuple(prect0), tuple(prect1))):
                        prect0 |= prect1
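
One way the merge step could be guarded is sketched below, again with plain (x0, y0, x1, y1) tuples rather than PyMuPDF Rect objects. The function try_merge and its avoid parameter are assumptions made for illustration, not the library's actual signature: the merge keeps the original condition (the union touches only the two rects being joined) and adds a second condition that the union overlaps no avoided rect.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share a non-empty area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def try_merge(prect0, prect1, all_rects, avoid):
    """Return the union of prect0 and prect1, or None if the merge is not safe."""
    temp = (
        min(prect0[0], prect1[0]),
        min(prect0[1], prect1[1]),
        max(prect0[2], prect1[2]),
        max(prect0[3], prect1[3]),
    )
    # Original condition: the union may only touch the two rects being merged.
    touched = {r for r in all_rects if overlaps(r, temp)}
    if touched != {prect0, prect1}:
        return None
    # Added condition (assumption): the union must not cover an avoided rect.
    if any(overlaps(temp, r) for r in avoid):
        return None
    return temp


top = (50, 50, 300, 100)
bottom = (50, 260, 300, 300)
table = (50, 120, 300, 240)

print(try_merge(top, bottom, [top, bottom], [table]))  # None
print(try_merge(top, bottom, [top, bottom], []))       # (50, 50, 300, 300)
```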

Discussed in #168

Originally posted by Meaveryway October 13, 2024
Hello there,

Thanks for the wonderful work! This outperforms even most commercial solutions out there!
I have a question regarding table extraction: when extracting a PDF page that has a table to markdown, it seems that the table's raw text is first extracted and put in place of the table, and the formatted table is then placed at the bottom of the page.

Is this the desired output? Why?


Meaveryway · 2024-10-19T01:18:41Z

This function seems to be returning the opposite of what's intended (because of the negation).

# Lines [103 - 108] multi_column.py
    def intersects_bboxes(bb, bboxes):
        """Return True if a bbox touches bb, else return False."""
        for bbox in bboxes:
            if not (bb & bbox).is_valid:
                return True
        return False
# intersects_bboxes(Rect(0, 0, 10, 10), [Rect(5, 5, 20, 20)]) --> False
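
For comparison, here is a sketch of the predicate without the inverted test, written against plain (x0, y0, x1, y1) tuples so it can run without PyMuPDF. This is an illustration of the intended behaviour, not the upstream fix. It treats boxes as intersecting only when they share a non-empty area, so boxes that merely touch along an edge do not count; whether edge contact should count is a design choice the thread does not settle.

```python
def intersects_bboxes(bb, bboxes):
    """Return True if any box in bboxes overlaps bb with non-empty area."""
    for bbox in bboxes:
        # Corners of the would-be intersection rectangle.
        x0, y0 = max(bb[0], bbox[0]), max(bb[1], bbox[1])
        x1, y1 = min(bb[2], bbox[2]), min(bb[3], bbox[3])
        if x1 > x0 and y1 > y0:
            return True
    return False


print(intersects_bboxes((0, 0, 10, 10), [(5, 5, 20, 20)]))    # True
print(intersects_bboxes((0, 0, 10, 10), [(20, 20, 30, 30)]))  # False
```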

kingennio · 2024-10-19T10:09:06Z

I experienced the same issue with table extraction. It seems to me you've identified the problem in the code, but I'm not sure I followed how you corrected it yourself. Could you please elaborate a bit more so that I can try to fix the code too? Thanks

kingennio · 2024-10-19T12:02:09Z

For anyone interested, I was able to fix the issue with the help of Claude. I submitted the code and the comments by @Meaveryway, and it understood the problem and provided the fixes.
I tested this patched code on a bunch of documents with tables and it seems to work fine, whereas before the extraction resulted in the table's text being extracted in place and then repeated as a table at the end of the page, as @Meaveryway pointed out.
claude proposed fixes.pdf

Meaveryway · 2024-10-19T14:42:02Z

@kingennio I apologize for the late reply.
That's pretty much what I'm doing on my side as a temporary patch, though I didn't want to suggest it as a definitive fix because I didn't test it on a substantial variety of layouts and can't really tell how it will behave in different cases (it may work for this case but break the processing of other layouts).

SrikarNamburu · 2024-11-15T18:13:47Z

@Meaveryway @kingennio
The fixes attached above seem to be working for tables, but if there is any text on images, how can it be ignored?
Thanks in advance.

tahitimoon · 2024-11-20T06:35:41Z

0.0.17 seems to have not fixed this issue

Meaveryway mentioned this issue Oct 25, 2024

First column of table is repeated before the actual table #173

Closed
