Fix tesseract preprocessor for blank pages #202

JSv4 · 2023-02-26T05:43:35Z

Fix for Issue #201. When the processed PDF is empty, a single token appears to be returned for the page, and its text is na. This becomes a problem in extract_page_tokens in the tesseract preprocessor. At the start of the call, tokens with text of na are filtered out of the
token df:

res[~res.text.isna()]

leaving you with an empty dataframe. For pages with at least one token that is not na, you do not have an empty
dataframe. Where the dataframe is not empty and you apply groupby():

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
.apply(
    lambda gp: pd.Series(
        [
            gp["left"].min(),
            gp["top"].min(),
            gp["width"].max(),
            gp["height"].max(),
            gp["conf"].mean(),
            gp["text"].astype(str).str.cat(sep=" "),
        ]
    )
)

You wind up with dataframe columns of RangeIndex(start=0, stop=6, step=1). So, when you call rename like this:

    .rename(
        columns={
            0: "x",
            1: "y",
            2: "width",
            3: "height",
            4: "score",
            5: "text",
            "index": "id",
        }
    )

the cols with "names" of 0, 1, 2, 3, 4, and 5 ARE renamed. This doesn't happen with empty dataframes, however.
The grouping step doesn't change the df so the column names remain unchanged - you have an empty df with col names of

[id, level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text]

Thus, the renaming above fails to make any changes because there are no cols 0, 1, 2, 3, 4, or 5. And so there
is no col named "score", and when you call .drop(columns=["score", "id"]), you get KeyError: "['score'] not found in axis".
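
The failure mode can be reproduced without Tesseract. This is a minimal sketch, assuming a hand-built one-row frame (the values are made up) stands in for the token df of a blank page. The groupby step is skipped, since per the description above it leaves an empty df's columns unchanged:

```python
import pandas as pd

# Column names as listed above; the row values are made up for illustration.
columns = ["id", "level", "page_num", "block_num", "par_num", "line_num",
           "word_num", "left", "top", "width", "height", "conf", "text"]
blank_page = pd.DataFrame(
    [[0, 1, 1, 0, 0, 0, 0, 0, 0, 612, 792, -1.0, float("nan")]],
    columns=columns,
)

# Same filter as in the preprocessor: drop tokens whose text is na.
scrubbed = blank_page[~blank_page.text.isna()]

# rename() silently ignores keys that are not present, so nothing changes here.
renamed = scrubbed.rename(
    columns={0: "x", 1: "y", 2: "width", 3: "height", 4: "score", 5: "text", "index": "id"}
)

# drop() is strict by default, so the missing "score" column raises KeyError.
try:
    renamed.drop(columns=["score", "id"])
    drop_failed = False
except KeyError:
    drop_failed = True
```

The asymmetry is the root of the bug: rename() defaults to ignoring missing labels, while drop() defaults to raising on them.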

My suggested fix is to change extract_page_tokens() to test if the page's token df is empty when stripped of all tokens where text
is na. If False, proceed with the preprocessor as usual. If True, however, return an empty array.
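
Roughly, the guard could look like the sketch below. This is not the exact code in this PR: tokens_or_empty and build_tokens are hypothetical names, and build_tokens stands in for the existing groupby/rename/drop pipeline.

```python
from typing import Callable, List

import pandas as pd


def tokens_or_empty(
    page_df: pd.DataFrame,
    build_tokens: Callable[[pd.DataFrame], List[dict]],
) -> List[dict]:
    """Return [] for a page with no non-na tokens, else run the usual pipeline."""
    scrubbed = page_df[~page_df.text.isna()]
    if scrubbed.empty:
        # Blank page: skip the pipeline entirely so rename/drop never run.
        return []
    return build_tokens(scrubbed)


# Placeholder pipeline for the example only (hypothetical, not the real one).
def to_records(df: pd.DataFrame) -> List[dict]:
    return df[["left", "top", "text"]].to_dict("records")


blank = pd.DataFrame({"left": [0], "top": [0], "text": [float("nan")]})
not_blank = pd.DataFrame({"left": [10], "top": [20], "text": ["hello"]})

blank_tokens = tokens_or_empty(blank, to_records)
page_tokens = tokens_or_empty(not_blank, to_records)
```

Checking after the na filter, rather than checking the raw frame, matters because the blank page still yields one na token before filtering.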

FYI, I also changed

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"])
to

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
as I noticed a deprecation warning when the group_keys keyword arg is left out.

I've attached two sample PDFs, one blank and one not. Both process successfully now, whereas the blank one failed before:
00075cb9-9428-4270-baac-93ed12d284ef.pdf
0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf

Fixes #201

… fails when the underlying text datatype is not actually text. I assume this is rare, but it depends on the original source PDF authoring tool. I have a pdf where one page only has a number on it, and it appears the data type extracted to the dataframe is float64. This fails with the extract_page_tokens() function as written. Added .astype(str) to line 43 to force conversion to string, which should cover these kinds of corner cases. Working for me at least on the pdf that was crashing the parser.
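
A minimal sketch of that corner case, using made-up values rather than the real token df. It assumes the text column came back as float64 because the page held only numbers:

```python
import pandas as pd

# Hypothetical text column for a page whose only tokens are numbers.
text = pd.Series([42.0, 7.0])

# The .str accessor is only available for string data, so this raises.
try:
    text.str.cat(sep=" ")
    accessor_failed = False
except AttributeError:
    accessor_failed = True

# Forcing the column to str first makes the concatenation work.
joined = text.astype(str).str.cat(sep=" ")
```

Note that the numbers are joined in their float form ("42.0"), not as they appeared on the page.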

JSv4 · 2023-04-13T01:46:08Z

You guys open to merging this? I use your pre-processor in another project, and it'd be great to use your repo as a dependency instead of my fork.

JSv4 added 4 commits February 3, 2023 01:11

Merge pull request #1 from JSv4/JSv4/fix-parsing-failure-where-page-i…

e2bec40

…s-only-numbers Line 43 of cli.pawls.preprocessors.tesseract in extract_page_tokens()…

Merge branch 'allenai:main' into main

7de52c5

modified extract_page_tokens() in cli.pawls.preprocessors.tesseract t…

7fbdcec

…o properly handle parsing empty pdfs (no tokens at all in the entire pdf).

JSv4 changed the title ~~J sv4/fix tesseract preprocessor for blank pages~~ Fix tesseract preprocessor for blank pages Feb 26, 2023

JSv4 added 2 commits February 26, 2023 00:56

Accidentally created two copies of scrubbed df. Cleaned up.

cc262fe

Accidentally created two copies of scrubbed df. Cleaned up.

84e7c8e
