Fix tesseract preprocessor for blank pages #202

Open
wants to merge 6 commits into
base: main

Conversation

JSv4
Contributor

@JSv4 JSv4 commented Feb 26, 2023

Fix for Issue #201. When the processed PDF is empty, there appears to be a single token returned for the page, and its text is NA. This becomes a problem in extract_page_tokens in the tesseract preprocessor. At the start of the function, tokens whose text is NA are filtered out of the
token df:

res[~res.text.isna()]

leaving you with an empty dataframe. For pages with at least one token whose text is not NA, the dataframe is not
empty. When the dataframe is not empty and you apply groupby():

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
.apply(
    lambda gp: pd.Series(
        [
            gp["left"].min(),
            gp["top"].min(),
            gp["width"].max(),
            gp["height"].max(),
            gp["conf"].mean(),
            gp["text"].astype(str).str.cat(sep=" "),
        ]
    )
)

You wind up with a dataframe whose columns are RangeIndex(start=0, stop=6, step=1). So, when you call rename like this:

    .rename(
        columns={
            0: "x",
            1: "y",
            2: "width",
            3: "height",
            4: "score",
            5: "text",
            "index": "id",
        }
    )

the columns labeled 0, 1, 2, 3, 4, and 5 are renamed. This doesn't happen with empty dataframes, however.
The grouping step doesn't change an empty df, so the column names remain unchanged: you have an empty df with column names of

[id, level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text]

Thus, the rename above makes no changes because there are no columns 0, 1, 2, 3, 4, or 5. As a result, there
is no column named "score", and when you call .drop(columns=["score", "id"]), you get KeyError: "['score'] not found in axis".
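
The failure described here can be reproduced in isolation. The snippet below is a minimal sketch, not the preprocessor's actual code: it assumes an empty dataframe that still carries the original Tesseract column names (as described in this PR) and skips the groupby step entirely, showing only that rename() silently ignores the missing integer labels and that drop() then raises.

```python
import pandas as pd

# Column names taken from the PR description; the empty frame stands in for
# what the PR reports is left after the groupby step on a blank page.
cols = ["id", "level", "page_num", "block_num", "par_num", "line_num",
        "word_num", "left", "top", "width", "height", "conf", "text"]
empty = pd.DataFrame(columns=cols)

# rename() ignores keys that are not present (errors="ignore" is the default),
# so no column is renamed and no "score" column is created.
renamed = empty.rename(
    columns={0: "x", 1: "y", 2: "width", 3: "height", 4: "score", 5: "text", "index": "id"}
)

# drop() raises for missing labels by default (errors="raise").
try:
    renamed.drop(columns=["score", "id"])
    drop_failed = False
except KeyError:
    drop_failed = True
```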

My suggested fix is to change extract_page_tokens() to test whether the page's token df is empty once all tokens whose text
is NA have been stripped. If it is not empty, proceed with the preprocessor as usual. If it is empty, return an empty array.
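
A rough sketch of that guard is shown below. The function name extract_page_tokens_guarded and the process_tokens callback are hypothetical stand-ins for illustration; the real change lives inside extract_page_tokens() and its exact shape may differ.

```python
import pandas as pd
from typing import Callable, List


def extract_page_tokens_guarded(
    res: pd.DataFrame,
    process_tokens: Callable[[pd.DataFrame], List[dict]],
) -> List[dict]:
    """Hypothetical wrapper: skip the groupby/rename/drop pipeline on blank pages."""
    # Same filter as in the preprocessor: drop tokens whose text is NA.
    tokens = res[~res["text"].isna()]

    # A blank page leaves nothing behind, so return early with no tokens.
    if tokens.empty:
        return []

    # Otherwise hand the filtered tokens to the usual processing step
    # (process_tokens is an assumed placeholder for that step).
    return process_tokens(tokens)


# Usage: a blank page yields a single token whose text is NA.
blank_page = pd.DataFrame({"text": [float("nan")], "left": [0]})
normal_page = pd.DataFrame({"text": ["hello"], "left": [10]})

blank_result = extract_page_tokens_guarded(blank_page, lambda df: df.to_dict("records"))
normal_result = extract_page_tokens_guarded(normal_page, lambda df: df.to_dict("records"))
```

Returning early keeps the blank-page case out of the rename and drop calls altogether, so the column-name mismatch never comes into play.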

FYI, I also changed

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"])
to

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
because I noticed a deprecation warning when the group_keys keyword arg is left out.

I've attached two sample PDFs, one blank and one not. Both now process successfully, whereas the blank one failed before:
00075cb9-9428-4270-baac-93ed12d284ef.pdf
0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf

Fixes #201

JSv4 added 4 commits February 3, 2023 01:11
… fails when the underlying text datatype is not actually text. I assume this is rare, but it depends on the original source PDF authoring tool. I have a PDF where one page only has a number on it, and it appears the data type extracted to the dataframe is float64. This fails with the extract_page_tokens() function as written. Added .astype(str) to line 43 to force conversion to string, which should cover these kinds of corner cases. Working for me, at least on the PDF that was crashing the parser.
…s-only-numbers

Line 43 of cli.pawls.preprocessors.tesseract in extract_page_tokens()…
…o properly handle parsing empty pdfs (no tokens at all in the entire pdf).
@JSv4 JSv4 changed the title J sv4/fix tesseract preprocessor for blank pages Fix tesseract preprocessor for blank pages Feb 26, 2023
@JSv4
Contributor Author

JSv4 commented Apr 13, 2023

Are you guys open to merging this? I use your pre-processor in another project, and it'd be great to use your repo as a dependency instead of my fork.

Successfully merging this pull request may close these issues.

PAWLS Tesseract Preprocessor Throws Error With Blank PDF
1 participant