Fix tesseract preprocessor for blank pages #202
Fix for Issue #201. When the processed PDF is blank, a single token appears to be returned for the page, and its text is na. This becomes a problem in extract_page_tokens in the tesseract preprocessor. At the start of the call, tokens whose text is na are filtered out of the token df, leaving you with an empty dataframe. For pages with at least one token that is not na, the dataframe is not empty. Where the dataframe is not empty and you apply groupby():
You wind up with dataframe columns of RangeIndex(start=0, stop=6, step=1). So, when you call rename, the columns with "names" of 0, 1, 2, 3, 4, and 5 ARE renamed. This doesn't happen with empty dataframes, however.
The grouping step doesn't change the df, so the column names remain unchanged: you have an empty df with column names of
[id, level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text]
Thus, the rename fails to make any changes because there are no columns 0, 1, 2, 3, 4, or 5. And so there is no column named "score", and when you call .drop(columns=["score", "id"]), you get KeyError: "['score'] not found in axis"
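The failure can be reproduced in isolation with plain pandas. This is a minimal sketch and not the preprocessor's actual code. The new column names in rename_map (other than "score") are hypothetical, and the empty dataframe is built directly instead of coming from tesseract:

```python
import pandas as pd

# Column names as listed above for the empty token dataframe.
TESSERACT_COLUMNS = [
    "id", "level", "page_num", "block_num", "par_num", "line_num",
    "word_num", "left", "top", "width", "height", "conf", "text",
]

# Stand-in for a blank page after the na-text tokens are filtered out.
empty_tokens = pd.DataFrame(columns=TESSERACT_COLUMNS)

# Hypothetical mapping: only the integer keys 0-5 and the "score" target
# are taken from the description; the other new names are placeholders.
rename_map = {0: "col_0", 1: "col_1", 2: "col_2", 3: "col_3", 4: "col_4", 5: "score"}

# rename() silently ignores keys that are not present in the columns,
# so nothing changes and no "score" column is created.
renamed = empty_tokens.rename(columns=rename_map)
rename_was_noop = list(renamed.columns) == TESSERACT_COLUMNS

# drop() is strict by default and raises when a label is missing.
try:
    renamed.drop(columns=["score", "id"])
    drop_error = None
except KeyError as exc:
    drop_error = exc

print(rename_was_noop)
print(repr(drop_error))
```

The asymmetry is the root of the bug: rename tolerates missing labels by default, while drop does not.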
My suggested fix is to change extract_page_tokens() to test whether the page's token df is empty once all tokens whose text is na have been stripped. If it is not empty, proceed with the preprocessor as usual. If it is empty, return an empty array.
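The shape of that guard might look roughly like the sketch below. It is not the code in this PR: extract_page_tokens_sketch and build_tokens are hypothetical names, and build_tokens stands in for the existing groupby/rename/drop pipeline, which is not reproduced here.

```python
from typing import Callable, List

import pandas as pd


def extract_page_tokens_sketch(
    page_df: pd.DataFrame,
    build_tokens: Callable[[pd.DataFrame], List[dict]],
) -> List[dict]:
    """Hypothetical guard: skip the pipeline when a page has no real tokens."""
    # Same filter as described above: drop tokens whose text is na.
    tokens = page_df[~page_df["text"].isna()]

    # Blank page: nothing survives the filter, so return early instead of
    # running the rename/drop steps on an empty dataframe.
    if tokens.empty:
        return []

    # Non-blank page: hand off to the usual processing (assumed callable).
    return build_tokens(tokens)


def to_records(df: pd.DataFrame) -> List[dict]:
    # Placeholder for the real pipeline, used only for this example.
    return df.to_dict("records")


blank_page = pd.DataFrame({"text": [float("nan")], "left": [0]})
filled_page = pd.DataFrame({"text": ["hello", None], "left": [10, 0]})

blank_result = extract_page_tokens_sketch(blank_page, to_records)
filled_result = extract_page_tokens_sketch(filled_page, to_records)
```

Returning early keeps the non-empty path untouched, so pages that already worked go through exactly the same steps as before.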
FYI, I also changed
.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"])
to
.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
because I noticed a deprecation warning when the group_keys keyword arg is left out.
I've attached two sample PDFs, one blank and one not. Both process successfully now, whereas the blank one failed before:
00075cb9-9428-4270-baac-93ed12d284ef.pdf
0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf
Fixes #201