Extracting sectioned text ("tables") without any lines/rects #647

WinstonDoodle · 2022-04-29T20:16:34Z

WinstonDoodle
Apr 29, 2022

I'm working through extracting tables (more explicitly: "rectangular areas of the page without borders") from a few thousand PDFs. Each PDF structure is the same with respect to the x-axis breaks, but y-axis breaks are variable. All said and done, I'm looking for a table with 4 columns and each row split by the grey lines in the PDF.

extract_testing.pdf

I included my visualization of end goal:

I started with extract_table() with default settings, but none were found (i.e. list = []). I believe it's because the gray lines are not continuous across the page. They have breaks in them
Next, I toggled table_settings strategy to "text", but this caused 15 vertical breaks splitting text mid-word.
Next, I ran the objects command to identify lines. The only line on the page that is recognized is at the bottom, which is not associated with the tables. I don't understand why it's not catching any of those gray lines. I tried cropping the page so that those gray lines were not split, but still none of the lines were caught after re-running page.lines.
I ran page.find_tables() and page.rects returning nothing

I haven't successfully hooked up ImageMagick to Spyder even after assigning PATH environment variables, so this has been extensive semi-blind trial and error so far. I believe x-axis breaks should occur around 208, 361, and 474. The y-axis breaks are variable in each PDF so I don't think any hardcoded breaks make sense.

Questions:

Is there any other way to capture the breaks that I'm visualizing with red lines? Note there are cases where the text spills between columns, I'm less concerned about successfully capturing these.
If I need to take the brute-force approach, what is the syntax for feeding those x-axis breaks into explicit_vertical_lines of the table settings?
I see the extract_words() function for finding individual words, is there any way to find y-coordinates for a specific phrase (i.e. "Physician Visit with Doctor")?

jsvine · 2022-05-06T15:17:33Z

jsvine
May 6, 2022
Maintainer

Hi @WinstonDoodle, and thanks for the detailed description! Very helpful. First, I'm going to share a simple approach I think will get you most of the way (though you may need/want to customize further):

import pdfplumber
pdf = pdfplumber.open("extract_testing-repaired.pdf") # See note below
page = pdf.pages[0]
im = page.to_image()

im.debug_tablefinder({
    "explicit_vertical_lines": [ 30, 250, 360, 470, 580 ],
})

... produces:

Note: I'm using a Ghostscript-repaired version of your PDF, because the one you shared contains a PDF instruction that pdfminer.six (pdfplumber's main dependency) currently has trouble processing. (See #636 (comment) for more details an repair instructions. I've filed a PR on that repo to fix it, so hopefully will be a moot point in the near future.)

Now, onto some of your specific notes and questions:

I started with extract_table() with default settings, but none were found (i.e. list = []). I believe it's because the gray lines are not continuous across the page. They have breaks in them

Good guess! ... but not quite the reason. The table-extraction algorithm does seem to be handling those breaks fine. (If you have even larger breaks, you can use the join_tolerance extraction setting.) In this case, rather, it's because there are scant vertical lines.

Next, I ran the objects command to identify lines. The only line on the page that is recognized is at the bottom, which is not associated with the tables. I don't understand why it's not catching any of those gray lines.

In this case, that's because those lines are actually rects, at least as parsed by pdfminer.six. To see what I mean, try running im.reset().draw_rects(page.rects) — or, since you mention you're having difficulty getting ImageMagick hooked up to your IDE, just print(page.rects). In any case: If you want to gather both standard lines and the implicit lines within rects, you can do so via page.edges.

If I need to take the brute-force approach, what is the syntax for feeding those x-axis breaks into explicit_vertical_lines of the table settings?

Yep, I think this is going to be the best approach for you, as demonstrated at the top of my response here. Here's the relevant note from the README.md:

Items in the list should be either numbers — indicating the x coordinate of a line the full height of the page — or line/rect/curve objects.

As you'll see above, I went for the numbers approach, because it seemed fine for the vertical lines to extend the full height of the page.

I see the extract_words() function for finding individual words, is there any way to find y-coordinates for a specific phrase (i.e. "Physician Visit with Doctor")?

Not currently, but it's among the features I'd like someday to add: #201

0 replies

WinstonDoodle · 2022-06-16T19:01:46Z

WinstonDoodle
Jun 16, 2022
Author

Thanks for the thorough write-up. This project was sidelined for a month, but thrilled to pick it back up. The explicit_vertical_lines table_settings slices the PDF perfectly. The issue is that the lines and rects horizontally are all still returning null (e.g., print(page.rects), print(page.edges), and print(page.lines) all producting "[] or NoneType Object"), so I haven't been able to lean on "horizontal_strategy": "lines" yet.

Based on your review, it sounds like this is due to pdfminer not liking the format of the PDFs I'm working with. Can you share the repaired version so that I can see if the horizontal lines are recognized on my end as well?

I'm not familiar with type of coding syntax, but I downloaded GhostScript and plan to run something along the lines of:
gs ^
-o "extract_testing-repaired.pdf" ^ #name of output file. "-o" is short for "-sOutputFile="
-sDEVICE=pdfwrite ^
-dPDFSETTINGS=/prepress ^ #output in an optimized format
"extract_testing.pdf" #name of file to process

If the repair step is all that is needed, I'll need to figure out how to run a repair loop on all of these PDFs prior to pushing them through pdfplumber.

I found a few resources I plan to dig through to figure out how to write this repair script:
https://www.ghostscript.com/doc/current/Use.htm#Finding_files
https://superuser.com/questions/1066419/how-to-use-ghostscript-for-windows-to-repair-damaged-pdf-files?noredirect=1&lq=1
https://stackoverflow.com/questions/34595527/use-ghostscript-to-convert-every-file-in-a-folder-automatically

1 reply

WinstonDoodle Jun 16, 2022
Author

Following up on my message above confirming "repairing" the PDF via GhostScript resolved the identification of horizontal lines:

I figured out how to run GhostScript through the cmd line taking the following steps (detailed steps below because this is all new to me)

Open Command Prompt
Change the directory to wherever the PDF is stored: cd "C:\Users\WinstonDoodle\Documents\test\extract_testing.pdf"
Run the repair script by first referencing where the GhostScript execution file is stored:
"C:\Program Files\gs\gs9.53.3\bin\gswin64.exe" -o extract_testing-repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress extract_testing.pdf

Next, I returned to Python and ran page.table_extract(ts) on the repaired version with explicit vertical lines discussed above and "horizontal_strategy": "lines". The output looks as expected and consistent with your im.debug_tablefinder screenshot above!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extracting sectioned text ("tables") without any lines/rects #647

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Extracting sectioned text ("tables") without any lines/rects #647

WinstonDoodle Apr 29, 2022

Replies: 2 comments · 1 reply

jsvine May 6, 2022 Maintainer

WinstonDoodle Jun 16, 2022 Author

WinstonDoodle Jun 16, 2022 Author

WinstonDoodle
Apr 29, 2022

Replies: 2 comments 1 reply

jsvine
May 6, 2022
Maintainer

WinstonDoodle
Jun 16, 2022
Author

WinstonDoodle Jun 16, 2022
Author