Extracting sectioned text ("tables") without any lines/rects #647
Replies: 2 comments 1 reply
-
Hi @WinstonDoodle, and thanks for the detailed description! Very helpful. First, I'm going to share a simple approach I think will get you most of the way (though you may need/want to customize further):
import pdfplumber
pdf = pdfplumber.open("extract_testing-repaired.pdf") # See note below
page = pdf.pages[0]
im = page.to_image()
im.debug_tablefinder({
"explicit_vertical_lines": [ 30, 250, 360, 470, 580 ],
})
... produces:
Note: I'm using a Ghostscript-repaired version of your PDF, because the one you shared contains a PDF instruction that
Now, onto some of your specific notes and questions:
Good guess! ... but not quite the reason. The table-extraction algorithm does seem to be handling those breaks fine. (If you have even larger breaks, you can use the
In this case, that's because those lines are actually
Yep, I think this is going to be the best approach for you, as demonstrated at the top of my response here. Here's the relevant note from the README.md:
As you'll see above, I went for the numbers approach, because it seemed fine for the vertical lines to extend the full height of the page.
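Once the debug image looks right, the same settings can be handed to `page.extract_table()` to get the rows themselves. What follows is a minimal sketch, assuming the repaired file name and x-positions from the example above; `build_table_settings` and `extract_rows` are hypothetical helper names for illustration, not part of pdfplumber.

```python
def build_table_settings(x_breaks):
    """Return pdfplumber table settings that add fixed vertical dividers.

    x_breaks: x-coordinates (in PDF points) where column edges should sit.
    With the default line-based strategies, these explicit lines are used
    in addition to whatever lines pdfplumber detects on the page.
    """
    return {"explicit_vertical_lines": sorted(x_breaks)}


def extract_rows(pdf_path, x_breaks, page_number=0):
    """Extract one table from a page as a list of rows (lists of cell text)."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        table = page.extract_table(build_table_settings(x_breaks))
    # extract_table() returns None when no table is found
    return table or []


# Example call (needs pdfplumber installed and the repaired PDF on disk):
# rows = extract_rows("extract_testing-repaired.pdf", [30, 250, 360, 470, 580])
```

Because the vertical lines are given as plain numbers, they span the full page height, so only the horizontal rules in each PDF decide where the rows break.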
Not currently, but it's among the features I'd like someday to add: #201
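In the meantime, one possible workaround is to scan the output of `page.extract_words()` for a run of consecutive words that spells out the phrase, then merge their coordinates. This is only a sketch: `find_phrase` is a hypothetical helper, and it assumes each word dict carries the `text`, `x0`, `x1`, `top` and `bottom` keys and that the words arrive in reading order.

```python
def find_phrase(words, phrase):
    """Return one bounding box per occurrence of `phrase` in `words`.

    words:  list of dicts shaped like page.extract_words() output.
    phrase: whitespace-separated target, matched word by word
            (case-sensitive, exact match on each word's text).
    """
    target = phrase.split()
    n = len(target)
    matches = []
    if n == 0:
        return matches
    # Slide a window of n words across the page and compare the texts.
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w["text"] for w in window] == target:
            matches.append({
                "text": phrase,
                "x0": min(w["x0"] for w in window),
                "x1": max(w["x1"] for w in window),
                "top": min(w["top"] for w in window),
                "bottom": max(w["bottom"] for w in window),
            })
    return matches


# Example call (assumes `page` is a pdfplumber page):
# hits = find_phrase(page.extract_words(), "Physician Visit with Doctor")
# y_positions = [(h["top"], h["bottom"]) for h in hits]
```

Exact word matching means punctuation attached to a word (for example a trailing comma) will prevent a match, and a phrase wrapped across two lines yields a box covering both lines.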
-
Thanks for the thorough write-up. This project was sidelined for a month, but I'm thrilled to pick it back up. The
Based on your review, it sounds like this is due to pdfminer not liking the format of the PDFs I'm working with. Can you share the repaired version so that I can see if the horizontal lines are recognized on my end as well?
I'm not familiar with this type of coding syntax, but I downloaded Ghostscript and plan to run something along the lines of:
If the repair step is all that is needed, I'll need to figure out how to run a repair loop on all of these PDFs prior to pushing them through pdfplumber. I found a few resources I plan to dig through to figure out how to write this repair script:
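A repair loop of that kind could be sketched in Python roughly as follows. The exact Ghostscript command used for the repaired file in this thread is not shown, so the `-o` / `-sDEVICE=pdfwrite` invocation here is an assumption, as are the helper names (`repair_command`, `plan_repairs`, `repair_all`) and the `-repaired.pdf` naming scheme. On Windows the console executable is often named `gswin64c` rather than `gs`, so it is left as a parameter.

```python
import subprocess
from pathlib import Path


def repair_command(src, dst, gs_executable="gs"):
    """Build the Ghostscript call that rewrites `src` into `dst`."""
    return [gs_executable, "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def plan_repairs(input_dir, output_dir):
    """Pair every PDF in input_dir with a '-repaired.pdf' path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (src, output_dir / f"{src.stem}-repaired.pdf")
        for src in sorted(input_dir.glob("*.pdf"))
    ]


def repair_all(input_dir, output_dir, gs_executable="gs"):
    """Run Ghostscript on each PDF; return the source paths that failed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for src, dst in plan_repairs(input_dir, output_dir):
        result = subprocess.run(
            repair_command(src, dst, gs_executable),
            capture_output=True,
        )
        if result.returncode != 0:
            failed.append(src)
    return failed


# Example call (needs Ghostscript installed and on PATH):
# failures = repair_all("pdfs", "pdfs_repaired", gs_executable="gswin64c")
```

Writing the repaired copies to a separate folder keeps the originals untouched, and the returned list makes it easy to inspect the files Ghostscript could not process before running pdfplumber on the rest.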
-
I'm working through extracting tables (more explicitly: "rectangular areas of the page without borders") from a few thousand PDFs. Each PDF structure is the same with respect to the x-axis breaks, but y-axis breaks are variable. All said and done, I'm looking for a table with 4 columns and each row split by the grey lines in the PDF.
extract_testing.pdf
I included my visualization of the end goal:
- extract_table() with default settings, but no tables were found (i.e. list = []). I believe it's because the gray lines are not continuous across the page; they have breaks in them.
- Setting the table_settings strategy to "text", but this caused 15 vertical breaks, splitting text mid-word.
- page.lines, page.find_tables() and page.rects returning nothing.

I haven't successfully hooked up ImageMagick to Spyder even after assigning PATH environment variables, so this has been extensive semi-blind trial and error so far. I believe x-axis breaks should occur around 208, 361, and 474. The y-axis breaks are variable in each PDF, so I don't think any hardcoded breaks make sense.
Questions:
- explicit_vertical_lines of the table settings?
- extract_words() function for finding individual words, is there any way to find y-coordinates for a specific phrase (e.g. "Physician Visit with Doctor")?