Multiple Tables of banded shaded rows with varying number of lines in row #884
Replies: 2 comments 5 replies
-
Hi @ramakrse, and interesting example. I'd suggest something like the following:
-
Thanks for your input. I have progressed with your suggestion.
I am looking for help with how to handle the code below, based on the debug visualization:

import pdfplumber

# Load the PDF file with pdfplumber
plumber_file = pdfplumber.open(pdf_file)
pdf_page = plumber_file.pages[29 - 1]  # page 29 (the pages list is zero-indexed)
rect_info = pdf_page.rects
rectangles_bbox = []
rectangles_info = []
table_header_color = (0.5, 0.25, 0, 0.1)
for idx, rect in enumerate(rect_info):
    print("index: {} Bbox: {}".format(idx, rect))
    # Check whether the rect is a table header or footer band
    if rect['non_stroking_color'] == table_header_color:
        # Get the bounding box of the rectangle
        temp_rect_bb = (rect['x0'], rect['top'], rect['x1'], rect['bottom'])
        # Skip rects that are duplicated in .rects
        if rect not in rectangles_info:
            rectangles_info.append(rect)
        # Skip bounding boxes that are already recorded
        if temp_rect_bb not in rectangles_bbox:
            rectangles_bbox.append(temp_rect_bb)
print('Rectangle Info Count: {}'.format(len(rect_info)))
print('Unique Rectangle Info Count: {}\n'.format(len(rectangles_info)))
# Get the table bounding boxes
table_reactangle_bbox = []
n_idx = len(rectangles_bbox)
if (n_idx % 2) == 0:  # An even count means each header band has a matching footer band
    for idx in range(0, n_idx, 2):
        # Span from the header's top-left corner to the footer's bottom-right corner
        temp_rect = (rectangles_bbox[idx][0], rectangles_bbox[idx][1],
                     rectangles_bbox[idx + 1][2], rectangles_bbox[idx + 1][3])
        table_reactangle_bbox.append(temp_rect)
else:
    raise Exception('Number of identified rectangles ({}) is not even'.format(n_idx))
print('Table Bounding Box')
for idx, rect in enumerate(table_reactangle_bbox):
    print('{}: Table Bounding Box: {}'.format(idx, rect))
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}
table_idx = 0  #1
pdf_page_filtered = pdf_page.crop(table_reactangle_bbox[table_idx])
# extract_tables() returns a list of tables; extract_table() returns only one table (or None)
tables = pdf_page_filtered.extract_tables(ts)
if len(tables) > 0:
    print('Extracted Table Information')
    for idx, table in enumerate(tables):
        print('Table: {} and {}'.format(idx, table))
im = pdf_page_filtered.to_image()
im.debug_tablefinder(ts)

Output #1 - Table 1: Table Bounding Box
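The header/footer pairing step above depends on `.rects` listing the bands in top-to-bottom order, which is not guaranteed. Below is a minimal sketch of the same idea that sorts the bands by `top` before pairing them. The function names `rect_bbox` and `table_bboxes_from_bands` are hypothetical helpers, not part of pdfplumber, and the sketch assumes each table has exactly one header band and one footer band in the given color.

```python
def rect_bbox(rect):
    """Return (x0, top, x1, bottom) for a pdfplumber-style rect dict."""
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def table_bboxes_from_bands(rects, band_color):
    """Pair header/footer bands into table bounding boxes.

    Assumption: every table has exactly one header band and one footer
    band filled with `band_color`, and tables do not overlap vertically.
    """
    band_color = tuple(band_color)
    bands = []
    for rect in rects:
        color = rect.get("non_stroking_color")
        # The color may be stored as a list or a tuple; normalize before comparing.
        if not isinstance(color, (list, tuple)) or tuple(color) != band_color:
            continue
        bbox = rect_bbox(rect)
        if bbox not in bands:  # drop duplicated rects
            bands.append(bbox)

    # Sort top-to-bottom so that consecutive bands are header, footer, header, ...
    bands.sort(key=lambda b: b[1])
    if len(bands) % 2 != 0:
        raise ValueError("Expected an even number of bands, got {}".format(len(bands)))

    tables = []
    for header, footer in zip(bands[0::2], bands[1::2]):
        tables.append((
            min(header[0], footer[0]),  # leftmost x0
            header[1],                  # top of the header band
            max(header[2], footer[2]),  # rightmost x1
            footer[3],                  # bottom of the footer band
        ))
    return tables
```

Each returned tuple can be passed to `pdf_page.crop(...)` in the same way as the entries of `table_reactangle_bbox` above.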
-
Describe the bug
The PDF has multiple tables across the document. The tables have shaded (banded) rows, and the number of text lines per row varies.
Code to reproduce the problem
# Load the PDF file with pdfplumber
plumber_file = pdfplumber.open(pdf_file)
pdf_page = plumber_file.pages[29-1]  #127 #67
im = pdf_page.to_image()
# Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 32
}
im.debug_tablefinder(ts)
PDF file
Using this publicly available PDF:
https://www.mtu-solutions.com/content/dam/mtu/technical-information/operating-instructions/diesel/mtu-series-1600/marine/MS15029_01E.pdf/_jcr_content/renditions/original./MS15029_01E.pdf
Expected behavior
Each table on a page should be identified separately. The page used here contains two tables.
Actual behavior
Increasing intersection_tolerance to handle rows with more lines makes pdfplumber detect a single table: the space between the two tables is treated as a row, so the two tables are not detected separately.
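One possible workaround is to group the shaded rectangles into separate regions before running the table finder, so that the space between tables never becomes a row. The sketch below is plain Python over `(x0, top, x1, bottom)` tuples; `split_by_vertical_gap` and its `max_gap` threshold are hypothetical and would need tuning per document. It assumes the vertical gap between two tables is larger than any gap between shaded bands inside a single table, which may not hold for every page.

```python
def split_by_vertical_gap(bboxes, max_gap):
    """Merge (x0, top, x1, bottom) boxes into regions separated by large gaps.

    Boxes are processed top-to-bottom. A box whose top lies more than
    `max_gap` below the current region's bottom starts a new region.
    """
    regions = []
    for bbox in sorted(bboxes, key=lambda b: (b[1], b[0])):
        x0, top, x1, bottom = bbox
        if regions and top - regions[-1][3] <= max_gap:
            # Close enough to the current region: grow it to include this box.
            rx0, rtop, rx1, rbottom = regions[-1]
            regions[-1] = (min(rx0, x0), rtop, max(rx1, x1), max(rbottom, bottom))
        else:
            # Gap is too large (or this is the first box): start a new region.
            regions.append((x0, top, x1, bottom))
    return regions
```

Each resulting region could then be passed to `pdf_page.crop(...)` and searched on its own, with a smaller `intersection_tolerance` than the page-wide value of 32.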
Screenshots
Environment
- Colab notebook