You can use page.find_tables() and look at the rows and cells of the table it returns:
>>> table = page.find_tables()[0]
>>> rows = table.rows
>>> rows
[<pdfplumber.table.Row at 0x1437ab400>,
<pdfplumber.table.Row at 0x1375c0a00>,
<pdfplumber.table.Row at 0x137216590>,
<pdfplumber.table.Row at 0x137214e50>,
<pdfplumber.table.Row at 0x137214430>,
<pdfplumber.table.Row at 0x137216200>]
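Drawing the first two cells of row 1 shows the difference in height (im is assumed to be a page image, e.g. from page.to_image()):
>>> im.reset().draw_rect(rows[1].cells[0], stroke_width=5)
>>> im.reset().draw_rect(rows[1].cells[1], stroke_width=5)
# type rowspan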
>>> (rows[0].cells[0][-1] - rows[0].cells[0][1]) / (rows[0].cells[1][-1] - rows[0].cells[1][1])
1.0
# poc rowspan
>>> (rows[1].cells[0][-1] - rows[1].cells[0][1]) / (rows[1].cells[1][-1] - rows[1].cells[1][1])
3.0103686635944618

Not sure if pdfplumber attempts to use this information or not.

Update: Perhaps you could do something like this:

import pdfplumber

pdf = pdfplumber.open("Downloads/test_span.pdf")
page = pdf.pages[0]
table = page.find_tables()[0]

# size of smallest col and row for reference
# (each cell is a bounding box: x0, top, x1, bottom)
col_unit = min(int(cell[2] - cell[0]) for cell in table.cells if cell)
row_unit = min(int(cell[3] - cell[1]) for cell in table.cells if cell)

cells = {}

# Process in reverse order so we can modify:
# a None placeholder is written first and then overwritten
# when the spanning cell above or to its left is processed
for row_nr in range(len(table.rows) - 1, -1, -1):
    row = table.rows[row_nr]
    for col_nr in range(len(row.cells) - 1, -1, -1):
        cell = row.cells[col_nr]
        text = None
        if cell is not None:
            colspan = int(cell[2] - cell[0]) // col_unit
            rowspan = int(cell[3] - cell[1]) // row_unit
            text = page.crop(cell).extract_text()
            # forward_fill column
            for new_col in range(colspan):
                cells[row_nr, col_nr + new_col] = text
            # forward_fill row
            for new_row in range(rowspan):
                cells[row_nr + new_row, col_nr] = text
        cells[row_nr, col_nr] = text

num_rows = range(len(table.rows))
num_cols = range(len(table.rows[0].cells))

# print the filled grid row by row
for row_nr in num_rows:
    row = [cells[row_nr, col_nr] for col_nr in num_cols]
    print(row)
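
The snippet above assumes every span is a whole multiple of the smallest cell, which may not hold when columns have different widths. As a rough alternative sketch (not something pdfplumber does itself, as far as I know), you could collect the distinct cell edges and count how many grid intervals each bounding box covers. The helper names `_edges`, `_nearest` and `cell_spans` and the `tol` parameter are made up for illustration; the only assumption about pdfplumber is that each cell is an `(x0, top, x1, bottom)` tuple, as in the code above.

```python
def _edges(values, tol=1.0):
    """Merge nearly equal coordinates into a sorted list of grid edges."""
    edges = []
    for v in sorted(values):
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def _nearest(edges, value):
    """Index of the grid edge closest to value."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def cell_spans(bboxes, tol=1.0):
    """Map each (x0, top, x1, bottom) bbox to (row, col, rowspan, colspan)."""
    xs = _edges([b[0] for b in bboxes] + [b[2] for b in bboxes], tol)
    ys = _edges([b[1] for b in bboxes] + [b[3] for b in bboxes], tol)
    spans = {}
    for b in bboxes:
        col, col_end = _nearest(xs, b[0]), _nearest(xs, b[2])
        row, row_end = _nearest(ys, b[1]), _nearest(ys, b[3])
        spans[tuple(b)] = (row, col, row_end - row, col_end - col)
    return spans


# Toy layout (hypothetical coordinates): a header spanning two columns
# and a first-column cell spanning two rows.
boxes = [
    (0, 0, 100, 10),    # header, colspan 2
    (0, 10, 30, 30),    # tall cell, rowspan 2
    (30, 10, 100, 20),
    (30, 20, 100, 30),
]
spans = cell_spans(boxes)
```

With a real page you would presumably pass `[c for c in table.cells if c]` instead of the toy `boxes` list.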
Describe the bug
test_span.pdf
In the table, columns 1 to 4 of row 0 are merged into a single cell. When the table is extracted, the text ends up in column 1 and columns 2, 3 and 4 are None:
[['type', 'cost', None, None, None, 'cost', None, None, None],
['poc', 'before', None, None, None, 'after', None, None, None],
[None,
'deploy',
'dev',
'test',
'support',
'deploy',
'dev',
'test',
'support'],
[None, '100', '200', '50', '50', '30', '200', '40', '20'],
['uat', '200', '400', '555', '666', '201', '401', '557', '668'],
['prod', '300', '600', '700', '900', '301', '601', '701', '901']]
Can pdfplumber produce output like HTML, which shows the span relationships (rowspan/colspan) between cells?
The HTML code is:
Code to reproduce the problem
PDF file
test_span.pdf
Expected behavior
pdfplumber should produce output that shows the merged-cell relationships the way the HTML does.
Actual behavior
The merged columns are filled with None instead.
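
To make the expected output concrete, here is a minimal sketch of what "output like HTML" could look like once the spans are known. It is not a pdfplumber feature; `spans_to_html` is a hypothetical helper, and the `(row, col, rowspan, colspan, text)` tuples are an assumed input format that would have to be derived separately, for example from the cell bounding boxes.

```python
from html import escape


def spans_to_html(cells, num_rows):
    """Render (row, col, rowspan, colspan, text) tuples as an HTML table."""
    rows = [[] for _ in range(num_rows)]
    # Emit cells in reading order: top to bottom, left to right.
    for row, col, rowspan, colspan, text in sorted(cells, key=lambda c: (c[0], c[1])):
        attrs = ""
        if rowspan > 1:
            attrs += f' rowspan="{rowspan}"'
        if colspan > 1:
            attrs += f' colspan="{colspan}"'
        rows[row].append(f"<td{attrs}>{escape(text or '')}</td>")
    body = "\n".join(f"  <tr>{''.join(r)}</tr>" for r in rows)
    return f"<table>\n{body}\n</table>"


# First header row of the table above: 'type', then two merged 'cost' cells.
header = [(0, 0, 1, 1, "type"), (0, 1, 1, 4, "cost"), (0, 5, 1, 4, "cost")]
html_out = spans_to_html(header, num_rows=1)
print(html_out)
```

Only the anchor cell of each merged region is emitted, so the None placeholders from the extracted list do not appear in the markup.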