how to get colspan or rowspan info in the table? #929

tujinshu · 2023-07-06T11:46:17Z

Describe the bug

test_span.pdf

[Image: WeCom screenshot b154b9f9-2d3f-4d7e-9eea-570428f8fb15]

In the table, columns 1 to 4 of row 0 are merged into a single cell at row 0, column 1. When the table is extracted, columns 2, 3 and 4 come back as None:
[['type', 'cost', None, None, None, 'cost', None, None, None],
['poc', 'before', None, None, None, 'after', None, None, None],
[None, 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support'],
[None, '100', '200', '50', '50', '30', '200', '40', '20'],
['uat', '200', '400', '555', '666', '201', '401', '557', '668'],
['prod', '300', '600', '700', '900', '301', '601', '701', '901']]

Can pdfplumber produce HTML-like output that shows the relationship between merged cells?

type	cost				cost
poc	before				after
	deploy	dev	test	support	deploy	dev	test	support

The HTML code is:

<html>
<head>
</head>
<table border="1 " width="900">
    <tr>
        <td width="11% "> type </td>
        <td width="11% " colspan="4"> cost </td>
        <td width="11% " colspan="4"> cost </td>
    </tr>
    <tr>
        <td width="11% " rowspan="2" > poc </td>
        <td width="11% " colspan="4"> before </td>
        <td width="11% " colspan="4"> after </td>
    </tr>
    <tr>
        <td width="11% "> deploy </td>
        <td width="11% "> dev </td>
        <td width="11% "> test </td>
        <td width="11% "> support </td>
        <td width="11% "> deploy </td>
        <td width="11% "> dev </td>
        <td width="11% "> test </td>
        <td width="11% "> support </td>
    </tr>
</table>
</html>

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open("./data/test_span.pdf")
p0 = pdf.pages[0]
# keep_visible_lines is a user-defined filter function (not shown in the original post)
p0 = p0.filter(keep_visible_lines)
im = p0.to_image()
im.debug_tablefinder()
tables = p0.find_tables()
tables[0].extract()

PDF file

test_span.pdf

Expected behavior

pdfplumber produces HTML-like output that shows the relationship between merged cells, as in the HTML above.

Actual behavior

The merged columns are filled with None instead.

Environment

pdfplumber version: [0.9]
Python version: [3.8.1]
OS: [Linux]

cmdlineluser · 2023-07-07T02:55:04Z

I know that you can use .find_tables() to get the table objects.

You can look at their .rows, .cells, etc.

>>> table = page.find_tables()[0]
>>> rows = table.rows
>>> rows
[<pdfplumber.table.Row at 0x1437ab400>,
 <pdfplumber.table.Row at 0x1375c0a00>,
 <pdfplumber.table.Row at 0x137216590>,
 <pdfplumber.table.Row at 0x137214e50>,
 <pdfplumber.table.Row at 0x137214430>,
 <pdfplumber.table.Row at 0x137216200>]

im.reset().draw_rect(rows[1].cells[0], stroke_width=5)

im.reset().draw_rect(rows[1].cells[1], stroke_width=5)

# rowspan of 'type': height of cell 0 divided by height of cell 1 in the same row
>>> (rows[0].cells[0][-1] - rows[0].cells[0][1]) / (rows[0].cells[1][-1] - rows[0].cells[1][1])
1.0
# rowspan of 'poc': same ratio for row 1
>>> (rows[1].cells[0][-1] - rows[1].cells[0][1]) / (rows[1].cells[1][-1] - rows[1].cells[1][1])
3.0103686635944618

Not sure if pdfplumber attempts to use this information or not.
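
The ratio check above could be wrapped in a small helper. This is only a sketch: `span_ratio` is a hypothetical name, not part of pdfplumber, and it assumes each cell is an `(x0, top, x1, bottom)` tuple like the ones in `rows[i].cells`. The sample boxes are made-up numbers, not values from test_span.pdf.

```python
def span_ratio(cell, reference, axis):
    """Return how many `reference` cells fit along one axis of `cell`.

    Both arguments are (x0, top, x1, bottom) tuples.
    axis=0 compares widths (colspan), axis=1 compares heights (rowspan).
    """
    size = cell[axis + 2] - cell[axis]
    unit = reference[axis + 2] - reference[axis]
    # Rounding absorbs small differences caused by line thickness.
    return round(size / unit)


# Made-up bounding boxes in the same shape as pdfplumber's cell tuples.
single = (0.0, 0.0, 50.0, 26.0)
tall = (0.0, 26.0, 50.0, 104.3)  # roughly three rows high

print(span_ratio(tall, single, axis=1))
```

Rounding instead of truncating means a ratio such as 3.01 or 2.99 both map to 3, which plain integer division would not guarantee.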

Update: Perhaps you could do something like this:

import pdfplumber

pdf = pdfplumber.open("Downloads/test_span.pdf")
page = pdf.pages[0]

table = page.find_tables()[0]

# size of smallest col and row for reference
col_unit = min(int(cell[2] - cell[0]) for cell in table.cells if cell)
row_unit = min(int(cell[3] - cell[1]) for cell in table.cells if cell)

cells = {}
# Process in reverse order so we can modify
for row_nr in range(len(table.rows) - 1, -1, -1):
    row = table.rows[row_nr]

    for col_nr in range(len(row.cells) - 1, -1, -1):
        cell = row.cells[col_nr]

        text = None

        if cell is not None:
            colspan = int(cell[2] - cell[0]) // col_unit
            rowspan = int(cell[3] - cell[1]) // row_unit

            text = page.crop(cell).extract_text()

            # forward_fill column
            for new_col in range(colspan):
                cells[row_nr, col_nr + new_col] = text

            # forward_fill row
            for new_row in range(rowspan):
                cells[row_nr + new_row, col_nr] = text

        cells[row_nr, col_nr] = text


num_rows = range(len(table.rows))
num_cols = range(len(table.rows[0].cells))

for row_nr in num_rows:
    row = [cells[row_nr, col_nr] for col_nr in num_cols]
    print(row)

['type', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost']
['poc', 'before', 'before', 'before', 'before', 'after', 'after', 'after', 'after']
['poc', 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support']
['poc', '100', '200', '50', '50', '30', '200', '40', '20']
['uat', '200', '400', '555', '666', '201', '401', '557', '668']
['prod', '300', '600', '700', '900', '301', '601', '701', '901']

3 replies

jsvine Jul 7, 2023
Maintainer

Thank you, @cmdlineluser! I think this is a great response. This is one of the reasons why pdfplumber does expose the cell bounding boxes. I've been meaning to add options/features for representing tables in more nuanced ways that would, in particular, help with tables like these.

@tujinshu, your suggestion for HTML-like output is very interesting, and certainly worth considering. It could be converted to something more Python-native, perhaps:

[
  [('type', 1, 1), ('cost', 4, 1), ('cost', 4, 1)],
  [('poc', 1, 3), ('before', 4, 1), ('after', 4, 1)],
  [('deploy', 1, 1), ('dev', 1, 1), ('test', 1, 1), ('support', 1, 1), ('deploy', 1, 1), ('dev', 1, 1), ('test', 1, 1), ('support', 1, 1)],
  # ...
]

Any other suggestions for this representation? I'd also been considering something more nested, but haven't quite figured out what that'd look like.
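
One possible way to build the `(text, colspan, rowspan)` rows sketched above from cell bounding boxes is shown below. It is not an existing pdfplumber feature: `merge_edges`, `nearest` and `span_table` are hypothetical helpers, and the code assumes cells are `(x0, top, x1, bottom)` tuples whose edges line up on a shared grid (within `tol` points). Text lookup is passed in as a function so the sketch runs without a PDF.

```python
from collections import defaultdict


def merge_edges(values, tol=1.0):
    """Sort coordinates and merge those closer than `tol` into one grid line."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def nearest(edges, value):
    """Index of the grid line closest to `value`."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def span_table(cells, get_text, tol=1.0):
    """Group cells into rows of (text, colspan, rowspan) tuples."""
    cells = [c for c in cells if c is not None]
    xs = merge_edges([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = merge_edges([c[1] for c in cells] + [c[3] for c in cells], tol)
    rows = defaultdict(list)
    for c in cells:
        col, row = nearest(xs, c[0]), nearest(ys, c[1])
        colspan = nearest(xs, c[2]) - col  # grid columns crossed
        rowspan = nearest(ys, c[3]) - row  # grid rows crossed
        rows[row].append((col, get_text(c), colspan, rowspan))
    return [
        [(text, cs, rs) for _, text, cs, rs in sorted(rows[r], key=lambda t: t[0])]
        for r in sorted(rows)
    ]


# Made-up 3x3 grid: 'B' spans two columns, 'C' spans two rows.
demo = {
    (0, 0, 10, 10): "A", (10, 0, 30, 10): "B",
    (0, 10, 10, 30): "C", (10, 10, 20, 20): "d", (20, 10, 30, 20): "e",
    (10, 20, 20, 30): "f", (20, 20, 30, 30): "g",
}
for row in span_table(list(demo), demo.get):
    print(row)
```

With pdfplumber, the same function could presumably be called as `span_table(table.cells, lambda bbox: page.crop(bbox).extract_text())`. Counting grid lines rather than dividing by the smallest cell size avoids errors when columns have different widths.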

tujinshu Jul 11, 2023
Author

pdfplumber could provide output like this as a solution:
[
    {
        'cell': [71.18399999999998, 149.05999999999995, 121.45999499999998, 175.2199999999999],
        'text': 'type'
    },
    {
        'cell': [71.18399999999998, 175.2199999999999, 121.45999499999998, 253.60999999999993],
        'text': 'poc'
    },
    {
        'cell': [71.18399999999998, 253.60999999999993, 121.45999499999998, 279.65000000000003],
        'text': 'uat'
    },
    {
        'cell': [71.18399999999998, 279.65000000000003, 121.45999499999998, 305.80999999999995],
        'text': 'prod'
    },
    # ...
]

Users who care about colspan and rowspan can get more information from this: they can use the absolute positions directly or calculate the spans from them. Users who only care about the text can keep using the original output, or ignore the extra information in this one.
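
A list in this shape can already be assembled from the cell bounding boxes that pdfplumber exposes. The sketch below is an assumption about how that might look: `cells_with_text` and `texts_only` are hypothetical helpers, and the text lookup is injected so the example runs without a PDF. The two sample boxes reuse the coordinates quoted above, rounded.

```python
def cells_with_text(cells, get_text):
    """Pair every non-empty cell bounding box with its text."""
    return [
        {"cell": list(bbox), "text": get_text(bbox)}
        for bbox in cells
        if bbox is not None
    ]


def texts_only(records):
    """For callers who do not need positions: keep just the text."""
    return [record["text"] for record in records]


# Stand-in for real extraction; keys are (x0, top, x1, bottom) tuples.
sample = {
    (71.184, 149.06, 121.46, 175.22): "type",
    (71.184, 175.22, 121.46, 253.61): "poc",
}
records = cells_with_text(list(sample) + [None], sample.get)
print(records)
print(texts_only(records))

# With pdfplumber, the call might look like this:
# table = page.find_tables()[0]
# records = cells_with_text(
#     table.cells, lambda bbox: page.crop(bbox).extract_text()
# )
```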

gulfamhussain80 Nov 14, 2024

Were you able to write a script to reconstruct the table in HTML?
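
No such script appears in this thread. As a rough sketch only, rows in the `(text, colspan, rowspan)` form suggested by the maintainer could be turned into HTML as follows. `to_html` is a hypothetical helper, not a pdfplumber function, and it assumes every spanned position is simply omitted from later rows, as HTML tables expect.

```python
from html import escape


def to_html(span_rows):
    """Render rows of (text, colspan, rowspan) tuples as an HTML table."""
    lines = ['<table border="1">']
    for row in span_rows:
        lines.append("  <tr>")
        for text, colspan, rowspan in row:
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            # Empty cells may have text=None; render them as empty <td>.
            lines.append(f"    <td{attrs}>{escape(text or '')}</td>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


# First two rows of the representation proposed in this thread.
span_rows = [
    [("type", 1, 1), ("cost", 4, 1), ("cost", 4, 1)],
    [("poc", 1, 3), ("before", 4, 1), ("after", 4, 1)],
]
html_table = to_html(span_rows)
print(html_table)
```

Attributes are only written when a span is greater than 1, which keeps the markup close to the hand-written HTML in the original post.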
