Duplicate value for merged cell instead of `None` #420

tungph · 2021-04-22T08:34:45Z

Hi there,
Firstly, I want to thank you for creating this awesome pdfplumber project. I have been using it a lot lately.
Secondly, I'm facing a problem and want to propose change to the output of the table module regarding the utility extract_tables with merged cells.
As it is now, only the first row of the merged cell has value, the remaining cells get None.

Header 1, Header 2
Merged cell 1, Cell 2-1
None, Cell 2-2

It would be more semantically correct to have the value in merged cell duplicated like this

Header 1, Header 2
Merged cell 1, Cell 2-1
Merged cell 1, Cell 2-2

I'm preparing a PR, please consider it.
Thank you!

Update 1

Here's the PR: #422

Approach: finding the centroid of each "cell", and then get the cells from the centroids for each row.

Update 2:

I've encountered this case:

Expected output is:

1-1, 2-1
None, 2-2
None, 2-3

None means not a cell there, and it makes the number of elements is even between rows which is convenient for further processing.
Note: the PR is updated

The text was updated successfully, but these errors were encountered:

jsvine · 2021-04-26T13:49:42Z

Hi @tungph, thanks for your interest in this library and thank you for the PR. I agree that pdfplumber's current approach to complex tables is not optimal, but it is the result of not wanting to over-impose assumptions about any given table. The handling of merged cells in tables is a tricky topic, and one with a lot of edge cases — what may be a good solution for one set of tables may be a poor one for others. So I'll need to consider this suggestion and PR somewhat carefully, and think about (and test) other PDFs.

Relatedly: I have an idea for providing a richer representation of complex tables — one that would make the structure of the tables much clearer — but it would be a substantial change and so it will probably have to wait for v0.6.0.

tungph · 2021-04-29T04:46:58Z

Thank you for the consideration, @jsvine. Please let me know if you found bug with my PR. I'm happy to provide fix.

Safrone · 2024-03-20T19:17:59Z

I would love to have this as an option when extracting tables even if it is not the default behavior.

tungph added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Apr 22, 2021

tungph mentioned this issue Apr 22, 2021

Duplicate value for merged cell instead of None #422

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate value for merged cell instead of `None` #420

Duplicate value for merged cell instead of `None` #420

tungph commented Apr 22, 2021 •

edited

Loading

jsvine commented Apr 26, 2021

tungph commented Apr 29, 2021

Safrone commented Mar 20, 2024

Duplicate value for merged cell instead of None #420

Duplicate value for merged cell instead of None #420

Comments

tungph commented Apr 22, 2021 • edited Loading

Update 1

Update 2:

jsvine commented Apr 26, 2021

tungph commented Apr 29, 2021

Safrone commented Mar 20, 2024

Duplicate value for merged cell instead of `None` #420

Duplicate value for merged cell instead of `None` #420

tungph commented Apr 22, 2021 •

edited

Loading