Extract table row splitted across multiple pages #768

jsanjay63 · 2022-11-26T08:46:47Z

Hello, I have used this library before and am really amazed at how "easy" it makes extracting data.
Recently, I came across a PDF in which a table row (see the attached image) is split across multiple pages, with a page break in between. I am trying to extract the tabular data from this PDF into a CSV using pdfplumber, but the split row comes out as separate rows in the CSV. I would like to get this data in a single row. I know I could merge the rows with some post-processing, but I need a more generic solution that also works when rows are not split. With pdfplumber, is there a way to tell whether a row has a horizontal border? If that information is available, I could merge rows only when the border is missing and skip merging otherwise.

In the attached image, the cell contents are color-coded in grey.

samkit-jain · 2022-12-01T12:04:55Z

Collaborator

Hi @jsanjay63 Appreciate your interest in the library. To handle these cases, you would need to write custom logic. You can do so by checking for any line/rect objects at the bottom of the last row on a page and, if there are none, merging that row with the row that follows it on the next page. Or, if you know the type of data you are dealing with, you can use that as well. For example, combining rows until column A has a value, or similar.
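
As a rough illustration of both suggestions, here is a minimal Python sketch. It is not part of pdfplumber: the helper names (`has_bottom_border`, `bottom_table_is_closed`, `merge_continuation_rows`) and the 2-point tolerance are assumptions made for this example. It relies on `page.lines`, `page.rects` and `page.find_tables()` from pdfplumber, and it assumes the bottom border, when present, is drawn as a single line or rect side spanning the full table width. A PDF that draws borders cell by cell, or uses filled rects as cell backgrounds, would need a different check.

```python
def has_bottom_border(row_bottom, x0, x1, edges, tol=2.0):
    """Return True if a horizontal edge runs along `row_bottom` from x0 to x1.

    `edges` is a list of dicts with "x0", "x1", "top" and "bottom" keys,
    which is the layout of the entries in page.lines and page.rects.
    """
    for e in edges:
        # Skip edges that do not cover the full width (e.g. vertical lines).
        if not (e["x0"] <= x0 + tol and e["x1"] >= x1 - tol):
            continue
        # A thin line has top ~= bottom; for a rect, either side may match.
        if abs(e["top"] - row_bottom) <= tol or abs(e["bottom"] - row_bottom) <= tol:
            return True
    return False


def bottom_table_is_closed(page, tol=2.0):
    """Check whether the lowest table on a pdfplumber page ends with a border."""
    tables = page.find_tables()
    if not tables:
        return True  # nothing to merge
    x0, _top, x1, bottom = max(tables, key=lambda t: t.bbox[3]).bbox
    return has_bottom_border(bottom, x0, x1, page.lines + page.rects, tol)


def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold every row whose key column is empty into the row before it."""
    merged = []
    for row in rows:
        cells = [(c or "").strip() for c in row]
        if merged and not cells[key_col]:
            prev = merged[-1]
            merged[-1] = [sep.join(p for p in (a, b) if p) for a, b in zip(prev, cells)]
        else:
            merged.append(cells)
    return merged
```

If `bottom_table_is_closed(page)` returns `False`, the last extracted row of that page and the first row of the next page are candidates for merging. `merge_continuation_rows` covers the data-driven alternative, treating any row with an empty "column A" (`key_col=0`) as the continuation of the previous row.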

10 replies

samkit-jain Dec 12, 2022
Collaborator

You can also find all the different table settings with explanation at https://github.com/jsvine/pdfplumber#table-extraction-settings
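
For orientation, a settings dictionary for line-based table detection might look like the sketch below. The keys and values are taken from the table-extraction settings documented in the pdfplumber README. The variable name `table_settings` and the choice to spell the values out are assumptions for illustration, and `page` in the commented call stands for an already opened pdfplumber page.

```python
# Line-based detection: cell borders come from the page's drawn lines
# (including the sides of rect objects), not from text alignment.
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,          # snap nearly aligned lines together
    "join_tolerance": 3,          # join line segments that nearly touch
    "intersection_tolerance": 3,  # slack when matching line crossings
}

# Usage, assuming `page` is a pdfplumber page:
# rows = page.extract_table(table_settings=table_settings)
```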

akshatmittal2223 Dec 30, 2024

Hi @samkit-jain, the post-processing logic you describe relies on the type of data in a column. If there is no such key on the basis of which the tables can be merged, is there another approach? Please let me know if there is anything available that would let me merge tables that span multiple pages.

samkit-jain Dec 30, 2024
Collaborator

Hi @akshatmittal2223 Could you please share the PDF and describe what you have tried so far? It is hard to assess without seeing the PDF. Please remove any sensitive information from it before sharing.

akshatmittal2223 Dec 31, 2024

@samkit-jain I have attached a screenshot from the PDF for reference. In my case, the columns are not fixed, and a single PDF contains almost 15 tables, each with different headers. I have also tried the PyMuPDF library for extracting the data.

samkit-jain Jan 2, 2025
Collaborator

Post-processing logic that you can try:

- Check that there is no text between the last table on the current page and the first table on the next page.
- Check that the coordinates of the vertical line separators are exactly the same in both tables.
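
The two checks above could be sketched in Python roughly as follows. All helper names here (`column_edges`, `same_columns`, `no_text_between`, `is_continuation`) are hypothetical and not part of pdfplumber. The sketch uses `page.find_tables()`, `page.extract_words()`, and the `bbox` and `cells` attributes of the returned table objects. It compares coordinates with a small tolerance rather than exact equality, which is an assumption on my part, and a running footer or page number would count as "text between the tables", so such regions may need to be cropped away first.

```python
def column_edges(cells, tol=0.5):
    """Sorted x-positions of a table's vertical separators.

    `cells` is a list of (x0, top, x1, bottom) boxes, as in Table.cells.
    Positions closer together than `tol` are treated as one separator.
    """
    xs = sorted(x for x0, _top, x1, _bottom in cells for x in (x0, x1))
    edges = []
    for x in xs:
        if not edges or x - edges[-1] > tol:
            edges.append(x)
    return edges


def same_columns(xs_a, xs_b, tol=0.5):
    """True if both tables have the same separators, within `tol`."""
    return len(xs_a) == len(xs_b) and all(
        abs(a - b) <= tol for a, b in zip(xs_a, xs_b)
    )


def no_text_between(prev_bbox, prev_words, next_bbox, next_words):
    """True if no word sits below the previous table or above the next one."""
    below_prev = any(w["top"] >= prev_bbox[3] for w in prev_words)
    above_next = any(w["bottom"] <= next_bbox[1] for w in next_words)
    return not (below_prev or above_next)


def is_continuation(prev_page, next_page, tol=0.5):
    """Guess whether the first table on next_page continues the last on prev_page."""
    prev_tables, next_tables = prev_page.find_tables(), next_page.find_tables()
    if not prev_tables or not next_tables:
        return False
    last = max(prev_tables, key=lambda t: t.bbox[3])   # lowest table on the page
    first = min(next_tables, key=lambda t: t.bbox[1])  # highest table on the page
    return no_text_between(
        last.bbox, prev_page.extract_words(), first.bbox, next_page.extract_words()
    ) and same_columns(
        column_edges(last.cells, tol), column_edges(first.cells, tol), tol
    )
```

Because neither check looks at the header text or at a key column, this kind of test may suit PDFs where the columns are not fixed and every table has different headers.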

kathimohan · 2022-12-08T11:13:24Z

kathimohan
Dec 8, 2022

@samkit-jain Is it possible to combine two PDF pages into one (the opposite of the page.crop functionality) and then extract tables from the combined page?

1 reply

samkit-jain Dec 12, 2022
Collaborator

I have never tried it and don't think it is possible out of the box. You can experiment and see if it is something that is doable or not. If it is, don't forget to raise a PR for it to be added to the FAQ section as it can be helpful for others as well.
