Extract table row splitted across multiple pages #768
jsanjay63
started this conversation in
Ask for help with specific PDFs
Hi @jsanjay63, I appreciate your interest in the library. To handle these cases, you would need to write custom logic. One option is to check for any line/rect objects at the bottom of the last row on a page and, if there are none, merge the two rows on either side of the page break. Alternatively, if you know the type of data you are dealing with, you can use that instead: for example, combining rows until column A has a value, or similar.
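A minimal sketch of both ideas follows. The helper names `has_closing_border` and `merge_continuation_rows` are hypothetical and not part of pdfplumber, and the 2-point tolerance is a guess. The sketch assumes that `edges` holds dicts shaped like the entries of `page.lines` and `page.rects` (keys `x0`, `x1`, `top`, `bottom`) and that `rows` is a list of lists of strings or `None`, as `page.extract_table()` returns.

```python
def has_closing_border(edges, row_bottom, tolerance=2.0):
    """Return True if a horizontal edge sits at the bottom of the last row.

    edges: dicts with "x0", "x1", "top", "bottom" (assumed to come from
    page.lines + page.rects). row_bottom: y-coordinate of the last row's
    bottom, measured from the top of the page.
    """
    for obj in edges:
        is_wide = (obj["x1"] - obj["x0"]) > tolerance  # skip vertical lines
        if is_wide and abs(obj["bottom"] - row_bottom) <= tolerance:
            return True
    return False


def merge_continuation_rows(rows, key_col=0):
    """Fold every row whose key column is empty into the row before it."""
    merged = []
    for row in rows:
        cells = [c or "" for c in row]  # extract_table() may yield None
        if merged and not cells[key_col].strip():
            prev = merged[-1]
            # Join cell by cell, skipping empty pieces.
            merged[-1] = [" ".join(p for p in (a, b) if p)
                          for a, b in zip(prev, cells)]
        else:
            merged.append(cells)
    return merged


rows = [["1", "foo", "bar"], ["", "baz", None], ["2", "x", "y"]]
result = merge_continuation_rows(rows)
```

With pdfplumber, `edges` could be `page.lines + page.rects` and `row_bottom` could be taken from the bounding box of the last table found by `page.find_tables()`, though that depends on how the PDF draws its borders. If each cell is drawn as a closed rect even at a page break, the border check will not help and the data-based merge is the safer choice.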
@samkit-jain Is it possible to join two PDF pages into one (the opposite of page.crop) and then extract tables from the combined page?
Hello, I have used this library before and I am really amazed at how "easy" it makes extracting data.
Recently, I came across a PDF in which a table row (see the attached image) is split across two pages by a page break. I am trying to extract the tabular data to a CSV using pdfplumber, but the split row comes out as two separate rows in the CSV. I would like to get it as a single row. I know I could merge the two rows with some post-processing, but I need a more generic solution that also works when rows are not split. With pdfplumber, is there a way to tell whether a row has a horizontal border or not? If that information is available, I could use it to decide whether to merge the rows or skip merging.
In the attached image, the cell contents are highlighted in grey.