Issues from cell spanning multiple rows #86
Comments
Hi Jordan,

Sorry again for the delay in getting back to you. You found some nice bugs here! Thanks!

I've figured out the source of the first problem; for better or for worse, the bug is more philosophical than technical. Tabula's "spreadsheet" extraction method uses vector lines to attempt to recreate the structure of the table, since PDFs only have lines, with no conception of tables or of relationships between lines. Sometimes there are lines that are not visible to a viewer but are present in the PDF. That's the case here: there's a white line running across cell B6 (1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR ...).

The way to fix this is to tell Tabula to ignore non-black lines. This is all built out, but isn't present in the script in bin/, since it's weird, hard to describe and hard to tune. I could probably send you a substitute file that'd include that option, or maybe add it as an undocumented feature. (@jazzido, what do you think about the options for surfacing the line_color_filter thing?) I think it's too complicated for a command-line option, or at least I don't know how to represent a range of RGB colors on the command line and describe that method in an intuitive way.

The second problem was sort of related, but is an actual technological bug. The "split" line is actually at two different y-axis locations: 114.0199966430664 for the first two (empty) cells and 114.02000427246094 for the rest. We need to round... because floating point numbers are dumb. That patch is 9b650f4.

With both changes, here's the CSV, which I think looks much better: https://gist.githubusercontent.com/jeremybmerrill/d624986d48c81fde2d29/raw/06fd9284a774f1a9175a382f012ec2dbd076373b/papua.csv
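To make the "ignore non-black lines" idea concrete, here is a minimal sketch in Python. It is not Tabula's code and not the actual line_color_filter interface: the `Ruling` class, the `is_near_black` helper, and the 0.1 threshold are all hypothetical names and values chosen for illustration. It assumes each ruling line carries a stroke color as RGB components in the range 0 to 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Ruling:
    """Hypothetical stand-in for a vector line found in a PDF page."""
    x1: float
    y1: float
    x2: float
    y2: float
    rgb: Tuple[float, float, float]  # stroke color, each component in 0..1


def is_near_black(rgb: Tuple[float, float, float], threshold: float = 0.1) -> bool:
    """True when every color component is at or below the threshold."""
    return all(component <= threshold for component in rgb)


def drop_non_black(rulings: List[Ruling], threshold: float = 0.1) -> List[Ruling]:
    """Keep only the rulings whose stroke color is close to black."""
    return [r for r in rulings if is_near_black(r.rgb, threshold)]


# A visible black border plus an invisible white line crossing a cell.
lines = [
    Ruling(0.0, 100.0, 200.0, 100.0, (0.0, 0.0, 0.0)),
    Ruling(0.0, 107.0, 200.0, 107.0, (1.0, 1.0, 1.0)),
]
visible = drop_non_black(lines)
```

The single threshold is what makes this easy to sketch and hard to expose: a real filter would probably need a range of RGB colors, which is the part that is awkward to describe on a command line.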
(needed to round y-positions in grouping cells into rows to account for floating point BS)
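As a rough illustration of why rounding helps, here is a small Python sketch. It is not the patch in 9b650f4; the `group_cells_into_rows` function, the `(top, left, text)` tuple layout, and the choice of two decimal places are assumptions made for this example. It groups cells into rows by their rounded top coordinate, so the two y-values quoted above land in the same row.

```python
from collections import defaultdict


def group_cells_into_rows(cells, ndigits=2):
    """Group (top, left, text) cells into rows keyed by the rounded top.

    Rounding absorbs tiny floating point differences between cells that
    are meant to sit on the same horizontal line.
    """
    rows = defaultdict(list)
    for top, left, text in cells:
        rows[round(top, ndigits)].append((left, text))
    # Order rows top to bottom, and cells left to right within each row.
    return [
        [text for _, text in sorted(row, key=lambda cell: cell[0])]
        for _, row in sorted(rows.items())
    ]


cells = [
    (114.0199966430664, 0.0, ""),
    (114.0199966430664, 50.0, ""),
    (114.02000427246094, 100.0, "C"),
    (114.02000427246094, 150.0, "D"),
]
rows = group_cells_into_rows(cells)

# Grouping on the raw floats instead would yield two separate rows.
raw_keys = {top for top, _, _ in cells}
```

Rounding to a fixed number of digits can still separate two values that straddle a rounding boundary, so comparing positions within a small tolerance is another option worth considering.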
Just for the record, the command ... which clearly shows what @jeremybmerrill described.
Hey @jtbates, any thoughts on how to implement the solution I mention above? Would love to get this problem solved for you.
Hello, can you please send me the solution/command for converting a PDF to CSV/Excel while ignoring non-black lines? Thanks,
I have PDFs from Indonesian election results that I am attempting to parse to CSVs. These contain spreadsheets where a cell may span multiple rows:
I used the following command with tabula-extractor:
The row-spanning cells seem to be causing a couple of problems. Output for reference:
The first problem is that for cells that span multiple rows, the text after the first line is discarded. This can be seen in the selected cell in the picture: "Tetap (DPT)" is missing. Similarly, "Tambahan (DPTb)" is missing for the next cell, and so forth.
The second problem is that the row below is sometimes split. This seems to happen once or twice but not thereafter. In this example, rows 7 and 8 should be joined. This can be seen more clearly in the CSV output (lines 7-9):
Here is the PDF I used in this example and here is the output from tabula-extractor.