Issues from cell spanning multiple rows #86

jtbates · 2014-06-24T01:21:16Z

I have PDFs from Indonesian election results that I am attempting to parse to CSVs. These contain spreadsheets where a cell may span multiple rows:

I used the following command with tabula-extractor:

$ tabula DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf -p all -r -o DD-1_-_DPR_-_9201_-_PAPUA_BARAT.csv

The row-spanning cells seem to be causing a couple of problems. Output for reference:

The first problem is that, for cells that span multiple rows, the text after the first line is discarded. This can be seen in the selected cell in the picture: Tetap (DPT) is missing. Similarly, Tambahan (DPTb) is missing for the next cell, and so forth.

The second problem is that the row below is sometimes split. This seems to happen once or twice but not thereafter. In this example, rows 7 and 8 should be joined. This can be seen more clearly in the CSV output (lines 7-9):

"",""
PR,"82,484","79,536","38,602","17,395","16,796","9,965","14,300","16,500","24,532","21,368","10,751","332,229"
"","",JML,"174,769","165,250","86,097","38,185","34,895","21,874","28,869","35,937","50,255","49,179","23,791","709,101"

Here is the PDF I used in this example and here is the output from tabula-extractor.

jeremybmerrill · 2014-08-22T18:12:17Z

Hi Jordan,

Sorry again for the delay in getting back to you. You found some nice bugs here! Thanks!

I've figured out the source of the first problem; for better or for worse, the bug is more philosophical than technical.

Tabula's "spreadsheet" extraction method uses vector lines to try to recreate the structure of the table, since PDFs only contain lines, with no concept of tables or of relationships between lines. Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6 ( 1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR
Tetap (DPT) ) at the same height as the line separating PR and JML in column 3. (And since the line crosses the Tetap (DPT) text, the text isn't included in either cell and therefore ends up ignored.)
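
To make that concrete, here is a minimal Python sketch of how a hidden ruling can cause text to be dropped. It assumes text is assigned to a cell only when its bounding box fits entirely inside the cell; that rule, the `Box` type, the helper names, and all coordinates are made up for illustration and are not taken from Tabula's code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Vertical extent of a cell or a piece of text (hypothetical units)."""
    top: float
    bottom: float


def contains(cell, text):
    # Assumed rule: text belongs to a cell only if it fits entirely inside it.
    return cell.top <= text.top and text.bottom <= cell.bottom


def split_cell(cell, ruling_ys):
    # Every horizontal ruling strictly inside the cell cuts it into sub-cells.
    inner = sorted(y for y in ruling_ys if cell.top < y < cell.bottom)
    edges = [cell.top] + inner + [cell.bottom]
    return [Box(a, b) for a, b in zip(edges, edges[1:])]


def assigned_texts(cells, texts):
    # Names of the text chunks that end up in at least one cell.
    return [name for name, box in texts.items()
            if any(contains(cell, box) for cell in cells)]


# Made-up coordinates: one tall cell holding two lines of text.
cell_b6 = Box(top=100.0, bottom=128.0)
texts = {
    "first line": Box(top=101.0, bottom=109.0),
    "Tetap (DPT)": Box(top=110.0, bottom=118.0),
}

# The invisible white line crosses the second line of text at y=114.0.
with_white_line = assigned_texts(split_cell(cell_b6, [114.0]), texts)
without_white_line = assigned_texts(split_cell(cell_b6, []), texts)
```

With the white line in place, the second line of text straddles the boundary between the two sub-cells and is assigned to neither; without it, both lines land in the single cell.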

The way to fix this is to tell Tabula to ignore non-black lines. This is all built out, but isn't present in the script in bin/ -- since it's weird, hard to describe and hard to tune. I could probably send you a substitute file that'd include that option, or maybe add it as an undocumented feature. (@jazzido, what do you think about the options for surfacing the line_color_filter thing?) I think it's too complicated for a command-line option -- or at least, I don't know how to represent a range of RGB colors on the command line and describe that method in an intuitive way.
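
As a rough illustration of the idea (not Tabula's actual line_color_filter interface), one simplification would be to keep only rulings whose stroke color is close to black. The `Ruling` type, the `max_component` threshold, and the sample values below are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Ruling(NamedTuple):
    """A horizontal line at height y, with an RGB stroke color in 0.0-1.0."""
    y: float
    color: Tuple[float, float, float]


def is_near_black(color, max_component=0.2):
    # Treat a color as "black enough" when every RGB component is small.
    return all(component <= max_component for component in color)


def drop_non_black(rulings: List[Ruling], max_component=0.2) -> List[Ruling]:
    # Keep only the rulings a viewer would plausibly see as table borders.
    return [r for r in rulings if is_near_black(r.color, max_component)]


rulings = [
    Ruling(y=100.0, color=(0.0, 0.0, 0.0)),   # black border
    Ruling(y=114.0, color=(1.0, 1.0, 1.0)),   # invisible white line
    Ruling(y=128.0, color=(0.1, 0.1, 0.1)),   # dark gray border
]
kept = drop_non_black(rulings)
```

A single threshold sidesteps the problem of describing a range of RGB colors, at the cost of not handling tables drawn with colored borders.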

The second problem was sort of related, but is an actual technical bug. The "split" line is actually at two different y-axis locations: 114.0199966430664 for the first two (empty) cells and 114.02000427246094 for the rest. We need to round... because floating point numbers are dumb. That patch is 9b650f4.
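
The effect of rounding can be sketched in a few lines of Python. This is not the code from 9b650f4, only an illustration of grouping cells into rows by a rounded y-position; the function name, the `(top, left, text)` tuple layout, the `left` values, and the choice of two decimal places are assumptions.

```python
from collections import defaultdict


def group_cells_into_rows(cells, ndigits=2):
    """Group (top, left, text) cells into rows keyed by the rounded top value."""
    rows = defaultdict(list)
    for top, left, text in cells:
        rows[round(top, ndigits)].append((left, text))
    # Order rows top to bottom and cells left to right within each row.
    return [[text for _, text in sorted(rows[key])] for key in sorted(rows)]


# The two y-positions reported above; the left offsets are made up.
cells = [
    (114.0199966430664, 10.0, ""),
    (114.0199966430664, 40.0, ""),
    (114.02000427246094, 70.0, "JML"),
    (114.02000427246094, 120.0, "174,769"),
]

rounded = group_cells_into_rows(cells)              # coarse rounding: one row
too_precise = group_cells_into_rows(cells, ndigits=6)  # fine rounding: still split
```

Rounding can still separate two values that fall on either side of a rounding boundary, so comparing positions within a small tolerance is another option.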

With both changes, here's the CSV: I think it looks much better. https://gist.githubusercontent.com/jeremybmerrill/d624986d48c81fde2d29/raw/06fd9284a774f1a9175a382f012ec2dbd076373b/papua.csv

jazzido · 2014-08-22T23:46:29Z

> Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6 ( 1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR
> Tetap (DPT) )

Just for the record, tabula-java, which is soon going to become tabula-extractor's engine, includes a tool to debug these kinds of issues.

The command

$ java -cp tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar org.nerdpower.tabula.debug.Debug --rulings -p 1 DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf

generates this output image:

...which clearly shows what @jeremybmerrill described.

jeremybmerrill · 2014-09-11T22:07:45Z

Hey @jtbates, any thoughts on how to implement the solution I mention above? Would love to get this problem solved for you.

umesh-kalia · 2019-05-05T18:05:16Z

> Tabula to ignore non-black lines

Hello,

Can you please send me the solution/command to convert a PDF to CSV/Excel while ignoring non-black lines?

Thanks,

jeremybmerrill pushed a commit that referenced this issue Aug 22, 2014

fix bug 2 from issue #86

9b650f4

(needed to round y-positions in grouping cells into rows to account for floating point BS)
