This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Issues from cell spanning multiple rows #86

Open
jtbates opened this issue Jun 24, 2014 · 4 comments

Comments

@jtbates

jtbates commented Jun 24, 2014

I have PDFs from Indonesian election results that I am attempting to parse to CSVs. These contain spreadsheets where a cell may span multiple rows:

screen shot 2014-06-23 at 5 56 09 pm

I used the following command with tabula-extractor:

$ tabula DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf -p all -r -o DD-1_-_DPR_-_9201_-_PAPUA_BARAT.csv

The row-spanning cells seem to be causing a couple of problems. Output for reference:

screen shot 2014-06-23 at 5 55 41 pm

The first problem is that, for cells that span multiple rows, the text after the first line is discarded. This can be seen in the selected cell in the picture: Tetap (DPT) is missing. Similarly, Tambahan (DPTb) is missing from the next cell, and so forth.

The second problem is that the row below is sometimes split. This seems to happen once or twice but not thereafter. In this example, rows 7 and 8 should be joined. This can be seen more clearly in the CSV output (lines 7-9):

"",""
PR,"82,484","79,536","38,602","17,395","16,796","9,965","14,300","16,500","24,532","21,368","10,751","332,229"
"","",JML,"174,769","165,250","86,097","38,185","34,895","21,874","28,869","35,937","50,255","49,179","23,791","709,101"

Here is the PDF I used in this example and here is the output from tabula-extractor.

@jeremybmerrill
Member

Hi Jordan,

Sorry again for the delay in getting back to you. You found some nice bugs here! Thanks!

I've figured out the source of the first problem; for better or for worse, the bug is more philosophical than technical.

Tabula's "spreadsheet" extraction method uses vector lines to attempt to recreate the structure of the table, since PDFs only have lines, with no conception of tables or of relationships between lines. Sometimes there are lines that are present in the PDF but not visible to a viewer. That's the case here: there's a white line running across cell B6 (1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR Tetap (DPT)) at the same height as the line separating PR and JML in column 3. (And since the line crosses the Tetap (DPT) text, the text isn't included in either cell and therefore ends up ignored.)
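
To make the dropped-text effect concrete, here is a minimal Python sketch. It assumes that a text line is assigned to a cell only when its bounding box lies entirely inside that cell; Tabula's real rule may differ. The `Rect` class, the `assign` helper, and all coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned box; y grows downward, as in page coordinates."""
    top: float
    left: float
    bottom: float
    right: float

    def contains(self, other: "Rect") -> bool:
        return (self.top <= other.top and self.left <= other.left
                and self.bottom >= other.bottom and self.right >= other.right)


def assign(text_box, cells):
    """Return the names of the cells that fully contain text_box."""
    return [name for name, cell in cells.items() if cell.contains(text_box)]


# Hypothetical geometry: an invisible ruling at y=114 splits one visual
# cell into an upper and a lower cell.
cells = {
    "upper": Rect(top=100.0, left=50.0, bottom=114.0, right=300.0),
    "lower": Rect(top=114.0, left=50.0, bottom=128.0, right=300.0),
}
first_line = Rect(top=102.0, left=52.0, bottom=110.0, right=290.0)
second_line = Rect(top=110.0, left=52.0, bottom=118.0, right=120.0)  # straddles y=114

print(assign(first_line, cells))   # inside the upper cell
print(assign(second_line, cells))  # inside neither cell, so it would be dropped
```

Under that assumption, the second text line belongs to no cell, which matches the missing Tetap (DPT) text described above.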

The way to fix this is to tell Tabula to ignore non-black lines. This is all built out, but isn't present in the script in bin/ -- since it's weird, hard to describe and hard to tune. I could probably send you a substitute file that'd include that option, or maybe add it as an undocumented feature. (@jazzido, what do you think about the options for surfacing the line_color_filter thing?) I think it's too complicated for a command-line option -- or at least, I don't know how to represent a range of RGB colors on the command line and describe that method in an intuitive way.
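
For readers wondering what "ignore non-black lines" could look like, here is a rough Python sketch of the idea. It is not Tabula's `line_color_filter` (its real signature isn't shown in this thread); the `Ruling` class, the `is_near_black` predicate, and the `tolerance` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Ruling:
    """A hypothetical vector line with a stroke color (RGB, each 0.0 to 1.0)."""
    x1: float
    y1: float
    x2: float
    y2: float
    color: Tuple[float, float, float]


def is_near_black(color: Tuple[float, float, float], tolerance: float = 0.1) -> bool:
    """True when every RGB channel is within `tolerance` of zero."""
    return all(channel <= tolerance for channel in color)


def drop_non_black(rulings: List[Ruling], tolerance: float = 0.1) -> List[Ruling]:
    """Keep only the rulings a reader would actually see as dark lines."""
    return [r for r in rulings if is_near_black(r.color, tolerance)]


rulings = [
    Ruling(50, 100, 300, 100, color=(0.0, 0.0, 0.0)),  # visible black border
    Ruling(50, 114, 300, 114, color=(1.0, 1.0, 1.0)),  # invisible white line
]
visible = drop_non_black(rulings)
print(len(visible))
```

A single tolerance value is easier to expose than an arbitrary RGB range, at the cost of only handling the "keep dark lines" case.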

The second problem was sort of related, but is an actual technical bug. The "split" line is actually at two different y-axis locations: 114.0199966430664 for the first two (empty) cells and 114.02000427246094 for the rest. We need to round, because floating-point numbers are dumb. That patch is 9b650f4.
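
The rounding idea can be sketched in a few lines of Python. This is an illustration of the technique, not the contents of commit 9b650f4; the `group_into_rows` helper and the `ndigits` parameter are assumptions, and only the two y-values come from this thread.

```python
from collections import defaultdict


def group_into_rows(cells, ndigits=2):
    """Group (top, text) pairs into rows keyed by the rounded top position."""
    rows = defaultdict(list)
    for top, text in cells:
        rows[round(top, ndigits)].append(text)
    # Return rows ordered from the top of the page downward.
    return [rows[key] for key in sorted(rows)]


cells = [
    (114.0199966430664, ""),
    (114.0199966430664, ""),
    (114.02000427246094, "JML"),
    (114.02000427246094, "174,769"),
]

print(len(group_into_rows(cells, ndigits=10)))  # nearly exact comparison: row is split
print(len(group_into_rows(cells, ndigits=2)))   # rounded comparison: one row
```

Rounding can still separate two values that fall on either side of a rounding boundary, so comparing with an explicit tolerance is a common alternative.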

With both changes, here's the CSV: I think it looks much better. https://gist.githubusercontent.com/jeremybmerrill/d624986d48c81fde2d29/raw/06fd9284a774f1a9175a382f012ec2dbd076373b/papua.csv

jeremybmerrill pushed a commit that referenced this issue Aug 22, 2014
(needed to round y-positions in grouping cells into rows to account for floating point BS)
@jazzido
Contributor

jazzido commented Aug 22, 2014

Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6 (1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR Tetap (DPT))

Just for the record, tabula-java, which is soon going to become tabula-extractor's engine, includes a tool to debug these kinds of issues.

The command java -cp tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar org.nerdpower.tabula.debug.Debug --rulings -p 1 DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf generates this output image:

dd-1_-dpr-9201-_papua_barat-1

...which clearly shows what @jeremybmerrill described.

@jeremybmerrill
Member

Hey @jtbates, any thoughts on how to implement the solution I mention above? Would love to get this problem solved for you.

@umesh-kalia

Tabula to ignore non-black lines

Hello,

Can you please send me the solution or command for converting a PDF to CSV/Excel while ignoring non-black lines?

Thanks,
