Helping tabula find the top of a table - column heading cribs? #112

psychemedia · 2016-05-24T11:10:48Z

When parsing large documents with tables placed in arbitrary locations on a page, I wonder if it would be useful to help Tabula get its eye in on the location of a table by giving it one or more keywords that you expect to see, or require, in the table's column headings?

So, for example, we might provide a set of required heading tokens (Date, Region) that must appear in a tokenised set generated from the words in the guessed-at column headings, to help identify a particular table or sort of table. Or we might provide a set of possible heading tokens that we know often appear in the headings of tables we want to extract, while still being open to Tabula extracting other things it thinks are tables.
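To make the required/possible distinction concrete, here is a minimal sketch in Python. It is not part of Tabula: `tokenize` and `heading_score` are hypothetical helpers, and the list of guessed heading cells is assumed to come from whatever extraction step produced the candidate table.

```python
import re

def tokenize(cells):
    """Build a set of lower-case word tokens from guessed heading cells."""
    tokens = set()
    for cell in cells:
        tokens.update(re.findall(r"[a-z0-9]+", cell.lower()))
    return tokens

def heading_score(cells, required=(), optional=()):
    """Score a candidate heading row (hypothetical helper).

    Returns None when any required token is missing, so the candidate
    can be rejected. Otherwise returns how many optional tokens matched,
    which can be used to rank candidates without excluding any.
    """
    tokens = tokenize(cells)
    required = {t.lower() for t in required}
    optional = {t.lower() for t in optional}
    if not required <= tokens:
        return None
    return len(optional & tokens)

score = heading_score(
    ["Date", "Sales Region", "Amount (GBP)"],
    required=["Date", "Region"],
    optional=["Amount", "Total"],
)
```

A candidate with a score of 0 still passes, which matches the "open to other tables" case: optional tokens only raise a table's rank, they never filter it out.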

jeremybmerrill · 2016-05-27T20:39:07Z

That's a great idea...

Darkvater · 2017-03-25T03:37:42Z

I wonder if this has gotten anywhere? I'm writing a bank-statement parser, but the table detection can be very fickle. In essence, I extract the whole page, and unfortunately Tabula doesn't always find the tables, so depending on the contents it will group one or more columns together, making it really difficult to work with the data.

I was just thinking of doing something very similar in my app to what is suggested above:

1. Extract everything from the page.

2. Find the keywords that mark the start/end of a table.

3. Rerun the table extraction using just those coordinates as start/end (hoping that it will now work). As I only have tables that span the whole width of the page, I will only be using the Y-coordinates.

I have empirically verified that this works with a few examples using the Tabula UI, so I think I will give it a try. However, if it already exists, or people have better ideas, I would be delighted to hear.
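The three-step approach in this comment could be sketched roughly as follows. This is only an illustration under assumptions: `table_area` is a hypothetical helper, the `words` list (each entry with `text`, `top` and `bottom` in points measured from the top of the page) is assumed to come from the first whole-page pass, and the end marker's own line is assumed to belong to the table. The result is ordered top, left, bottom, right, which is the order Tabula's area option expects.

```python
def table_area(words, start_keyword, end_keyword, page_width):
    """Return [top, left, bottom, right] bounded by two keywords, or None.

    Hypothetical helper: only the Y-coordinates come from the keywords;
    the X range always spans the full page width.
    """
    start_kw = start_keyword.lower()
    end_kw = end_keyword.lower()

    # Step 2a: first word containing the start keyword marks the table top.
    start = next((w for w in words if start_kw in w["text"].lower()), None)
    if start is None:
        return None

    # Step 2b: first matching end keyword that lies below the start marker.
    end = next(
        (w for w in words
         if w["top"] > start["top"] and end_kw in w["text"].lower()),
        None,
    )
    if end is None:
        return None

    # Step 3 input: the region to hand to a second extraction pass.
    return [start["top"], 0.0, end["bottom"], page_width]

# Made-up sample of words from a first pass over one statement page.
words = [
    {"text": "Statement", "top": 50, "bottom": 62},
    {"text": "Date", "top": 120, "bottom": 132},
    {"text": "01/03", "top": 140, "bottom": 152},
    {"text": "Closing balance", "top": 400, "bottom": 412},
]
area = table_area(words, "Date", "Closing balance", page_width=595.0)
```

Returning `None` when either keyword is missing lets the caller fall back to the whole-page extraction instead of passing a bad region to the second pass.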

Darkvater mentioned this issue Mar 25, 2017

Helping tabula find the top of a table - column heading cribs? tabulapdf/tabula-java#151

Open
