This repository has been archived by the owner on Jan 20, 2021. It is now read-only.
When parsing large documents with tables placed in arbitrary locations on a page, I wonder if it would be useful to help Tabula get its eye in on the location of a table by giving it one or more keywords that you expect to see, or require, in the table's column headings?
So, for example, we might provide a set of required heading tokens (Date, Region) that must appear in a tokenised set generated from the words in the guessed-at column headings, to help identify a particular table or sort of table. Alternatively, we might provide a set of possible heading tokens that we know often appear in the headings of tables we want to extract, while remaining open to Tabula extracting other things it thinks are tables.
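To make the suggestion concrete, here is a minimal sketch of the required/possible token check in plain Python. It is not an existing Tabula feature: the function names `tokenise` and `heading_score` are hypothetical, and it assumes the guessed-at heading cells are already available as a list of strings.

```python
import re


def tokenise(cells):
    """Collect lower-cased word tokens from a list of heading cell strings."""
    tokens = set()
    for cell in cells:
        tokens.update(re.findall(r"[a-z0-9]+", cell.lower()))
    return tokens


def heading_score(header_cells, required=(), optional=()):
    """Hypothetical helper: score a candidate heading row.

    Returns None if any required token is missing; otherwise returns
    the number of optional tokens found in the heading.
    """
    tokens = tokenise(header_cells)
    if not {t.lower() for t in required} <= tokens:
        return None
    return len({t.lower() for t in optional} & tokens)


# Example candidate heading rows (made up for illustration).
candidates = [
    ["Date", "Region", "Total sales"],
    ["Name", "Address"],
]
for header in candidates:
    score = heading_score(
        header, required=["Date", "Region"], optional=["Total", "Amount"]
    )
    print(header, score)
```

A candidate that scores `None` would be skipped when only specific tables are wanted, while the optional count could be used to rank the remaining candidates.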
I wonder if this has gotten anywhere? I'm writing a bank-statement parser, but the table detection can be very fickle. In essence, I extract the whole page, and unfortunately Tabula doesn't always find the tables, so depending on the contents it will group one or more columns together, which makes the data really difficult to work with.
I was just thinking of doing something very similar in my app to what is suggested above:
1. Extract everything from the page.
2. Find the keywords that mark the start/end of a table.
3. Rerun the table extraction process using just those coordinates as start/end (hoping that it will now work). As I only have tables that span the whole page, I will only be using the Y-coordinates.
I have empirically verified that this works with a few examples using the Tabula UI, so I think I will give it a try. However, if it already exists, or people have better ideas, I would be delighted to hear.
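The keyword-to-coordinates step could be sketched roughly as follows. This is only an illustration under assumptions: the input shape (a list of `(text, top, bottom)` tuples in PDF points, sorted top to bottom) and the function name `find_table_area` are hypothetical, and the end-marker line is assumed to belong to the table. The result is ordered top, left, bottom, right, which is how I understand Tabula's area selection to be expressed; check that against the version you use.

```python
def find_table_area(lines, start_keyword, end_keyword, page_width):
    """Hypothetical helper: find the vertical extent of a full-width table.

    lines: (text, top, bottom) tuples sorted top to bottom (assumed shape).
    Returns (top, left, bottom, right), or None if a marker is missing.
    """
    top = bottom = None
    for text, line_top, line_bottom in lines:
        lowered = text.lower()
        if top is None:
            # Still looking for the line that starts the table.
            if start_keyword.lower() in lowered:
                top = line_top
        elif end_keyword.lower() in lowered:
            # Assumption: the end-marker line is part of the table.
            bottom = line_bottom
            break
    if top is None or bottom is None:
        return None
    # The table spans the whole page, so only the Y-coordinates vary.
    return (top, 0.0, bottom, page_width)


# Made-up page content for illustration.
lines = [
    ("Statement of account", 40.0, 52.0),
    ("Date Description Amount", 120.0, 132.0),
    ("01/02 Coffee 3.50", 140.0, 152.0),
    ("Closing balance 96.50", 160.0, 172.0),
    ("Page 1 of 3", 780.0, 790.0),
]
area = find_table_area(lines, "Date", "Closing balance", page_width=595.0)
print(",".join(str(v) for v in area))
```

Plain substring matching is used to keep the sketch short; a real statement would probably need stricter matching so that a word such as "Update" does not trigger the "Date" marker.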