[Feature Request] Table structure extraction at the API #1714

troplin · 2018-06-29T14:37:46Z

There is already a table detection mechanism in tesseract, but unfortunately there seems to be no way to access the table structure through the API.

This could be done with only minimal changes to the API, just by extending the PageIteratorLevel enum with two additional members, RIL_TABLEROW and RIL_TABLECELL, or similar.
Those would only be relevant inside PT_TABLE blocks, just like RIL_PARA is only meaningful for text blocks.
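
The proposed interface change can be sketched as follows. This is a standalone mock for illustration, not tesseract's real header: the first five members mirror the existing PageIteratorLevel values, while the type name PageIteratorLevelSketch and the helper LevelIsMeaningful are hypothetical, and the rule the helper encodes is a simplification of the idea that the new levels only apply inside table blocks.

```cpp
#include <cassert>

// Standalone mock for illustration only; NOT tesseract's real header.
// The first five members mirror the existing PageIteratorLevel values.
enum PageIteratorLevelSketch {
  RIL_BLOCK,     // Block of text/image/separator line.
  RIL_PARA,      // Paragraph within a block.
  RIL_TEXTLINE,  // Line within a paragraph.
  RIL_WORD,      // Word within a text line.
  RIL_SYMBOL,    // Symbol/character within a word.
  // Proposed additions, appended so the existing numeric values do not change.
  RIL_TABLEROW,
  RIL_TABLECELL
};

// Hypothetical helper (assumption): the two new levels are only meaningful
// when the current block is a table; all other levels are accepted as-is.
inline bool LevelIsMeaningful(PageIteratorLevelSketch level, bool block_is_table) {
  assert(level >= RIL_BLOCK && level <= RIL_TABLECELL);
  if (level == RIL_TABLEROW || level == RIL_TABLECELL) return block_is_table;
  return true;
}
```

Appending the new members keeps the existing values stable for current callers, at the cost of the enum no longer being ordered strictly from coarse to fine.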

zdenop · 2018-06-29T16:22:03Z

Are you able to send a PR for this, including a simple test case (similar to #1614)?

troplin · 2018-06-29T18:59:48Z

@zdenop I didn't mean to imply that it was easy to implement, just that the interface changes are small. I have honestly no idea what it takes but if I find the time, I'll give it a try.

amitdo · 2018-06-29T20:15:38Z

I assume tesseract handles tables in one of these two ways:

1. Table columns are held in tesseract blocks and cells are held as lines within blocks.
2. Table rows are held in tesseract blocks and cells are held as lines within blocks.

I bet on option (1).

troplin · 2018-06-29T20:42:17Z

@amitdo I'm pretty sure that this is not the case. I fear that the information is lost completely.
IMO it's also not a very good representation. Cells can be multi-line, so they should be comparable to paragraphs, not lines.

The table I'm testing with seems to be recognized as a single block (which makes sense IMO).
But then the table is split into two paragraphs (one for the first row and one for the rest), which does not make much sense.
The lines span the whole table. For multiline cells, the lines of each cell are combined into nonsensical long lines.

If this reflects the internal table structure, that would mean that the table detection is really bad and I can just disregard it.
If not, the results could be presented much better.
The fact that the table separators are actually recognized as horizontal and vertical lines makes me think that the information might be there.

I'm going to investigate a bit more, once I've successfully set up the debug viewer.

amitdo · 2018-06-29T21:25:53Z

Tesseract considers any table it can recognize as a block, so it's neither of the cases.

amitdo · 2018-06-29T21:40:52Z

The table detection code is here:
https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/tablefind.cpp

amitdo · 2018-06-29T21:44:22Z

Play with the variables:

tesseract/src/textord/tablefind.cpp, line 143 in 509a6f0:

BOOL_VAR(textord_show_tables, false, "Show table regions");

amitdo · 2018-06-29T21:54:32Z

They published a paper about the table detection module.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.638.7400&rep=rep1&type=pdf

Shreeshrii · 2018-06-30T03:52:21Z

https://github.com/tabulapdf/tabula/issues/409#issuecomment-327050906 from someone who has scripted the process of splitting a PDF into cells and OCR'ing them separately: chain pdf-table-extract, ghostscript and tesseract

Shreeshrii · 2018-06-30T03:54:24Z

Related issue - How to detect table region after the update in Tablefind.cpp? #825

Shreeshrii · 2018-06-30T03:56:36Z

Also see https://github.com/DanBloomberg/leptonica/blob/5dca24f9674c7fd057ab55bbfc71efa87a83a520/version-notes.html#L180

 Improved table detection on scanned page images (tests: pageseg_reg.c)

DanBloomberg/leptonica@18342b4

troplin · 2018-06-30T07:50:24Z

Thanks for all the pointers. I don't want to change the table detection though, or even implement it myself. I just want the results accessible at the public API, if available.

Is there a high-level description of the internal processing pipeline of tesseract somewhere?

troplin · 2018-07-02T08:54:44Z

Ok, I've got the debug viewer running.

It seems that the table detection works perfectly:

But then, the contents of the table are just processed as any other text, which doesn't make sense to me:

So, this means that the data is actually there, but it's not actually used.
Is this because the whole table is a simple block? Would it be better to treat every cell as a single block and represent the table structure at a higher level?

Sintun · 2018-07-09T10:48:07Z

Hi, I'm thinking about writing the API / structure part necessary to hold and access the otherwise lost information (as described by @troplin). Is the approach described in @troplin's first comment feasible? Would a commit from a tesseract-team outsider be acceptable? Is there a guideline for your C++ code styling? And: is this the right place to ask these questions?

amitdo · 2018-07-09T11:32:51Z

Hi @Sintun!

> I'm thinking about writing the API / structure part necessary to hold and access the otherwise lost information (as described by @troplin). Is the approach described in @troplin's first comment feasible?

I don't know.

> Would a commit from a tesseract-team outsider be acceptable?

Yes, of course.
https://github.com/tesseract-ocr/tesseract/graphs/contributors
AFAIK, only 2 people in this list are from Google.

Whether a specific PR will be accepted or rejected will depend on its code quality.

Don't break an existing API.

> Is there a guideline for your C++ code styling?

Not officially, but since most of the code comes from Google, it's a good idea to use the Google C++ Style Guide.

> Is this the right place to ask these questions?

I think so.

Sintun · 2018-07-09T12:10:20Z

Nice, I will start this and a traceback and possible fix of #1712 as a weekend-project.

amitdo · 2018-07-09T12:43:50Z

I suggest that you provide an example that demonstrates the use of the new API.

Good luck!

troplin · 2018-07-09T17:51:26Z

@Sintun It might actually be better to invert my API suggestion and do something like:

RIL_TABLE
- RIL_TABLEROW
  - RIL_BLOCK (== cell)
    - RIL_PARA

instead of my earlier suggestion, which was:

RIL_BLOCK (== table)
- RIL_TABLEROW
  - RIL_TABLECELL
    - RIL_PARA

I don't know which one is better; it depends on what a block means to the engine. I already tried to figure out how the recognition process works at a high level, but I'm a bit lost.
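
For illustration, the first (inverted) nesting could be modelled with a few plain containers. This is only a sketch under assumptions: the types TableSketch, RowSketch and CellSketch and the helper CellText are hypothetical and do not exist in tesseract, and paragraphs are reduced to plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; none of these exist in tesseract.
struct CellSketch {                     // plays the role of RIL_BLOCK (== cell)
  std::vector<std::string> paragraphs;  // RIL_PARA level; a cell may hold several
};
struct RowSketch {                      // RIL_TABLEROW level
  std::vector<CellSketch> cells;
};
struct TableSketch {                    // RIL_TABLE level
  std::vector<RowSketch> rows;
};

// Joins the paragraphs of one cell with newlines, so the text of a
// multi-line cell stays together instead of being merged across the row.
// Sketch only: the indices are checked by assert, i.e. in debug builds.
std::string CellText(const TableSketch& table, std::size_t row, std::size_t col) {
  assert(row < table.rows.size());
  assert(col < table.rows[row].cells.size());
  std::string text;
  for (const std::string& para : table.rows[row].cells[col].paragraphs) {
    if (!text.empty()) text += '\n';
    text += para;
  }
  return text;
}
```

The second nesting would have the same shape with the table itself being the block and the cell a separate level below the row; which of the two fits better depends, as said above, on what a block means to the engine.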

Maybe someone with a deeper understanding of the internals could give a hint?

amitdo · 2018-07-10T08:47:25Z

> Maybe someone with a deeper understanding of the internals could give a hint?

I think only @theraysmith can help here.

Sintun · 2018-07-10T08:55:02Z

@troplin At first sight your new suggestion looks more logical. I hope that the existing structures will give me a clear path to hold the table information; otherwise I will stick with the most logical (at least for me) non-API-breaking approach. Thanks for the hint, I will take it into account when looking at the code.

@amitdo I think I have to invest some time and dive through the surrounding code before I can understand helpful hints from people who are familiar with the code base.

Shreeshrii · 2018-09-27T12:55:59Z

@zdenop Please label as Feature Request.

zdenop · 2018-09-30T15:15:52Z

@Sintun: any progress on this issue? If the API needs to be changed, I would like to get it in for the 4.0 release...

Sintun · 2018-10-01T12:55:04Z

> @Sintun: any progress on this issue? If the API needs to be changed, I would like to get it in for the 4.0 release...

Unfortunately not yet, I'm still working on #1192 / #1712, because it seems to be a more pressing matter.

krishna11888 · 2019-04-12T13:06:54Z

What happened to the table extraction feature in the API?

Sintun · 2019-04-13T10:51:22Z

Hi, I wanted to make the information accessible through the API.
Unfortunately I wasn't able to find enough free time to do it.
Fortunately my employer needs this feature, and I'm going to propose (next week) using this tesseract code and updating the tesseract API. I'm pretty confident that this will go through, so I could focus on it, which should enable me to finish it in the near future.

zdenop · 2019-05-03T16:32:26Z

Just a side note: there is a Python project for extracting table data from PDFs, Camelot, and there is also a web interface for it, Excalibur.

saiprasadjnv · 2020-10-10T20:30:32Z

Is this feature request still open? I am interested in working on it. If someone is already working on it, we can collaborate and speed up the process. Please feel free to contact me at [email protected].

zdenop · 2020-10-11T11:16:51Z

@saiprasadjnv you can work on this. @Sintun started to work on this, but never sent a PR here, so IMO this is an unfinished task.

amitdo · 2020-10-11T12:33:15Z

I suggest starting by testing @Sintun's patches in his fork.

amitdo · 2020-10-11T12:46:36Z

> do we have any sample codes on table detection using tesseract?

#1714 (comment)

balachandarsv · 2021-02-26T13:46:42Z

I tried it with some sample tables, and @Sintun's solution works well. Any idea when this will be merged into master?

Sintun · 2021-02-28T15:11:52Z

I could update it and create a pull request around next weekend, if someone gives me absolution for using a singleton approach (an object that shares properties with global variables :( ).

balachandarsv · 2021-03-13T09:42:09Z

@Sintun any update on the pull request?

Sintun · 2021-03-14T17:18:34Z

Hi, sorry for the delay, and thanks for the reminder. Since no one objected to my approach, I will create a pull request and a more in-depth example tomorrow.

balachandarsv · 2021-03-15T07:20:26Z

Thank you @Sintun. Appreciate your response.

Sintun · 2021-03-15T19:57:31Z

I just created PR #3330. My approach circumvents the tesseract iterator infrastructure. This enables the feature, but it is uncertain whether it will be merged into master.

The current master with my updated changes can be found at
https://github.com/Sintun/tesseract
There is a demonstration example, along with example commands for compiling the demo and for compiling tesseract, but you need to insert the paths that correspond to your setup.

Shreeshrii · 2021-03-17T10:21:02Z

@Sintun Thanks.

I also suggest making the demo example part of the tesseract repo.

@zdenop @stweil @egorpugin @amitdo
What is the recommended location for keeping the examples? Should we add a directory in this repo to keep this and other API usage samples?

egorpugin · 2021-03-17T10:27:45Z

/example dir is fine for any examples.
But since we do OCR, possible image/training files can slow down git clones. So I think a separate repo would be better for this.

Sintun · 2021-03-17T11:09:49Z

I adjusted the demo file License to Apache 2. It can be added to any repo. I was not able to find an /example directory.

Shreeshrii · 2021-03-17T11:33:24Z

Yes, the /example directory does not exist. Currently the API examples are part of the documentation in the tessdoc repo (having been transferred from the Wiki).

I had added one apiexample to the 'test' repo for use in CI.

It would be good to move them all to one place and have an option to build them.

stweil · 2021-03-17T12:40:51Z

I wonder whether it would be even better to add a test case in unittest (which could also serve as an example).

Shreeshrii · 2021-03-19T02:58:11Z

> I wonder whether it would be even better to add a test case in unittest (which could also serve as an example).

That's a good idea!

For starters, though, I suggest that we add an /example directory to the tesseract repo itself (similar to unittest), and the images needed for running can come from the test repo.

@stweil Does the directory need to be created before @Sintun can add the demonstration example to it?

egorpugin · 2021-03-19T07:24:31Z

No, I think we should put examples in a separate repo (and make it a submodule if you want an example dir in the main repo).

stweil · 2021-03-19T09:35:56Z

Then I suggest a new repository tesscontrib (not to be confused with tessdata_contrib) and provide API examples there in a directory examples.

That new repository could include tesseract as a submodule and provide continuous integration tests which check whether all examples work with the latest tagged Tesseract release.

It could also add tesserocr and maybe other third party Tesseract APIs as submodules and test those as well.

stweil · 2021-03-19T09:38:34Z

> Does the directory need to be created before @Sintun can add the demonstration example to it?

No. If new files are added to new directories, git will handle that automatically.

Shreeshrii · 2021-03-25T03:19:56Z

> Then I suggest a new repository tesscontrib (not to be confused with tessdata_contrib) and provide API examples there in a directory examples.

Sorry, I missed this message earlier.

@stweil Can you set this up?

I had extracted the API examples that were in the wiki at https://github.com/tesseract-ocr/tessdoc/tree/master/examples
Those can also be added.

Thanks!

stweil · 2021-07-22T05:52:08Z

Pull request #3330 has to be reverted because of two severe regressions, so we still need another implementation.

amitdo · 2021-07-22T08:16:24Z

See PR #3505.

amitdo · 2021-07-24T20:42:12Z

What code triggered the issues? Does it come from tablerecog_test.cc? Why was it not detected before?

stweil · 2021-07-25T11:06:17Z

The Python wrapper tesserocr (which is used by OCR-D) has a global TessBaseAPI object which is destroyed at program exit. That interfered with the singleton introduced by PR #3330. The resulting segmentation fault was detected earlier by people working with OCR-D but not analyzed. Our CI tests don't include a global TessBaseAPI (that should be added, maybe to baseapi_test).

The other regression was an assertion which occurred with some images with tables, see #3330 (comment). tablerecog_test did not detect that issue.
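
The thread does not spell out the exact mechanism of the first regression. One plausible shape of it, shown here purely as an assumption, is the usual static destruction order hazard in C++: a function-local static that is first used during the program run is destroyed before a global object that was set up earlier, so the global's destructor must not touch it. The classes TableStoreSketch and ApiSketch and the global g_api are hypothetical stand-ins, not tesseract or tesserocr code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a process-wide table store (not tesseract code).
class TableStoreSketch {
 public:
  static TableStoreSketch& Instance() {
    static TableStoreSketch store;  // constructed on the first call
    return store;
  }
  static bool Alive() { return alive_; }
  void Add(int table_id) {
    assert(alive_);  // the store must not be used after its destruction
    tables_.push_back(table_id);
  }
  void Clear() { tables_.clear(); }
  std::size_t Size() const { return tables_.size(); }

 private:
  TableStoreSketch() { alive_ = true; }
  ~TableStoreSketch() { alive_ = false; }
  static bool alive_;  // plain bool: still readable after the store is gone
  std::vector<int> tables_;
};
bool TableStoreSketch::alive_ = false;

// Hypothetical stand-in for an API object that a wrapper keeps as a global.
class ApiSketch {
 public:
  void Recognize() { TableStoreSketch::Instance().Add(1); }
  ~ApiSketch() {
    // A global ApiSketch is set up before the function-local static above is
    // first used, so at exit the store is destroyed first. Without this
    // guard the destructor would access an already destroyed object.
    if (TableStoreSketch::Alive()) TableStoreSketch::Instance().Clear();
  }
};

ApiSketch g_api;  // global instance, destroyed at program exit
```

A test that only creates API objects on the stack never exercises this order, which would be consistent with the remark above that the CI tests do not include a global TessBaseAPI.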

stweil · 2021-07-25T11:34:07Z

See pull request #3509 which detects one of the two regressions.

ZivFisher · 2024-02-03T20:34:08Z

Did you solve the issue with tables?

zdenop added feature request accuracy labels Sep 30, 2018

zdenop closed this as completed Sep 30, 2018

zdenop reopened this Oct 1, 2018

stweil closed this as completed in 122daf1 Mar 17, 2021

amitdo added the tables label Mar 24, 2021

stweil reopened this Jul 22, 2021
