[Feature Request] Table structure extraction at the API #1714
Comments
Are you able to send a PR for this, including a simple test case (similar to #1614)? |
@zdenop I didn't mean to imply that it was easy to implement, just that the interface changes are small. I have honestly no idea what it takes but if I find the time, I'll give it a try. |
I assume tesseract handles tables in one of these two ways:
I bet on option (1). |
@amitdo I'm pretty sure that this is not the case. I fear that the information is lost completely. The table I'm testing with seems to be recognized as a single block (which makes sense IMO). If this reflects the internal table structure, that would mean that the table detection is really bad and I can just disregard it. I'm going to investigate a bit more, once I've successfully set up the debug viewer. |
Tesseract treats any table it can recognize as a block, so it's neither of the cases. |
The table detection code is here: |
Play with the variables: tesseract/src/textord/tablefind.cpp Line 143 in 509a6f0
|
They published a paper about the table detection module. |
https://github.com/tabulapdf/tabula/issues/409#issuecomment-327050906
from someone who has scripted the process of splitting a PDF into cells and
OCR'ing them separately:
chain pdf-table-extract, ghostscript and tesseract
|
Related issue - How to detect table region after the update in Tablefind.cpp? #825 |
DanBloomberg/leptonica@18342b4 |
Thanks for all the pointers. I don't want to change the table detection though, or even implement it myself. I just want the results accessible at the public API, if available. Is there a high-level description of the internal processing pipeline of tesseract somewhere? |
Hi, I'm thinking about writing the API / structure part necessary to hold and access the otherwise lost information (as described by @troplin). Is the approach described in @troplin's first comment feasible? Would a commit from a tesseract-team outsider be acceptable? Is there a guideline for your C++ code style? And: is this the right place to ask these questions? |
Hi @Sintun!
I don't know.
Yes, of course. Whether a specific PR is accepted or rejected will depend on its code quality. Don't break an existing API.
Not officially, but since most of the code comes from Google, it's a good idea to use the Google C++ Style Guide.
I think so. |
Nice, I will start this and a traceback and possible fix of #1712 as a weekend-project. |
I suggest that you provide an example that demonstrates the use of the new API. Good luck! |
@Sintun It might actually be better to invert my API suggestion and do something like:
instead of my earlier suggestion, which was:
I don't know which one is better, it depends on what a block means to the engine. I already tried to figure out how the recognition process works on a high level, but I'm a bit lost. Maybe someone with a deeper understanding of the internals could give a hint? |
I think only @theraysmith can help here. |
@troplin At first sight your new suggestion looks more logical. I hope that the existing structures will give me a clear path to hold the table information; otherwise I will stick with the most logical (at least for me) non-API-breaking approach. Thanks for the hint, I will take it into account when looking at the code. @amitdo I think I have to invest some time and dive through the surrounding code before I can understand helpful hints from people who are familiar with the code base. |
@zdenop Please label as |
@Sintun: any progress on this issue? If the API needs to be changed, I would like to get it in for the 4.0 release... |
What happened to the table extractor feature API? |
Hi, I wanted to make the information accessible through the API. |
Is this feature request still open? I am interested in working on it. If someone is already working on it, we can collaborate and speed up the process. Please feel free to contact me at [email protected]. |
@saiprasadjnv you can work on this. @Sintun started to work on this, but never sent a PR here, so IMO this is an unfinished task. |
I suggest starting by testing @Sintun's patches in his fork. |
|
I tried it with some sample tables, and @Sintun's solution works well. Any idea when this will be merged into master? |
I could update it and create a pull request around next weekend, if someone gives me absolution for using a singleton approach (an object that shares properties with global variables :( ). |
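For readers unfamiliar with the concern raised here, the following is a minimal sketch of what a singleton that collects table results could look like. It is not @Sintun's actual code: the `sketch` namespace, `TableBox`, and `TableRegistry` are hypothetical names chosen for illustration only.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of tesseract

// Hypothetical bounding box of one detected table, in image coordinates.
struct TableBox {
  int left, top, right, bottom;
};

// Hypothetical process-wide store for detected tables.
class TableRegistry {
 public:
  // Function-local static: initialized once, thread-safe since C++11.
  static TableRegistry& Instance() {
    static TableRegistry instance;
    return instance;
  }
  TableRegistry(const TableRegistry&) = delete;
  TableRegistry& operator=(const TableRegistry&) = delete;

  // Records one table box; the box must not be inverted.
  void Add(const TableBox& box) {
    assert(box.left <= box.right && box.top <= box.bottom);
    std::lock_guard<std::mutex> lock(mutex_);
    boxes_.push_back(box);
  }

  // Returns a copy so callers never hold a reference into shared state.
  std::vector<TableBox> Snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return boxes_;
  }

  // Drops all recorded boxes, e.g. before processing the next page.
  void Clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    boxes_.clear();
  }

 private:
  TableRegistry() = default;
  mutable std::mutex mutex_;
  std::vector<TableBox> boxes_;
};

}  // namespace sketch
```

The drawback hinted at in the comment is visible here: every caller in the process sees the same `TableRegistry`, so two recognition runs in one process would share (and have to clear) the same state, much like a global variable.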
@Sintun any update on the pull request? |
Hi, sorry for the delay, and thanks for the reminder. Since no one objected to my approach, I will create a pull request and a more in-depth example tomorrow. |
Thank you @Sintun. Appreciate your response. |
I just created PR #3330. My approach circumvents the tesseract iterator infrastructure. This enables the feature, but it is uncertain whether it will be merged into master. The current master with my updated changes can be found at |
@Sintun Thanks. I also suggest making the demo example part of the tesseract repo. @zdenop @stweil @egorpugin @amitdo |
|
I adjusted the demo file License to Apache 2. It can be added to any repo. I was not able to find an /example directory. |
Yes, the /example directory does not exist. Currently the API examples are part of the documentation in the tessdoc repo (having been transferred from the Wiki). I had added one API example to the 'test' repo for use in CI. It would be good to move them all to one place and have the option to build them. |
I wonder whether it would be even better to add a test case in |
That's a good idea! For starters, though, I suggest that we add an /example directory to the tesseract repo itself (similar to unittest); the images needed to run the examples can come from the test repo. @stweil Does the directory need to be created before @Sintun can add the demonstration example to it? |
No, I think we should put examples in separate repo (and make it a submodule if you want |
Then I suggest a new repository That new repository could include It could also add |
No. If new files are added to new directories, git will handle that automatically. |
Sorry, I missed this message earlier. @stweil Can you set this up? I had extracted the api examples that were in the wiki at https://github.com/tesseract-ocr/tessdoc/tree/master/examples Thanks! |
Pull request #3330 has to be reverted because of two severe regressions, so we still need another implementation. |
See PR #3505. |
What code triggered the issues? Does it come from |
The Python wrapper The other regression was an assertion which occurred with some images with tables, see #3330 (comment). |
See pull request #3509 which detects one of the two regressions. |
Did you solve the issue with tables? |
There is already some table detection mechanism in tesseract, but unfortunately there seems to be no way to access the table structure through the API.
This could be done with only minimal changes to the API, just by expanding the PageIteratorLevel enum by two additional members, RIL_TABLEROW and RIL_TABLECELL or similar. Those would only be relevant inside PT_TABLE blocks, just like PT_PARAGRAPH is only meaningful for text blocks.
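The proposal above might be sketched as follows. This is only an illustration under stated assumptions: the first five enum members mirror tesseract's existing PageIteratorLevel, while RIL_TABLEROW and RIL_TABLECELL are the hypothetical additions from this issue and are not part of the real API. The `sketch` namespace, the reduced `PolyBlockType` (whose values do not match the real enum), and the helper `LevelAppliesToBlock` are likewise made up for the example.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, not part of tesseract

// Existing levels first, in their current order, so their numeric
// values stay unchanged; the two table levels are appended at the end.
enum PageIteratorLevel {
  RIL_BLOCK,
  RIL_PARA,
  RIL_TEXTLINE,
  RIL_WORD,
  RIL_SYMBOL,
  RIL_TABLEROW,   // hypothetical: one row of a table block
  RIL_TABLECELL   // hypothetical: one cell of a table row
};

// Reduced stand-in for tesseract's block types; illustration only.
enum PolyBlockType { PT_FLOWING_TEXT, PT_TABLE };

// Returns true if iterating at `level` makes sense inside a block of
// type `type`. Table levels are only relevant inside PT_TABLE blocks;
// all other levels are treated as always applicable in this sketch.
inline bool LevelAppliesToBlock(PageIteratorLevel level, PolyBlockType type) {
  assert(level >= RIL_BLOCK && level <= RIL_TABLECELL);
  if (level == RIL_TABLEROW || level == RIL_TABLECELL) {
    return type == PT_TABLE;
  }
  return true;
}

}  // namespace sketch
```

Appending the new members rather than inserting them between existing ones is a design choice made here to keep existing numeric values stable; whether that is sufficient to avoid breaking callers depends on how the engine and client code use the enum.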