Coordinates of caption elements #1008
Comments
Hi @keto33 ! Thanks for the issue.
Yes, text blocks are not part of the TEI XML output because they are presentation/layout elements, not something related to the logical structure of the document (like paragraphs, titles, etc.).
Yes, the coordinates of the caption elements are indeed not output currently, and there is no reason not to add them. Regarding the "graphic part" of a figure, this is more or less implemented in PR #963 (the PR as a whole is not usable at this stage; it is very much a work in progress): the vector graphics are further analyzed to detect their boundaries, deal with overlapping text, etc., so that we get reliable "figure graphic" aggregated elements similar to the embedded bitmaps. There are many other things in this PR, and it will take a lot of time to complete!
Hi @ClementFrvl, which version are you using? This seems to be a problem in GROBID version 0.8.0 that disappears on the GROBID master version. 🤔
We've been working on a new version for a few weeks; hopefully we will be able to release it soon.
This should be solved in version 0.8.1.
This may seem unnecessary, but it should be a feasible feature suggestion.
GROBID outputs all coordinates of structures except for text blocks. I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream. In such cases, the bounding box of the figure caption can be helpful in estimating the actual bounding box of the EPS figure.
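To make the idea concrete, here is a rough sketch of how a caption box could be turned into an estimate of the figure box. It assumes the caption coordinates would use the same `coords` string format GROBID uses for other structures (`page,x,y,w,h` items separated by `;`, with the origin at the top-left of the page), and that the figure sits directly above its caption and spans the caption's width. The names `Box`, `parse_coords`, `union`, `estimate_figure_box`, `top_limit` and `gap` are made up for this example and are not part of GROBID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    page: int
    x: float  # left edge
    y: float  # top edge (assumes a top-left page origin)
    w: float
    h: float


def parse_coords(coords: str) -> list[Box]:
    """Parse a 'page,x,y,w,h;page,x,y,w,h;...' string into boxes."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes


def union(boxes: list[Box]) -> Box:
    """Smallest box enclosing all given boxes (they must share one page)."""
    if not boxes:
        raise ValueError("no boxes to merge")
    page = boxes[0].page
    if any(b.page != page for b in boxes):
        raise ValueError("boxes span several pages")
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.w for b in boxes)
    y1 = max(b.y + b.h for b in boxes)
    return Box(page, x0, y0, x1 - x0, y1 - y0)


def estimate_figure_box(caption: Box, top_limit: float, gap: float = 5.0) -> Box:
    """Guess the figure area: from top_limit down to just above the caption.

    top_limit is a hypothetical upper bound supplied by the caller, e.g. the
    bottom edge of the previous paragraph or the top page margin.
    """
    bottom = caption.y - gap
    if bottom <= top_limit:
        raise ValueError("no room above the caption")
    return Box(caption.page, caption.x, top_limit, caption.w, bottom - top_limit)


if __name__ == "__main__":
    # Made-up caption coordinates: two text lines on page 3.
    caption = union(parse_coords("3,100.0,400.0,200.0,10.0;3,100.0,410.0,150.0,10.0"))
    print(estimate_figure_box(caption, top_limit=250.0))
```

This is only a heuristic: captions placed above or beside the figure, or figures wider than their caption, would need a different rule.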