_getTexttrace() some origins do not match rawDict output #1185

inf3rnus · 2021-08-02T18:02:43Z

Hey there again Jorj!

I know this is an experimental feature, so I decided to not make an actual issue out of this. I'm hoping you may have some insight into this.

I'm using _getTexttrace() to get the glyph indices for the characters that PyMuPDF returns from page.getText('rawdict')['blocks'], in an effort to match each character with its glyph in the font file.

This has been working pretty well, but I recently ran into a problem where there seems to be a discrepancy between the origins from _getTexttrace() and the origins from page.getText('rawdict')['blocks'].

In order to map each character from the rawdict to the glyph index returned by _getTexttrace(), I've been "uniquely" identifying characters by their origins and using a dict per page to hold this mapping. (I know a PDF can potentially have multiple characters at the same origin, although that is probably highly unlikely, and I'm relying on the precision of floats to try to circumvent this problem.)

The problem I've run into is that some of the origins from _getTexttrace() differ by a few hundred-thousandths or millionths from what rawdict returns for the same page.

Not sure if this is a "bug". As a workaround I've been truncating the last few decimal places of each float in the meantime, but that loses precision, so it's not ideal and potentially brittle.
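One possible alternative to truncating decimals is to look origins up with an explicit tolerance. The following is only a minimal sketch and not part of PyMuPDF: the class name OriginIndex, the tolerance of 1e-3 and the glyph index 42 are assumptions made up for illustration. The two origins are the ones reported further down in this post.

```python
import math
from collections import defaultdict


class OriginIndex:
    """Hypothetical helper: maps (x, y) origins to glyph indices with a tolerance."""

    def __init__(self, tol=1e-3):
        self.tol = tol                      # assumed tolerance, tune per document
        self.cells = defaultdict(list)      # grid cell -> [(origin, glyph), ...]

    def _cell(self, x, y):
        # Bucket coordinates into square cells of side length `tol`.
        return (math.floor(x / self.tol), math.floor(y / self.tol))

    def add(self, origin, glyph):
        self.cells[self._cell(*origin)].append((origin, glyph))

    def find(self, origin):
        """Return the glyph of the closest stored origin within `tol`, else None."""
        cx, cy = self._cell(*origin)
        best, best_d = None, self.tol
        # A match within `tol` can only live in the same or a neighbouring cell.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for (ox, oy), glyph in self.cells.get((cx + dx, cy + dy), ()):
                    d = max(abs(ox - origin[0]), abs(oy - origin[1]))
                    if d <= best_d:
                        best, best_d = glyph, d
        return best


index = OriginIndex(tol=1e-3)
# Origin as reported by _getTexttrace(); 42 is a made-up glyph index.
index.add((-11.922561645507812, 81.27336120605469), 42)
# Origin of the same character as reported by rawdict.
glyph = index.find((-11.922561645507812, 81.2733154296875))
```

The tolerance trades one risk for another: too small and near-identical origins no longer match, too large and genuinely distinct characters could collide, so it still would not guarantee a unique match.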

Here are some pictures to show you what I mean by this.

Results are from this pdf: https://arxiv.org/pdf/1912.03310.pdf

Origins from the _getTexttrace() call:

^ That's a mapping of the font to the page to the origin, with the glyph index being the innermost value

What that character looks like in the raw dict output:

Note, if you search all of the characters for an x origin of -11.922561645507812, you get two characters from the raw text output, the one above, and another one:

^ I thought I'd mention the other one to show that the first character I showed should be the only character to match (-11.922561645507812, 81.27336120605469) for font BWEWWP+CMSY10 on page 4.

Notice the x and y values. The x values from _getTexttrace() and rawdict are identical (-11.922561645507812), but the y values differ by 0.00004577636: they are 81.27336120605469 and 81.2733154296875, respectively.
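For what it's worth, the size of that difference can be checked against single-precision (float32) spacing. The sketch below uses only the standard library; the helper names f32_bits, is_f32 and ulp_distance are made up for this example. It shows that both y values are exactly representable as float32 and lie 6 float32 steps apart, which hints at (but does not prove) slightly different single-precision arithmetic on the two code paths.

```python
import struct


def f32_bits(x):
    """Bit pattern of x after rounding to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_f32(x):
    """True if x survives a round trip through float32 unchanged."""
    return struct.unpack("<f", struct.pack("<f", x))[0] == x


def ulp_distance(a, b):
    """Number of float32 steps between a and b (valid for same-sign values)."""
    return abs(f32_bits(a) - f32_bits(b))


y_trace = 81.27336120605469   # y origin from _getTexttrace()
y_raw = 81.2733154296875      # y origin from rawdict

both_f32 = is_f32(y_trace) and is_f32(y_raw)   # True
steps = ulp_distance(y_trace, y_raw)           # 6
diff = y_trace - y_raw                         # 6 * 2**-17, about 0.0000457764
```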

I'm wondering whether this is a rounding issue in rawdict, or whether the two APIs (_getTexttrace() and rawdict) are completely divorced from each other. What are your thoughts?

Extra side question: do you know whether the glyph index depends on the font file's encoding, or whether there's some way to enumerate the bitmaps in glyph index order with a tool?

Best,
Aaron

JorjMcKie (Maintainer) · 2021-08-02T19:54:46Z

As a general comment, there is no safe / unique way to match the character values of _getTexttrace() and "rawdict". The reasons are manifold, it seems. Here is a (presumably) incomplete list:

- Depending on the get_text() "flags", ligatures are handled differently. _getTexttrace() returns a ligature like "fi" (unchangeably) as the following tuples in span["chars"], whereas "rawdict" does different things:
  - (102, glyph, (origin.x, origin.y), width)
  - (105, -1, (origin.x, origin.y), 0) ... same origin but width 0!
- Depending on "flags", space characters are "invented" if a seemingly too large distance between characters with the same origin.y suggests it. This is reflected in "rawdict", but not in _getTexttrace(): there it is left to your wits, and there will be no spaces at all if the document creator did not specify them.
- The separation of text into blocks, lines and spans in "(raw)dict" is based on MuPDF heuristics, which take into account things like font properties. So if a font announces (! need not be true at all) that it is italic, the heuristic tolerance for characters in a row to be considered "adjacent" is different (this also depends on the italic inclination angle). I haven't yet understood the respective algorithm.
- In _getTexttrace() you may find items in span["chars"] which do not belong to the same line. This does not happen in "rawdict".
- If the writing direction line["dir"] of "rawdict" is not (1, 0), then char["origin"] has nothing to do (well, not quite true) with char["bbox"]. If span["dir"] of _getTexttrace() is not equal to (1, 0), you would have some serious work to do to determine the corresponding character's bbox or quad.

So, assuming there can be a 1:1 match between the two is too much to expect ...
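To illustrate the ligature point from the list above, here is a minimal sketch that folds the zero-width follow-up tuples back into the glyph they belong to. It assumes only the tuple layout (code point, glyph, origin, width) quoted above; the function name group_ligatures, the sample glyph numbers and the coordinates are invented for the example.

```python
def group_ligatures(chars):
    """Merge entries with glyph == -1 into the preceding entry at the same origin.

    `chars` is a list of (code_point, glyph, (x, y), width) tuples, i.e. the
    layout described for span["chars"] above. Returns one dict per real glyph.
    """
    groups = []
    for code, glyph, origin, width in chars:
        if glyph == -1 and groups and groups[-1]["origin"] == origin:
            # Second (or later) character of a ligature: same origin, no glyph.
            groups[-1]["text"] += chr(code)
        else:
            groups.append(
                {"glyph": glyph, "origin": origin, "width": width, "text": chr(code)}
            )
    return groups


# Made-up sample: an "fi" ligature followed by a plain "x".
sample = [
    (102, 27, (10.0, 20.0), 5.5),   # "f", carries the ligature glyph
    (105, -1, (10.0, 20.0), 0),     # "i", same origin, width 0
    (120, 91, (15.5, 20.0), 5.0),   # "x"
]
groups = group_ligatures(sample)
texts = [g["text"] for g in groups]   # ["fi", "x"]
```

With this grouping, one glyph index maps to several characters, so a per-character lookup keyed by origin would need to accept that "f" and "i" share one entry.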

inf3rnus Aug 2, 2021
Author

Thanks for the info! What you've described affirms some of the things I suspected of the inner workings of pymu.

Sounds like some of this is bleed over from the unknowns in mupdf's implementation details.

As a final question (hopefully lol), and I feel like you already answered this with the above, but I guess I can distill my question into this:

Is it safe to say that the reason why the origins for what should be presumably the same character differ by incredibly tiny fractions is due to the implementation details of mupdf?

JorjMcKie (Maintainer) · 2021-08-03T08:09:23Z

> Is it safe to say that the reason why the origins for what should be presumably the same character differ by incredibly tiny fractions is due to the implementation details of mupdf?

I would agree to that! BTW, I am also trying to understand more of this:
You may have noticed the new layout-preserving text extraction. It is currently based on "rawdict".

This is fine of course. But I am a performance maniac, so I am trying to find ways of replacing that with _getTexttrace() - which is at least two times faster.
Another advantage is that you can see more text information, like whether the text is hidden, its transparency, etc.
This work is making a lot of progress, but my difficulties revolve around your questions: I am still unable to consistently fill inter-character gaps with spaces the way MuPDF does, etc.

inf3rnus Aug 4, 2021
Author

Thank you, very useful! :)

Holy smokes, you can see if the text is hidden? e.g. If the text is hidden behind layers? Or font type + mode info? Is that info already in _getTexttrace()?

++ for the performance work you do.

Fascinating. Well, I'm excited to see what your findings are, as _getTexttrace() has a lot of valuable info!

JorjMcKie (Maintainer) · 2021-08-04T17:06:09Z

> Is that info already in _getTexttrace()?

No, no - nothing as exciting as that. There is the alpha key "opacity" - a value 0 <= alpha <= 1.

Then there is "type". This is 0 for normal text and 4 for text rendering mode 3 - generated by some scanning software to make text invisible. Details:

There are 5 text trace types:
0 - fill text (PDF Tr 0)
1 - stroke text (PDF Tr 1)
2 - clip text
3 - clip-stroke text
4 - ignore text (PDF Tr 3)

So you can differentiate between stroke, fill, stroke+fill and hidden (ignored) text. I don't yet know what "clip" and "clip-stroke" text actually are. Maybe those carry the kind of info we would all like to see.
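As a small illustration of how the "type" and "opacity" keys could be used, here is a sketch that flags spans a reader would not see. The enum simply restates the five values listed above; the names TraceType and is_invisible, the sample span dicts, and the choice to treat opacity 0 as invisible are assumptions for the example.

```python
from enum import IntEnum


class TraceType(IntEnum):
    """The five text trace types as listed above."""
    FILL = 0          # PDF Tr 0
    STROKE = 1        # PDF Tr 1
    CLIP = 2
    CLIP_STROKE = 3
    IGNORE = 4        # PDF Tr 3, e.g. the hidden text layer of scanned pages


def is_invisible(span):
    """Assumption: ignored text or fully transparent text counts as invisible."""
    return span["type"] == TraceType.IGNORE or span["opacity"] == 0


# Made-up span dicts carrying only the two keys discussed here.
spans = [
    {"type": 0, "opacity": 1.0},   # normal filled text
    {"type": 4, "opacity": 1.0},   # hidden OCR-style text
    {"type": 0, "opacity": 0.0},   # filled but fully transparent
]
flags = [is_invisible(s) for s in spans]   # [False, True, True]
```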

JorjMcKie (Maintainer) · 2021-08-04T18:28:13Z

I have a working fitzcli2.py script that is based on page._getTexttrace() rather than on page.get_text("rawdict",...). It seems to be at least 30% faster, but its behaviour still differs in some details of the text layout:

- The critical success factor is determining the adequate "resolution" of the document page: the maximum average character width that is still small enough to put every character at the correct position in the text output.
- Here, I am still failing to arrive at the same value as fitzcli.py. I end up with a smaller value in most cases, which means that additional spaces are sometimes inserted between characters that you would expect to be next to each other.
- Other than that, it already looks good.
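The effect of the "resolution" can be shown with a toy model. This is not the algorithm used in fitzcli.py or fitzcli2.py, only a sketch under the assumption that each character is dropped into a column computed from its x position and a fixed cell width; the function name layout_line and the sample positions are made up.

```python
import math


def layout_line(chars, cell_width, left=0.0):
    """Place (character, x) pairs into text columns of width `cell_width`."""
    cols = {}
    for ch, x in chars:
        col = math.floor((x - left) / cell_width)
        cols[col] = ch              # a later character overwrites on collision
    if not cols:
        return ""
    return "".join(cols.get(i, " ") for i in range(max(cols) + 1))


# Three adjacent characters, each 6 units wide (made-up positions).
chars = [("a", 0.0), ("b", 6.0), ("c", 12.0)]

exact = layout_line(chars, cell_width=6.0)       # "abc"
too_small = layout_line(chars, cell_width=4.0)   # "ab c", a spurious space
too_large = layout_line(chars, cell_width=12.0)  # "bc", "a" was overwritten
```

In this model a cell width below the true character width produces the extra spaces described above, while a width that is too large makes characters collide, which is why the value has to be as large as possible yet still small enough.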

Here is the current version, if you want to test a few things:
fitzcli2.zip
The API is the same as with fitzcli.py.
Here are examples of the two:

fitzcli.py

fitzcli2.py

this is the original PDF:
demo1.pdf

The text files both have been created via python script.py gettext -pag 1 -g 3 demo1.pdf
