_getTexttrace() some origins do not match rawDict output #1185
-
As a general comment, there is no safe / unique way to match the character values of
So assuming there can be a 1:1 match between the two is too much to expect ...
-
I would agree with that! BTW, I am also trying to understand more of this: This is fine of course. But I am a performance maniac, so I am trying to find ways of replacing that with
-
No, no - nothing as exciting as that. There is the alpha key Then there is
So you can differentiate between stroke, fill, stroke+fill and hide (ignore) text. I don't yet know what "clip" and "clip-stroke" text actually are. Maybe those contain the information we would all like to see.
-
I have a working
Here is the current version, if you want to test a few things:
This is the original PDF: The text files have both been created via
-
Hey there again Jorj!
I know this is an experimental feature, so I decided to not make an actual issue out of this. I'm hoping you may have some insight into this.
I'm using _getTexttrace to get the glyph indices for the characters returned by PyMuPDF when using page.getText('rawdict')['blocks'], in an effort to match each character with the glyph in its font file.
This has been working pretty well, but I recently ran into a problem where there seems to be a discrepancy between the origins from _getTexttrace and the origins from page.getText('rawdict')['blocks'].
In order to map each character from the rawdict to the glyph index returned by _getTexttrace, I've been "uniquely" identifying characters by their origins and using a dict per page to create this mapping. (I know it's possible, although probably highly unlikely, for the PDF to have multiple characters at the same origin, but I'm relying on the precision of floats to try to circumvent this problem.)
The problem I've run into is that some of the origins from _getTexttrace are off by a few hundred-thousandths or millionths compared with what the rawdict output returns.
I'm not sure if this is a "bug". As a workaround in the meantime, I've been truncating the last few decimal places of the floats, but that loses precision, so it's not ideal and potentially brittle.
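An alternative to truncating decimals is to match origins within a small tolerance. The following is only a sketch of that idea in plain Python, not anything PyMuPDF provides: `build_index`, `lookup`, the cell size, the tolerance and the glyph index `42` are all hypothetical names and values chosen for illustration. Origins are bucketed into grid cells, and a lookup checks the neighbouring cells so that two nearly equal origins still match when they fall on opposite sides of a cell boundary (which is where plain truncation breaks).

```python
from math import hypot

CELL = 0.01   # grid cell size (assumed; must be >= TOL)
TOL = 0.001   # maximum distance still counted as "same origin" (assumed)


def build_index(trace_origins, cell=CELL):
    """Bucket {(x, y): glyph_index} entries by grid cell."""
    index = {}
    for (x, y), gid in trace_origins.items():
        key = (round(x / cell), round(y / cell))
        index.setdefault(key, []).append((x, y, gid))
    return index


def lookup(index, origin, cell=CELL, tol=TOL):
    """Return the glyph index of the nearest origin within tol, else None."""
    x, y = origin
    cx, cy = round(x / cell), round(y / cell)
    best = None
    # Check the 3x3 block of cells around the query point.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for tx, ty, gid in index.get((cx + dx, cy + dy), ()):
                dist = hypot(tx - x, ty - y)
                if dist <= tol and (best is None or dist < best[0]):
                    best = (dist, gid)
    return None if best is None else best[1]


# Origin as reported by the trace; the glyph index 42 is made up.
trace = {(-11.922561645507812, 81.27336120605469): 42}
index = build_index(trace)

# Origin of the same character as reported by rawdict.
glyph = lookup(index, (-11.922561645507812, 81.2733154296875))
missing = lookup(index, (0.0, 0.0))
```

If two distinct characters really do sit closer together than the tolerance, this picks the nearest one, so the tolerance should stay well below the smallest glyph spacing expected on the page.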
Here are some pictures to show you what I mean by this.
Results are from this pdf: https://arxiv.org/pdf/1912.03310.pdf
Origins from the _getTexttrace() call:
^ That's a mapping of the font to the page to the origin, with the glyph index being the innermost value
What that character looks like in the raw dict output:
Note, if you search all of the characters for an x origin of -11.922561645507812, you get two characters from the raw text output, the one above, and another one:
^ I thought I'd mention the other one to show that the first character I showed should be the only character to match (-11.922561645507812, 81.27336120605469) for font BWEWWP+CMSY10 on page 4.
Notice the x and y values. The x values from _getTexttrace and rawdict are the same (-11.922561645507812), but the y values differ by 0.00004577636: they are 81.27336120605469 and 81.2733154296875 respectively.
I'm wondering if this is a rounding issue in rawdict, or if the two APIs (_getTexttrace and rawdict) are completely divorced from each other. What are your thoughts?
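To put the size of that gap in perspective, here is a small check using only the standard library. It tests whether the two y values survive a round trip through a 32-bit float and counts how many float32 steps lie between them. This only measures the difference; it does not show which API is "right" or where the difference comes from. The helper names are made up for this sketch.

```python
import struct


def roundtrips_as_float32(value):
    """True if value is exactly representable as a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0] == value


def float32_bits(value):
    """Raw bit pattern of value as float32 (ordered for positive values)."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


y_trace = 81.27336120605469   # y origin from _getTexttrace
y_raw = 81.2733154296875      # y origin from rawdict

diff = y_trace - y_raw
both_float32 = roundtrips_as_float32(y_trace) and roundtrips_as_float32(y_raw)
steps = float32_bits(y_trace) - float32_bits(y_raw)

print(diff, both_float32, steps)
```

Both values come out as exact float32 numbers that are 6 representable steps apart, so the gap is a few units in the last place of single precision rather than noise in the trailing digits of a double.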
Extra side question: do you know if the glyph index depends on the font file's encoding, or if there's some way to enumerate the bitmaps in glyph index order with a tool?
Best,
Aaron