I changed it from
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}
to
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    // Only add to $text if the XObject actually yielded text; otherwise the
    // number of texts and TJ/Tj commands does not match and the last texts are ignored.
    $newText = $xobject->getText($page);
    if ($newText === '') {
        break;
    }
    $text[] = $newText;
}
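To make the effect of that guard concrete, here is a toy model of it. The function collectTexts() and its input format are hypothetical and are not part of smalot/pdfparser; the sketch only mirrors the control flow of the change above, where `break` leaves the enclosing `switch` case.

```php
<?php
// Toy model only: collectTexts() and the operator arrays are hypothetical
// stand-ins, not smalot/pdfparser code.

/**
 * Collect strings from a list of operators, skipping "Do" entries
 * whose XObject produced no text.
 */
function collectTexts(array $operators): array
{
    $texts = [];
    foreach ($operators as $operator) {
        switch ($operator['op']) {
            case 'Tj':
                $texts[] = $operator['text'];
                break;
            case 'Do':
                // Same guard as in the workaround: ignore empty XObject text.
                if ($operator['text'] === '') {
                    break; // leaves the switch, not the foreach
                }
                $texts[] = $operator['text'];
                break;
        }
    }

    return $texts;
}

// Two Tj operators around an XObject without text.
$emptyXObject = collectTexts([
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => ''],
    ['op' => 'Tj', 'text' => 'Body'],
]);

// Same page, but the XObject carries text of its own.
$filledXObject = collectTexts([
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => 'Stamp'],
    ['op' => 'Tj', 'text' => 'Body'],
]);

var_dump(count($emptyXObject));  // int(2): matches the two Tj commands
var_dump(count($filledXObject)); // int(3): one more text than Tj commands
```

In this model the guard keeps the counts aligned only when the XObject yields an empty string. If the XObject contains text, the extra entry is still added, so the guard may not cover every affected PDF.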
I didn't create a PR because I wasn't 100% sure whether this is the correct fix or just a dirty workaround. But maybe this can help someone with the same problem.
I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects:
pdfparser/src/Smalot/PdfParser/PDFObject.php
Line 785 in ac8e667
Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method:
pdfparser/src/Smalot/PdfParser/Page.php
Line 730 in ac8e667
pdfparser/src/Smalot/PdfParser/Page.php
Line 685 in ac8e667
However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
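The mismatch can be illustrated with a small toy model. The arrays below merely stand in for the results of PDFObject::getTextArray() and Page::getDataCommands(); the helper pairByIndex() is hypothetical, and the sketch assumes the two lists are matched purely by position, which is a simplification of what getDataTm() does.

```php
<?php
// Toy model only: these arrays are invented stand-ins, not real library output.

// Text-showing commands as a command list would report them ("Do" is absent).
$commands = [
    ['op' => 'Tj', 'pos' => [10, 700]],
    ['op' => 'Tj', 'pos' => [10, 680]],
    ['op' => 'Tj', 'pos' => [10, 660]],
];

// Extracted strings, with one extra entry contributed by a "Do" XObject.
$texts = ['Header', 'XOBJECT TEXT', 'Body', 'Footer'];

/**
 * Pair the i-th command with the i-th text (hypothetical positional matching).
 */
function pairByIndex(array $commands, array $texts): array
{
    $pairs = [];
    foreach ($commands as $i => $command) {
        $pairs[] = [
            'pos'  => $command['pos'],
            'text' => $texts[$i] ?? null,
        ];
    }

    return $pairs;
}

$pairs = pairByIndex($commands, $texts);
foreach ($pairs as $pair) {
    printf("(%d, %d) => %s\n", $pair['pos'][0], $pair['pos'][1], $pair['text']);
}
```

In this model the second position receives the XObject text, 'Body' slides to the third position, and 'Footer' is never paired with any coordinates. That is the same kind of shift as getting text from a neighboring block.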
Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out
$text[] = $xobject->getText($page);
helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.