PDF Debugger Needed #199
Per a Stack Overflow request for a debugger, here are some things that tripped up that OP -- stuff to keep in mind when puzzling out a non-working PDF. These also might eventually end up in the integrity check, although a separate tool (based on PDF::Builder?) to do this might be better. Anyone want to take a crack at this? I found a page that lists a bunch of validation tools that I will need to take a look at -- if they're free to download (they're supposedly Open Source) and run on Windows, I will try them out. There are lots of other pages concerning PDF errors, but most turn out to be about incorrect use of a tool to produce them, and respondents don't seem keen on giving away their trade secrets!
A few comments, as someone who knows PDF intimately:
I think the Adobe Acrobat Reader, downloaded for free in the millions, is close to a de facto reference implementation. Elsewhere, I have described a problem with compressed streams that seems to point to a problem with AAR dealing with compressed 0-length streams, so even the mighty Adobe may have feet of clay. Some day I hope to get around to updating the PDF::Builder code to not compress a stream unless it saves enough bytes to make it worthwhile. That should certainly prove whether or not my suspicions about a compressed 0-length stream are correct. In the meantime, I have three bug reports in the Adobe Community forum that I have received no help on -- if you have any ideas I'd really appreciate hearing from you. There also are some problems with handling compressed TIFF images (using CCITT fax formats), but off the top of my head I don't know if they're unique to AAR.
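The don't-compress-unless-it-pays idea mentioned here can be sketched quickly. This is a minimal illustration in Python (PDF::Builder itself is Perl), and the savings threshold is just an assumed placeholder for whatever overhead figure is chosen:

```python
import zlib

def maybe_compress(data: bytes, min_savings: int = 26) -> tuple:
    """Return (stream_data, compressed_flag); Flate-compress only if it pays off."""
    compressed = zlib.compress(data)
    if len(data) - len(compressed) >= min_savings:
        return compressed, True
    # Includes the 0-length case: an empty stream is never compressed,
    # sidestepping the suspected AAR problem with compressed 0-length streams.
    return data, False
```

A side effect of this policy is that very short streams stay readable in a text editor, which also helps debugging.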
At this point, I will take all the help I can get to resolve these issues. If a compiler can deeply analyze a source program and flag all manner of problems, I would think it no more difficult to analyze a PDF program. Unfortunately, it looks like no one has released such an analyzer, or if they have, they're charging a lot of money for it! Seems like a good GNU-type project, and might be extended into a full Reader, although since AAR is given away...
Hmm. Has anyone released a checklist of what things could be examined? And how could a mismatch between the XObject data and the data stream best be detected? I would suppose the first two things would be to see if any compressed data can be successfully uncompressed using the claimed method (usually, but not always, Flate), and whether the amount of (uncompressed) data in the stream matches the claimed length. An image stream usually has some metadata in front of the raster data, so it's often not easy to see whether the raster data itself is intact.
When it comes to compressed cross-reference table streams and object streams, AAR is quite limited as to which Filters it supports and whether both features have to be used together or not. This is NOT what the spec/standard says, and many other implementations have a full and correct implementation. Without looking at specific PDFs, it's hard to say why 0-length streams would be problematic. Can you post links here or DM me? Note that there are explicit, nuanced requirements and recommendations about EOL placement in and around stream data.

In terms of validators and validation, there is a lot of work going on behind the scenes in developing formal models of PDF (as a file format -- pixel- and color-precise rendering is a whole other ballgame!), but it is very much non-trivial. veraPDF recently previewed some technology based on my Arlington PDF Model (still evolving). Note that this is modelled against the ISO standard and not an implementation, as every implementation has additional permissiveness (cf. your AAR comments above -- but whether this is intentional or not can only be answered by their devs). Such technology is already being used to highlight and report malformed PDFs produced by widely-used software that were otherwise silently supported by viewers, where the application devs did not know their PDFs were technically invalid!

And if you think this is all unique to PDF, please read "Idiosyncrasies of the HTML parser" 😁. Taking a LangSec viewpoint, it's just not possible to document everything that can be wrong with a file format -- you basically cannot trust anything you read from a file!
I talked about the possible AAR problem with compressed 0-length streams (just a guess at this point) in pdfcpu/pdfcpu/issues/684. If you'd care to take a look at it (01compressed.pdf) and let me know if you see anything obvious, that would be much appreciated.
https://www.reportmill.com/snaptea/PDFViewer/ seems to be able to disassemble at least some of the offending PDFs (01compressed.pdf and Bspline.pdf) without trouble, so I'm guessing that AAR is the culprit. Note that this sends your PDF to someone else's server, so don't use it for anything confidential!
Hi, see "FIGURE 4.1 Graphics objects" p. 197 and "TABLE 4.1 Operator categories" p. 196 (according to "PDF Reference 1.7.pdf", i.e. the PDF Reference, sixth edition) -- the "q" and "Q" operators aren't allowed inside text objects. Simply deleting 128 of them from the stream of the Contents array item indexed "5" (they serve no purpose there at all, actually, but be careful not to also delete "Q" from font names :-)) makes this annoying warning by Adobe Reader go away.
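To make the rule concrete, here is a hypothetical content-stream fragment (operators and font name invented for illustration). Under the PDF 1.7 classification the first form is the violation described; the second is the legal equivalent:

```
% Illegal per PDF 1.7: q/Q ("special graphics state") inside a text object
BT
  q
  /F1 12 Tf (Hello) Tj
  Q
ET

% Legal: the save/restore pair brackets the whole text object
q
BT
  /F1 12 Tf (Hello) Tj
ET
Q
```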
That's odd... I'm pretty sure I've used q and Q within text objects before, with no complaint from Adobe. Anyway, that gives me a lead for something to investigate. Thanks for looking into this! Did you use some favorite tool, or just happen to eyeball it? Added: I have used it before, so I wonder why AAR only complained here, and only while the stream was compressed? This particular PDF also has one or more 0-length streams that I'm concerned aren't being read correctly by Adobe when compressed (they actually expand in size). One thing on my "TO DO" list is to force "no compress" on any stream where compression doesn't save at least 26 bytes (to account for the compression information before the stream). I'm not sure I can do anything to detect and eliminate an entire (unused?) object with no stream content, but I'll keep that in mind. This particular (01compressed) PDF is from another package's t-test suite. In both cases, it's quite possible that Adobe is stricter than other PDF Readers, and is complaining about things that the other Readers overlook.
Someone, in good faith of course but for no reason, warned against zero-length streams. You, out of the blue, started to worry whether AAR fails to decompress them, which led to getting busy with "let's prevent filtering of short or zero-length streams to save people 26 bytes". My intention was to stop you from chasing unicorns, barking up the wrong tree, etc. No, I didn't use validation tools. Just from experience: in terms of the severity of (perceived or real) damage to a file, improper nesting in content (plus illegal operators inside a "BT ET" bracket) sits close to the top of the suspects list.
Good question -- either (1) no deflation or (2) re-shuffling the content (further), and the warning in AAR isn't triggered. You can't (and mustn't) try to control these. What should be controlled is adherence to the specification; then neither 1 nor 2, etc., will matter. In the same vein, content array item "2" is full of similar violations:
GS operators are prohibited after "m" i.e. during path construction. AAR is silent about them for now. Who knows when (if) something triggers a dialog to appear. Back to "BT q Q ET". Consider:
(Yes, I know the $s is initialized to a nonsensical sequence, but it's legal.) But either no compression (as I already said) or a number smaller than 172 -- and, behold, no warning in AAR. Why? Who cares?! It's Adobe's idiosyncrasy. What matters is that after changing to the valid "q BT ET Q", neither 172 nor 172000 triggers any warning. Now, what happened is that the chequerboard in the original file is stroked one path per square side (tell me about saving precious bytes), i.e. 256 paths, i.e. more than the "magic" 172. That's all. A pure and (very?) rare coincidence made this little exploration possible. So, thanks :)

Edit: I'd have a closer look at "page_3_g4_GT.pdf", if there are clear steps for how a blob which is supposed to represent G4-compressed data was produced. I had some success in passing tiles from libtiff to assemble a PDF with imagemasks and CCITT compression. For now, AAR and Ghostscript report a broken image; no reason to expect this blob to be valid data.
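The other violation mentioned above -- graphics state operators during path construction -- can be illustrated with a hypothetical fragment (coordinates invented). Per the operator categories in the PDF Reference, once a path has been started with "m", only path construction and path-painting operators may follow:

```
% Illegal: "w" (line width) is a graphics state operator, not allowed
% once path construction has begun with "m"
100 100 m
2 w
200 200 l
S

% Legal: set the graphics state before starting the path
2 w
100 100 m
200 200 l
S
```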
But, but, unicorns are so pretty... OK, (by default) refusing to compress very short streams will go on the back burner at very low priority. I don't need to do it to fix any known bugs, so unless it happens to turn out to be very quick and easy, I won't bother to do it (unless I have absolutely nothing else to do in my life).
Interesting. So you think that some Readers (looking at you, Adobe) may (now or in the future) have trouble with certain mixes of operators in a stream? Well, something to keep in mind, and possibly add a reminder to the documentation. I don't know if it's reasonable to add code to prohibit certain calls within certain parts of a stream; it's probably not worth the effort unless some problem does show up. I think I will add some wording to the documentation about how Readers differ, support different filter sets, and may get upset about certain combinations of operators.
Sigh. So it's just another Adobe idiosyncrasy that makes certain stream content illegal sometimes, and is ignored other times. By the way, "q BT ET Q" is of course valid since save/restore is still within the graphics stream, and not the text stream. I've taken care of this by no-op'ing save() and restore() while in text mode, but there may be other content that will have to be dealt with in the future.
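The guard described here can be sketched as a small state machine. This is an illustrative Python mock-up, not PDF::Builder's actual (Perl) code; the class and method names are made up:

```python
class ContentWriter:
    """Emit content-stream operators, dropping q/Q while inside BT..ET."""

    def __init__(self):
        self.ops = []
        self.in_text = False

    def begin_text(self):
        self.in_text = True
        self.ops.append("BT")

    def end_text(self):
        self.in_text = False
        self.ops.append("ET")

    def save(self):
        if not self.in_text:   # no-op inside a text object
            self.ops.append("q")

    def restore(self):
        if not self.in_text:   # no-op inside a text object
            self.ops.append("Q")

w = ContentWriter()
w.save(); w.begin_text(); w.save(); w.restore(); w.end_text(); w.restore()
# The inner q/Q pair is silently dropped; only the outer pair survives.
```

The trade-off is that a matched q/Q pair inside BT..ET (harmless under ISO 32000-2's classification) is also suppressed, which is the conservative choice for older Readers.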
This PDF was from PDF::TableX's t-tests, so I claim no responsibility for such absurdities! I was just trying to get their code to run with PDF::Builder rather than just PDF::API2, so they could just mention that either could be used. It seems that the drawing of one thing (in this case, a checkerboard square) is pretty much self-contained, and doesn't share resources (such as a font) with other parts. Very inefficient and bloated code results.
Unfortunately, I have no idea how this CCITT Group 4 fax TIFF image was produced. @carygravel is the expert (Graphics::TIFF) on that, so maybe he has something to say (he provided the TIFF to me). BTW, there's already a ticket open (#167) if you want to further discuss this area. As @petervwyatt warned above, it's possible that some (or all) PDF Readers simply chose not to support all filters/compression, such as G4. I still would love to have a PDF debugger that will tell me exactly what is wrong with a PDF file, so I can do something about it. Your expertise and time spent delving into these problems is much appreciated, Vadim.
Maybe the emphasis should be "specification is ultimately supreme, and the onus is on (your module's) user". And grading Readers by what features they support is orthogonal to how permissive they are to bending these features away from the spec.
What do you mean? There are no such things; content can be single stream or array of any number of streams broken anywhere on boundaries of any lexical tokens. "Text object" is operators sequence inside BT ET. The above is valid because text object doesn't contain what it shouldn't. Right :-) ?
I see. Current PDF::Builder embeds a broken image from "g4.tif", because it naively concatenates encoded strips. But I don't see the fixed code where he does the right thing, i.e. decodes/concatenates/re-encodes. If this file is placed into PDF::Builder (plus a couple more it requires), then a valid PDF is produced. I don't know what happened in 2021 -- is it (semi-)finished, and is the author still interested? Encoding to CCITT is done in pure Perl (see the source) and is rather slow. Which is strange, since there's libtiff with a fast encoder at its disposal (for, e.g., those as lazy as I was a few years back, avoiding coding it in Perl or C, and even avoiding comprehending how this encoder works :-). My respect to @carygravel)
I went ahead and added some warnings in Docs.pm to let users know that they might possibly run into problems with certain Readers (filters supported and allowed operators within substreams, such as q/Q within text, or graphics state operators after the start of path construction).
My intent was that presumably 'q' is still in a Graphics Stream, so 'BT' through 'ET' is therefore a Text Stream, and it's back in a Graphics Stream for 'Q'. I don't know what will happen if you insert a 'BT' within an (already) Text Stream, nor can I guarantee what will happen due to popping in and out of Text mode in a stream or jumping from one stream to another. If you're going to do that advanced stuff, you take the responsibility for the consequences.
Hmm. Maybe Cary did the work and forgot to send it over to me for inclusion in PDF::Builder? He did quite a bit of work to improve TIFF processing (using his Graphics::TIFF library), but then he seemed to vanish.
Let me state a few facts about ISO 32000-2:2020 (the latest and most up-to-date spec for PDF):
@petervwyatt, interesting. ISO 32000-2:2020 places the q and Q operators under the "General graphics state" category. The previous "PDF, Version 1.7 (ISO 32000-1:2008)" classified them as "Special graphics state", and therefore not permitted in text objects. Which teaches me a lesson to always get the latest documents. Thanks.
I'm concerned that what I think I'm hearing is that the PDF specification is changing from document to document, for the same PDF version. That doesn't sound good. Even if the version changes, a change in behavior for given PDF code that is more restrictive than in an earlier version is not a good thing. Is that what's happening, or are the documents just becoming more accurate and precise?
No. What you are hearing is that PDF was passed from an organization which controlled the specification (and tied it to their implementation) to ISO, where anyone can participate to clarify the exact and unambiguous meaning of PDF. PDF 2.0 (ISO 32000-2) is the first-ever edition of the PDF specification entirely developed in a vendor-neutral, consensus-based, open forum.
And that is why what you saw when testing did not always match the old specs. (And this behavior is known to vary over time.)
That's fine, but I hope that these participants are taking into account how widely used implementations (even though they're not reference implementations) behave, and not just putting their personal opinions about "how it should work" into the standard. There's no point in breaking a huge number of PDF Writers and Readers without very good reason. Future functionality needs should be occasion for new operators, not redefining existing ones in such a way that breaks existing code. The free download of Adobe's Acrobat Reader is the closest thing we have to a reference implementation, and ISO should have a very good reason to deviate from its behavior when revising the standard.
Some further discussion in https://www.catskilltech.com/utils/show.php?link=pdf-validation (sadly, not currently appendable by anyone other than me). |
A number of times, I have had a PDF produced by PDF::Builder fail to properly display in Adobe's free Acrobat Reader. The error message (upon page load) can be infuriatingly vague, such as "An error exists on this page. Acrobat may not display the page correctly. Please contact...". Sometimes it displays correctly after all, sometimes not. Sometimes it will ask permission to write the PDF back out, apparently due to "fixes" it made. Sometimes other Readers behave in much the same way, other times not.
I have posted my problem PDF questions in the Adobe community support forum (community.adobe.com, as user Phil28073338r0c0), but rarely have received any useful information, and often no reply at all. That community is almost completely useless. If I could just get some decent diagnostic information on what a Reader might be objecting to, I could fix the problem in PDF::Builder myself (or confirm that the problem is actually in Adobe's product).
I have the above "maybe" error in PDF::TableX (when using PDF::Builder with it), and only when flate compression is turned on, and only with Acrobat Reader. In TIFF development (with @carygravel) I have several PDFs that fail to display, but only under Acrobat Reader. There are some vague complaints about truncated image data. Adobe Acrobat Reader (free, downloadable) is the Gold Standard, so I really want it to be happy with PDFs produced by PDF::Builder.
I am aware of any number of PDF "repair" tools available, but they usually completely rewrite the PDF, leaving me clueless as to what (and how) they fixed it. I just need some good diagnostics that say, "Ah ha! You are missing such-and-such necessary field in object N." or "The stream compression in object N contains insufficient data (is truncated). Try this and try that.".
I maintain PDF::Builder as a free resource, so I'm not about to drop serious money on a tool. A free debugger is preferred, either online or downloadable. It can even be from Adobe. Does the free Adobe [Acrobat] Reader contain any hidden debugging/diagnostic tools that I can make use of? How about any other PDF Readers?