PDF Debugger Needed #199

Open
PhilterPaper opened this issue Aug 24, 2023 · 19 comments
Labels
general discussion roadmaps, etc., discuss direction question how do I... ?

Comments

@PhilterPaper
Owner

A number of times, I have had a PDF produced by PDF::Builder fail to properly display in Adobe's free Acrobat Reader. The error message (upon page load) can be infuriatingly vague, such as "An error exists on this page. Acrobat may not display the page correctly. Please contact...". Sometimes it displays correctly after all, sometimes not. Sometimes it will ask permission to write the PDF back out, apparently due to "fixes" it made. Sometimes other Readers behave in much the same way, other times not.

I have posted my problem PDF questions in the Adobe community support forum (community.adobe.com, as user Phil28073338r0c0), but rarely have received any useful information, and often no reply at all. That community is almost completely useless. If I could just get some decent diagnostic information on what a Reader might be objecting to, I could fix the problem in PDF::Builder myself (or confirm that the problem is actually in Adobe's product).

I have the above "maybe" error in PDF::TableX (when using PDF::Builder with it), and only when flate compression is turned on, and only with Acrobat Reader. In TIFF development (with @carygravel) I have several PDFs that fail to display, but only under Acrobat Reader. There are some vague complaints about truncated image data. Adobe Acrobat Reader (free downloadable) is the Gold Standard, so I really want it to be happy with PDFs produced by PDF::Builder.

I am aware of any number of PDF "repair" tools available, but they usually completely rewrite the PDF, leaving me clueless as to what (and how) they fixed it. I just need some good diagnostics that say, "Ah ha! You are missing such-and-such necessary field in object N." or "The stream compression in object N contains insufficient data (is truncated). Try this and try that.".

I maintain PDF::Builder as a free resource, so I'm not about to drop serious money on a tool. A free debugger is preferred, either online or downloadable. It can even be from Adobe. Does the free Adobe [Acrobat] Reader contain any hidden debugging/diagnostic tools that I can make use of? How about any other PDF Readers?

@PhilterPaper PhilterPaper added question how do I... ? general discussion roadmaps, etc., discuss direction labels Aug 24, 2023
@PhilterPaper
Owner Author

Per a Stack Overflow request for a PDF debugger, some things that tripped up that OP were:

  1. references pointing to the wrong objects
  2. image size (/Height, /Width) non-integer
  3. image size in points, not whole pixels
  4. cross reference directory with incorrect offsets
  5. extraneous content near the trailer

Stuff to keep in mind when puzzling out a non-working PDF. These also might eventually end up in the integrity check, although a separate tool (based on PDF::Builder?) to do this might be better. Anyone want to take a crack at this?
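As a rough illustration of how items 4 and 5 in the list above could be checked mechanically, here is a minimal sketch in Python (standard library only). It assumes a classic cross-reference table (not a cross-reference stream), looks only at the last `startxref`, and does not follow `/Prev` chains; the function name `check_xref` is made up for this example and is not part of PDF::Builder.

```python
import re

EOL = rb"(?:\r\n|\r|\n)"

def check_xref(pdf: bytes) -> list:
    """Return a list of problems found in the last classic xref table."""
    problems = []
    last = None
    for last in re.finditer(rb"startxref\s+(\d+)\s+%%EOF", pdf):
        pass  # keep only the final match
    if last is None:
        return ["no 'startxref <offset> %%EOF' sequence found"]
    if pdf[last.end():].strip():
        problems.append("extraneous content after the final %%EOF")
    xref_pos = int(last.group(1))
    if pdf[xref_pos:xref_pos + 4] != b"xref":
        problems.append(f"startxref offset {xref_pos} does not point at 'xref'")
        return problems
    # Each subsection is a 'first count' header followed by 20-byte entries.
    header = re.compile(rb"\s*(\d+) +(\d+) *" + EOL)
    entry = re.compile(rb"(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)")
    pos = xref_pos + 4
    while True:
        h = header.match(pdf, pos)
        if h is None:
            break  # reached 'trailer' (or something unexpected)
        first, count = int(h.group(1)), int(h.group(2))
        pos = h.end()
        for num in range(first, first + count):
            e = entry.match(pdf, pos)
            if e is None:
                problems.append(f"malformed xref entry for object {num}")
                return problems
            pos = e.end()
            if e.group(3) == b"n":  # in-use entry: offset must hit 'N G obj'
                offset, gen = int(e.group(1)), int(e.group(2))
                want = re.compile(rb"%d\s+%d\s+obj" % (num, gen))
                if want.match(pdf, offset) is None:
                    problems.append(
                        f"object {num} {gen}: offset {offset} does not "
                        f"point at '{num} {gen} obj'")
    return problems
```

Calling `check_xref(open("some.pdf", "rb").read())` on a file would return an empty list when every in-use entry lands on its object header, and one message per mismatch otherwise.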

I found a page that lists a bunch of validation tools that I will need to take a look at -- if they're free to download (they're supposedly Open Source) and run on Windows, I will try them out.

There are lots of other pages concerning PDF errors, but most turn out to be incorrect use of a tool to produce them, and respondents don't seem keen on giving away their trade secrets!

@petervwyatt

A few comments, as someone who knows PDF intimately:

  • PDF does not have any reference implementation - it is solely a specification-driven format. If you follow what one or more vendors appear to be doing, then you may well be doing the wrong thing.

  • sometimes when you get an error dialog in Adobe Reader or Acrobat you can hold down CTRL and click OK and you might get a second dialog telling you something more specific about the issue (e.g. "Expected dict", etc.). Sometimes this helps, sometimes it doesn't. If you have Acrobat, then run the Syntax Check preflight profile and it should hopefully tell you the specifics from Adobe's PoV (and which you should always check against the spec).

  • unfortunately you are correct in your assessment that many PDF processors keep their "special repair sauce" secret - meaning you cannot know what malforms they detect and/or recover from, what private data they might additionally use, etc.

  • there are multiple "layers" of PDF validation - lexical analysis errors, the overall file layout/structure itself (e.g. cross-reference table, incremental updates, validity of Linearization, etc. - this is what I call "pre-DOM" since it directly influences what the DOM will look like); the PDF DOM objects (e.g. some dict missing required keys or having invalid type/values) and then the relationship of otherwise valid PDF DOM objects to the many nested/internal formats (e.g. a bad FLATE stream, corrupted ICC Profile, etc). If the file layout/structure has issues, then nothing can be said about the other two layers since there are no normatively defined repair or recovery algorithms - and there are certainly demonstrable cases of repaired PDFs differing very significantly between implementations. This is also where the most "silent recovery" happens by viewers and there is no visibility of how/what they recovered (in some cases, you might be able to convince yourself that viewers entirely ignore valid cross-reference tables!). The PDF DOM is the most commonly reported by validators as it is thought to be simple, although mileage varies greatly on how "correct" they are and whether all issues are detected. And the last one is rarely done by PDF validators as you need to extract the data stream and use external dedicated tooling to assess the data stream against whatever spec it is supposed to be.

  • errors such as mismatched images (where the PDF Image XObject data is different to the embedded data stream) are not covered by the specification and viewers are free to do their own recovery or not - any recovery will also vary between implementations AND between versions of the same implementation AND between the same implementation across different platforms! I strongly suggest not trying to emulate any specific vendor as that will be a thankless never-ending task...

@PhilterPaper
Owner Author

PDF does not have any reference implementation

I think the Adobe Acrobat Reader, downloaded for free in the millions, is close to a de facto reference implementation. Elsewhere, I have described a problem with compressed streams that seems to point to a problem with AAR dealing with compressed 0-length streams, so even the mighty Adobe may have feet of clay. Some day I hope to get around to updating the PDF::Builder code to not compress a stream unless it saves enough bytes to make it worthwhile. That should certainly prove whether or not my suspicions about a compressed 0-length stream are correct.

In the meantime, I have three bug reports in the Adobe Community forum that I have received no help on -- if you have any ideas I'd really appreciate hearing from you. There also are some problems with handling compressed TIFF images (using CCITT fax formats), but off the top of my head I don't know if they're unique to AAR.

there are multiple "layers" of PDF validation

At this point, I will take all the help I can get to resolve these issues. If a compiler can deeply analyze a source program and flag all manner of problems, I would think it no more difficult to analyze a PDF program. Unfortunately, it looks like no one has released such an analyzer, or if they have, they're charging a lot of money for it! Seems like a good GNU-type project, and might be extended into a full Reader, although since AAR is given away...

errors such as mismatched images (where the PDF Image XObject data is different to the embedded data stream) are not covered by the specification

Hmm. Has anyone released a checklist of what things could be examined? And how a mismatch between the XObject data and the data stream could best be detected? I would suppose the first two things would be to see if any compressed data can be successfully uncompressed using the claimed method (usually not Flate), and if the amount of (uncompressed) data in the stream matches the claimed length. An image stream usually has some metadata in front of the raster data, so it's often not easy to see if height x width matches up with the amount of data. It sounds like one would have to completely decode the image stream and see if it matches the object metadata -- a major project, as there are many image formats and compression methods.
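For the simplest case, the two checks suggested above (does the data decompress, and does its size match the declared dimensions?) might look like the following sketch. It assumes raw sample data behind a single FlateDecode filter with no `/Predictor` or `/Decode` parameters, and one of the three device color spaces; DCT, JPX and CCITT data would need their own decoders. The function name and its parameters are hypothetical.

```python
import zlib

# Components per sample for the device color spaces handled here.
COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}

def check_flate_image(width, height, bits_per_component, color_space, stream_bytes):
    """Return a list of problems for a Flate-compressed raw image stream."""
    problems = []
    for key, value in (("Width", width), ("Height", height)):
        # /Width and /Height must be whole pixel counts, not points.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"/{key} must be a positive integer, got {value!r}")
    if problems:
        return problems
    try:
        raw = zlib.decompress(stream_bytes)
    except zlib.error as exc:
        return [f"Flate data does not decompress: {exc}"]
    # Each row of samples is padded to a whole number of bytes.
    bits_per_row = width * COMPONENTS[color_space] * bits_per_component
    expected = ((bits_per_row + 7) // 8) * height
    if len(raw) != expected:
        problems.append(f"expected {expected} bytes of samples, found {len(raw)}")
    return problems
```

An empty list means the stream decompresses and holds exactly as many bytes as the dictionary implies; it says nothing about whether the pixels are the intended ones.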

@petervwyatt

When it comes to compressed cross-reference table streams and object streams, AAR is quite limited as to which Filters it needs/supports and whether both features have to be used together or not. This is NOT what the spec/standard says, and many other implementations have a full and correct implementation.

Without looking at specific PDFs, it's hard to say why 0-length streams would be problematic. Can you post links here or DM me? Note that there are explicit nuanced requirements and recommendations about EOL placement in/around stream, endstream and endobj keywords.
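A small sketch of what checking those EOL rules could look like for a single stream object, zero-length streams included. It assumes the object's bytes have already been isolated and that `/Length` is a direct integer (indirect lengths are reported, not resolved). The rules encoded here are the ones I am sure of: `stream` is followed by CRLF or LF but not a lone CR, and an EOL before `endstream` is recommended and not counted in `/Length`. The function name is invented for illustration.

```python
import re

def check_stream_object(obj: bytes) -> list:
    """Return a list of problems with one stream object's framing."""
    kw = re.search(rb">>\s*stream", obj)
    if kw is None:
        return ["no stream keyword found after the dictionary"]
    dictionary = obj[:kw.start()]
    # The stream keyword must be followed by CRLF or LF, never CR alone.
    eol = re.compile(rb"\r\n|\n").match(obj, kw.end())
    if eol is None:
        return ["'stream' is not followed by CRLF or LF"]
    # Accept only a direct integer, not an indirect reference like '5 0 R'.
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", dictionary)
    if length is None:
        return ["no direct integer /Length (indirect lengths are not resolved here)"]
    data_end = eol.end() + int(length.group(1))
    tail = re.compile(rb"(\r\n|\r|\n)?endstream").match(obj, data_end)
    if tail is None:
        return [f"/Length {int(length.group(1))} does not land on 'endstream'"]
    if tail.group(1) is None:
        return ["no EOL before 'endstream' (recommended, not counted in /Length)"]
    return []
```

For example, `b"<< /Length 0 >>\nstream\n\nendstream"` passes, while the same object without the blank line gets the "no EOL" note.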

In terms of validators and validation, there is a lot of work going on behind the scenes in developing formal models of PDF (as a file format - pixel- and color-precise rendering is a whole other ballgame!), but it is very much non-trivial. veraPDF recently previewed some technology based on my Arlington PDF Model (still evolving). Note that this is modelled against the ISO standard and not an implementation, as every implementation has additional permissiveness (cf. your AAR comments above - but whether this is intentional or not can only be answered by their devs). Such technology is already being used to highlight and report malformed PDFs produced by widely-used software that were otherwise silently supported by viewers and the application devs did not know their PDFs were technically invalid! And if you think this is all unique to PDF, please read "Idiosyncrasies of the HTML parser" 😁. Taking a LangSec viewpoint, it's just not possible to document everything that can be wrong with a file format - you basically cannot trust anything you read from a file!

@PhilterPaper
Owner Author

I talked about the possible AAR problem with compressed 0-length streams (just a guess at this point) in pdfcpu/pdfcpu/issues/684. If you'd care to take a look at it (01compressed.pdf) and let me know if you see anything obvious, that would be much appreciated.

@PhilterPaper
Owner Author

PhilterPaper commented Sep 17, 2023

https://www.reportmill.com/snaptea/PDFViewer/ seems to be able to disassemble at least some of the offending PDFs (01compressed.pdf and Bspline.pdf) without trouble, so I'm guessing that AAR is the culprit. Note that this sends your PDF to someone else's server, so don't use it for anything confidential!

@vadim-160102

(01compressed.pdf)

Hi, see "FIGURE 4.1 Graphics objects" p. 197 and "TABLE 4.1 Operator categories" p. 196 (according to "PDF Reference 1.7.pdf" (i.e. PDF Reference sixth edition)) -- the "q", "Q" operators aren't allowed inside text objects. Simply deleting 128 of them from stream of Contents array item indexed "5" (they serve no purpose there at all, actually, but be careful not to also delete "Q" from font names :-)) makes this annoying warning by Adobe Reader go away.

@PhilterPaper
Owner Author

PhilterPaper commented Oct 21, 2023

That's odd... I'm pretty sure I've used q and Q within text objects before, with no complaint from Adobe. Anyway, that gives me a lead for something to investigate. Thanks for looking into this! Did you use some favorite tool, or just happened to eyeball it?

Add: I have used it before, so I wonder why it (AAR) only complained here, and only while it was compressed? Anyway, I think you found the problem, and removing text save() and restore() seems to fix the problem in PDF::TableX. I have pushed the changes to GitHub and they will be in the 3.026 release. Thanks again!

This particular PDF also has one or more 0-length streams that I'm concerned aren't being read correctly by Adobe when compressed (and actually expand in size). One thing on my "TO DO" list is to force "no compress" on any stream where it doesn't save at least 26 bytes (to account for the compression information before the stream). I'm not sure I can do anything to detect and eliminate the entire (unused?) object with no stream content, but I'll keep that in mind. This particular (01compressed) PDF is from another package's t-test suite.
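The "only compress when it pays" idea from the TO DO item can be sketched in a few lines. This is Python with zlib, not PDF::Builder code; the 26-byte threshold is simply the figure quoted above, and `maybe_compress` is a hypothetical helper name.

```python
import zlib

# Assumed overhead of declaring the filter in the stream dictionary,
# taken from the figure mentioned in the comment above.
FILTER_OVERHEAD = 26

def maybe_compress(data: bytes, min_saving: int = FILTER_OVERHEAD):
    """Return (bytes_to_write, use_flate_filter)."""
    packed = zlib.compress(data)
    if len(data) - len(packed) >= min_saving:
        return packed, True
    # Compression would not save enough (or would grow the data): store as-is.
    return data, False
```

A zero-length stream always takes the second branch, because deflating nothing still produces a few bytes of zlib framing, so it would be written without a `/Filter` entry.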

In both cases, it's quite possible that Adobe is stricter than other PDF Readers, and is complaining about things that the other Readers overlook.

@vadim-160102

vadim-160102 commented Oct 22, 2023

Someone, in good faith of course but for no reason, warned against zero-length stream. You, out of the blue, started to worry if AAR fails to decompress it. Which led to getting busy with "let's prevent short|zero streams filtering to save people 26 bytes". My intention was to stop you from chasing unicorns, barking the wrong tree, etc.

No, I didn't use validation tools. Just from experience: the severity of the (perceived or real) damage to the file put improper nesting in the content (plus illegal operators inside a "BT ET" bracket) close to the top of the list of suspects.

why it (AAR) only complained here, and only while it was compressed?

Good question -- either (1) no deflation or (2) re-shuffling content (further), and the warning in AAR isn't triggered. You can't (and mustn't) control these. What should be controlled is adherence to the specification. Then neither 1 nor 2 etc. will matter.

In the same vein, content array item "2" is full of similar violations:

28.346 813.04 m 1 w 0 0 0 RG 95.669 813.04 l S

GS operators are prohibited after "m" i.e. during path construction. AAR is silent about them for now. Who knows when (if) something triggers a dialog to appear.
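Both kinds of violation (operators that do not belong inside a BT/ET text object, and graphics state operators in the middle of path construction) can be spotted by a small state machine over the content stream tokens. The sketch below is only that, a sketch: it assumes no inline images (BI/ID/EI) and no nested parentheses inside literal strings, and it treats the set of operators forbidden in text objects as a parameter, defaulting to the "special graphics state" operators q, Q and cm as read from the PDF 1.7 Reference. `lint_content` and its constants are hypothetical names.

```python
import re

TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string (nested parens not handled)
  | <<|>>|\[|\]               # dictionary and array delimiters
  | <[0-9A-Fa-f\s]*>          # hex string
  | /[^\s/<>\[\]()%]*         # name, e.g. a font called /Q
  | %[^\r\n]*                 # comment
  | [^\s/<>\[\]()%]+          # number, keyword or operator
""", re.X | re.S)
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")

PATH_BUILD = {b"m", b"l", b"c", b"v", b"y", b"h", b"re"}
PATH_PAINT = {b"S", b"s", b"f", b"F", b"f*", b"B", b"B*", b"b", b"b*", b"n"}
CLIP = {b"W", b"W*"}
SPECIAL_GS = {b"q", b"Q", b"cm"}

def lint_content(stream: bytes, forbidden_in_text=SPECIAL_GS) -> list:
    """Return (byte_offset, operator, context) for each suspicious operator."""
    state, findings = "page", []
    for m in TOKEN.finditer(stream):
        tok = m.group()
        # Skip operands and comments; only bare keywords are operators.
        if tok[:1] in b"(/<[]>%" or NUMBER.fullmatch(tok):
            continue
        if tok in (b"true", b"false", b"null"):
            continue
        if state == "path":
            if tok in PATH_PAINT:
                state = "page"          # painting ends the path object
            elif tok not in PATH_BUILD and tok not in CLIP:
                findings.append((m.start(), tok.decode("latin-1"), "path"))
        elif state == "text":
            if tok == b"ET":
                state = "page"
            elif (tok == b"BT" or tok in forbidden_in_text
                  or tok in PATH_BUILD or tok in PATH_PAINT or tok in CLIP):
                findings.append((m.start(), tok.decode("latin-1"), "text"))
        elif tok in PATH_BUILD:
            state = "path"
        elif tok == b"BT":
            state = "text"
    return findings
```

Run on the line quoted above, it reports `w` and `RG` as occurring during path construction. Passing `forbidden_in_text=set()` switches off the q/Q/cm complaint for anyone working from an edition of the specification that classifies those operators differently.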

Back to "BT q Q ET". Consider:


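use strict;
use warnings;
use PDF::Reuse;

prFile( 'temp_reuse.pdf' );
prCompress( 1 );                      # Flate-compress the content stream
# 172 degenerate paths: moveto, lineto, end path without painting
my $s = "0 0 m 0 0 l n\n" x 172;
# q/Q placed inside a BT ... ET text object
$s .= "BT q Q ET\n";
prAdd( $s );
prEnd;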

(Yes, I know the $s is initialized to nonsensical sequence, but it's legal)

But either no compression (as I already said) or a number smaller than 172 -- and, behold, no warning in AAR. Why? Who cares?! It's Adobe's idiosyncrasy. What matters is that after changing to the valid "q BT ET Q", neither 172 nor 172000 triggers any warning.

Now, what happened is that chequerboard in original file is stroked one path per square side (tell me about saving precious bytes), i.e. 256 paths i.e. more than "magic" 172. That's all. Pure and (very?) rare coincidence made it possible for this little exploration. So, thanks :)

Edit: I'd have a closer look at "page_3_g4_GT.pdf" if there were clear steps showing how the blob that is supposed to represent G4-compressed data was produced. I had some success in passing tiles from libtiff to assemble a PDF with imagemasks and CCITT compression. For now, AAR and Ghostscript report a broken image, and there is no reason to expect this blob to be valid data.

@PhilterPaper
Owner Author

My intention was to stop you from chasing unicorns, barking [up] the wrong tree, etc.

But, but, unicorns are so pretty... OK, (by default) refusing to compress very short streams will go on the back burner at very low priority. I don't need to do it to fix any known bugs, so unless it happens to turn out to be very quick and easy, I won't bother to do it (unless I have absolutely nothing else to do in my life).

GS operators are prohibited after "m" i.e. during path construction.

Interesting. So you think that some Readers (looking at you, Adobe) may (now or future) have trouble with certain mixes of operators in a stream? Well, something to keep in mind, and possibly add a reminder to the documentation. I don't know if it's reasonable to add code to prohibit certain calls within certain parts of a stream. It's probably not reasonable to put in the effort unless some problem does show up. I think I will add some wording to the documentation about how Readers differ and support different filter sets and may get upset about certain combinations of operators.

But either no compression (as I already said) or a number smaller than 172 -- and, behold, no warning in AAR. Why? Who cares?! It's Adobe's idiosyncrasy. What matters is that after changing to the valid "q BT ET Q", neither 172 nor 172000 triggers any warning.

Sigh. So it's just another Adobe idiosyncrasy that makes certain stream content illegal sometimes, and is ignored other times. By the way, "q BT ET Q" is of course valid since save/restore is still within the graphics stream, and not the text stream. I've taken care of this by no-op'ing save() and restore() while in text mode, but there may be other content that will have to be dealt with in the future.

tell me about saving precious bytes

This PDF was from PDF::TableX's t-tests, so I claim no responsibility for such absurdities! I was just trying to get their code to run with PDF::Builder rather than just PDF::API2, so they could just mention that either could be used. It seems that the drawing of one thing (in this case, a checkerboard square) is pretty much self-contained, and doesn't share resources (such as a font) with other parts. Very inefficient and bloated code results.

I'd have a closer look at "page_3_g4_GT.pdf" if there were clear steps showing how the blob that is supposed to represent G4-compressed data was produced.

Unfortunately, I have no idea how this CCITT Group 4 fax TIFF image was produced. @carygravel is the expert (Graphics::TIFF) on that, so maybe he has something to say (he provided the TIFF to me). BTW, there's already a ticket open (#167) if you want to further discuss this area. As @petervwyatt warned above, it's possible that some (or all) PDF Readers simply chose not to support all filters/compression, such as G4.

I still would love to have a PDF debugger that will tell me exactly what is wrong with a PDF file, so I can do something about it. Your expertise and time spent delving into these problems is much appreciated, Vadim.

@vadim-160102

I think I will add some wording to the documentation about how Readers differ and support different filter sets and may get upset about certain combinations of operators.

Maybe the emphasis should be "specification is ultimately supreme, and the onus is on (your module's) user". And grading Readers by what features they support is orthogonal to how permissive they are to bending these features away from the spec.

By the way, "q BT ET Q" is of course valid since save/restore is still within the graphics stream, and not the text stream.

What do you mean? There are no such things; content can be single stream or array of any number of streams broken anywhere on boundaries of any lexical tokens. "Text object" is operators sequence inside BT ET. The above is valid because text object doesn't contain what it shouldn't. Right :-) ?

(#167)

I see. Current PDF::Builder embeds broken image from "g4.tif", because it naively concatenates encoded strips. But I don't see fixed code from

https://github.com/carygravel/Perl-PDF-Builder/blob/81a7993265896e364b6d8a79b0f76ee8278986c3/lib/PDF/Builder/Resource/XObject/Image/TIFF_GT.pm

where he does the right thing, i.e. decodes/concatenates/encodes. If this file is placed into PDF::Builder (plus a couple more that it requires), then a valid PDF is produced. I don't know what happened in 2021, whether it is (semi-)finished, and whether the author is still interested. Encoding to CCITT is done in pure Perl (see source) and is rather slow. Which is strange, since libtiff with its fast encoder is at his disposal anyway (e.g. for those as lazy as I was a few years back, who avoid coding it in either Perl or C, or even comprehending how this encoder works :-). My respect to @carygravel)

@PhilterPaper
Owner Author

I think I will add some wording to the documentation about how Readers differ and support different filter sets and may get upset about certain combinations of operators.

I went ahead and added some warnings in Docs.pm to let users know that they might possibly run into problems with certain Readers (filters supported and allowed operators within substreams, such as q/Q within text, or graphics state operators after the start of path construction).

By the way, "q BT ET Q" is of course valid since save/restore is still within the graphics stream, and not the text stream.

What do you mean? There are no such things; content can be single stream or array of any number of streams broken anywhere on boundaries of any lexical tokens. "Text object" is operators sequence inside BT ET. The above is valid because text object doesn't contain what it shouldn't. Right :-) ?

My intent was that presumably 'q' is still in a Graphics Stream, so 'BT' through 'ET' is therefore a Text Stream, and it's back in a Graphics Stream for 'Q'. I don't know what will happen if you insert a 'BT' within an (already) Text Stream, nor can I guarantee what will happen due to popping in and out of Text mode in a stream or jumping from one stream to another. If you're going to do that advanced stuff, you take the responsibility for the consequences.

Current PDF::Builder embeds broken image from "g4.tif", because it naively concatenates encoded strips. But I don't see fixed code from (@carygravel 's TIFF work area)

Hmm. Maybe Cary did the work and forgot to send it over to me for inclusion in PDF::Builder? He did quite a bit of work to improve TIFF processing (using his Graphics::TIFF library), but then he seemed to vanish.

@petervwyatt

Let me state a few facts about ISO 32000-2:2020 (the latest and most up-to-date spec for PDF):

  • Table 50 and Figure 9 are normative for which operators can legally occur within which "states" (graphics objects)
  • silently ignoring certain operators under specific circumstances is also normative - e.g. see text below Table 73
  • thus q/Q can legally occur within a BT/ET text object, however that does not mean that absolutely anything goes - the rules for text state parameters and operators (subclause 9.3) must still be obeyed (specifically for text rendering mode).
  • you definitely cannot have an explicit Do operator directly in a BT/ET text object (see Figure 9) - but this can implicitly happen through invoked patterns/shadings and Type3 glyph descriptions
  • anything not permitted by the normative language of the PDF standard is "out of scope". The standard does not define recovery and error handling except in very very few places. And attempting to reverse engineer and emulate other implementations may be very dangerous...

@vadim-160102

@petervwyatt, interesting. ISO 32000-2:2020 places the q, Q operators under "General graphics state" category. The previous "PDF, Version 1.7 (ISO 32000-1:2008)" classified them as "Special graphics state" and therefore not permitted in text objects. Which teaches me a lesson to get latest documents. Thanks.

@PhilterPaper
Owner Author

I'm concerned that what I think I'm hearing is that the PDF specification is changing from document to document, for the same PDF version. That doesn't sound good. Even if the version changes, a change in behavior for given PDF code that is more restrictive than in an earlier version is not a good thing. Is it what's happening, or are the documents just becoming more accurate and precise?

@petervwyatt

No. What you are hearing is that the PDF was passed from an organization which controlled the specification to their implementation, to ISO where anyone can participate to clarify the exact and unambiguous meaning of PDF. PDF 2.0 (ISO 32000-2) is the first-ever edition of the PDF specification which was entirely developed in a vendor-neutral, consensus based, open forum.

@petervwyatt

petervwyatt commented Oct 25, 2023

And that is why what you saw when testing did not always match the old specs. (And this behavior is known to vary over time).
At ISO we have tried very hard to document what PDF unambiguously and precisely means.

@PhilterPaper
Owner Author

anyone can participate to clarify the exact and unambiguous meaning of PDF.

That's fine, but I hope that these participants are taking into account how widely used implementations (even though they're not reference implementations) behave, and not just putting their personal opinions about "how it should work" into the standard. There's no point in breaking a huge number of PDF Writers and Readers without very good reason. Future functionality needs should be occasion for new operators, not redefining existing ones in such a way that breaks existing code.

The free download of Adobe's Acrobat Reader is the closest thing we have to a reference implementation, and ISO should have a very good reason to deviate from its behavior when revising the standard.

@PhilterPaper
Owner Author

Some further discussion in https://www.catskilltech.com/utils/show.php?link=pdf-validation (sadly, not currently appendable by anyone other than me).
