Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement missing cm command #720

Merged
merged 3 commits into from
Jul 24, 2024
Merged

Implement missing cm command #720

merged 3 commits into from
Jul 24, 2024

Conversation

DominikDostal
Copy link
Contributor

Type of pull request

  • Bug fix (involves code and configuration changes)

About

When printing a landscape word document as pdf, it will be created with a concatenation matrix (cm command) which is completely ignored by the pdfparser.
This PR aims to add it.

I tested this with this document:
This is just a test - Printed As.pdf

With this code:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent(file_get_contents("C:\Users\dd\Desktop\This is just a test - Printed As.pdf"));
$pages = $pdf->getPages();
foreach($pages as $page) {
    $pageData = $page->getDataTm();
    var_dump('<pre>', $pageData, '</pre>');
}

Before i got these results for the text:

  array(2) {
    [0]=>
    array(6) {
      [0]=>
      string(1) "1"
      [1]=>
      string(1) "0"
      [2]=>
      string(8) "0.000000"
      [3]=>
      string(2) "-1"
      [4]=>
      string(9) "78.879997"
      [5]=>
      string(10) "126.559998"
    }
    [1]=>
    string(20) "This is just a test "
  }

After my change i get the following result:

  array(2) {
    [0]=>
    array(6) {
      [0]=>
      float(0.75)
      [1]=>
      float(0)
      [2]=>
      float(0)
      [3]=>
      float(0.75)
      [4]=>
      float(59.159997750000002)
      [5]=>
      float(500.40000850000001)
    }
    [1]=>
    string(20) "This is just a test "
  }

If i open the pdf in adobe acrobat reader and measure the distances:
image

Converted to PDF units:
~20.02mm * 2.83465 = ~56,75
~176.06mm * 2.83465 = ~499
(deviation is because i didnt measure it precise enough in adobe)

@DominikDostal
Copy link
Contributor Author

It seems that i will have to test it with the documents from the unit tests and see why the result differs.

Will do that when i find the time

@DominikDostal
Copy link
Contributor Author

So the problem was that i also needed to implement the q and Q commands to save/load the graphics state to/from the stack.

The values are getting closer, but because of math inaccuracy they still differ:

2) PHPUnitTests\Integration\PageTest::testGetDataTm
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    0 => '0.999429'
-    1 => '0'
-    2 => '0'
-    3 => '1'
-    4 => '201.96'
-    5 => '720.68'
+    0 => 0.9994286002284
+    1 => 0.0
+    2 => 0.0
+    3 => 0.9999996
+    4 => 201.959919216
+    5 => 720.6797117279999
 )

The question is how i should handle this:

  1. Use the new float values as the expected result
  2. Try to change how the result is returned (round, convert to string)

Can anyone give me some advice here?

@k00ni
Copy link
Collaborator

k00ni commented Jun 7, 2024

@DominikDostal thank you for your effort. I don't have much time right now, but will see that I can back to you soon.

Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.

The question is how i should handle this:

  1. Use the new float values as the expected result
  2. Try to change how the result is returned (round, convert to string)
    Can anyone give me some advice here?

I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.

@DominikDostal
Copy link
Contributor Author

Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.

I based it on https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf Chapter 8.4.4 Graphics State Operators.
image

I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.

I guess the document used for some of the PageTests is a special case:
Document1_pdfcreator_nocompressed.pdf text positions are extracted without using the concatenating matrices inside.
So why were the coordinates correct?
Because it uses 2 different concatenating matrices:

  • [0.12 0 0 0.12 0 0]
  • [8.33333 0 0 8.33333 0 0]

When you multiply them, the result is:
[0.9999996 0 0 0.9999996 0 0]
which is very close to the matrix that changes nothing when multiplied with: [1 0 0 1 0 0]

Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.

This is btw how the calculation in the first part of the document looks like:
The CTM (Current Transformation Matrix) starts as [1 0 0 1 0 0])

q - Saved CTM to stack: [1 0 0 1 0 0]
cm - Adding concat matrix: [0.12 0 0 0.12 0 0] - CTM * cm = [0.12 0 0 0.12 0 0]
q - Saved CTM to stack: [0.12 0 0 0.12 0 0]
cm - Adding concat matrix: [8.33333 0 0 8.33333 0 0] - CTM * cm = [0.9999996 0 0 0.9999996 0 0]
BT
Tf
Tm - Text Matrix: [0.999429 0 0 1 201.96 720.68] - Tm * CTM = [0.9994286002284 0 0 0.9999996 201.959919216 720.679711728]
TJ - Text: Document title
ET
Q - Loaded CTM from stack: [0.12 0 0 0.12 0 0]

@GreyWyvern
Copy link
Contributor

I already went through all of this with the PDFObject parsing to account for cm commands, so I know how you feel! getDataTm is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.

I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.

@DominikDostal
Copy link
Contributor Author

I already went through all of this with the PDFObject parsing to account for cm commands, so I know how you feel! getDataTm is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.

I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.

Oh i see.
But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along? In my code i added them together (or rather multiplied them together) and at least in the test-pdf that is already in the repository it got the correct result.

$current_position_cm = [

@GreyWyvern
Copy link
Contributor

Oh i see. But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along?

You're right, of course, and I should add that. But my end didn't have to worry about also storing the exact positioning matrix along with each bit of text. It only needed to know did it move enough to warrant adding a line-break. :)

@k00ni
Copy link
Collaborator

k00ni commented Jun 21, 2024

@DominikDostal I am not sure, where we stand here. Please tell me which points are open to discuss and what you need to finalize the PR.

Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.

Well, in general I would keep as much data as provided. Sure in this case, rounding seems obvious but it may have different ramifications in other cases. Therefore my statement about keep raw data. Developers can round the values themselves if they want.

Can you provide a test which fails without your changes?

@k00ni
Copy link
Collaborator

k00ni commented Jul 19, 2024

@DominikDostal Please give us an update how you want to proceed here and cover the remaining points.

@DominikDostal
Copy link
Contributor Author

@k00ni Im sorry about not being active here, I was sick in between and had to catch up with work (and then forgot 😞 )

I pushed a commit that adds a new document to test CM Commands, but i also had to fix some existing tests (since now we get more decimal numbers than before). I wasnt sure if rounding the results or changing the expected results was better.

I also had to add an error margin of 0.01 to the getTextXY tests because of the new results.

@k00ni
Copy link
Collaborator

k00ni commented Jul 23, 2024

Can I assume that your new test fails when using a prior version without your changes?

@DominikDostal
Copy link
Contributor Author

Yes

The results will be "slightly" different:

There was 1 failure:

1) PHPUnitTests\Integration\PageTest::testCmCommandInPdfs
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    0 => '0.75'
-    1 => '0.0'
-    2 => '0.0'
-    3 => '0.75'
-    4 => '59.16'
-    5 => '500.4'
+    0 => 1.0
+    1 => 0.0
+    2 => 0.0
+    3 => -1.0
+    4 => 78.88
+    5 => 126.56
 )

D:\source\pdfparser\tests\PHPUnit\Integration\PageTest.php:957

@k00ni

This comment was marked as outdated.

@k00ni k00ni merged commit 9609711 into smalot:master Jul 24, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants