crop pdf page not getting the expected result #980

zorzigio · 2021-03-29T12:29:34Z

zorzigio
Mar 29, 2021

I am trying to crop an area of a pdf and I am not able to get the expected result using the transformation matrix.

The position of the area I am trying to extract is relative to the bottom left corner of the page.
The page is also rotated by 90 deg.

In the code below, the first output page is cropped with a rectangle derived from the transformation matrix, and this does not work properly. The second output page is cropped with a rectangle I derived manually from the known rotation of the page, and this one extracts the area correctly.

import fitz

filename = './table test.pdf'
pno = 0
# table1
x0 = 480
y0 = 470
w = 741
h = 823

x1 = x0 + w
y1 = y0 + h

src = fitz.open(filename)
spage = src[pno]
oldrot = spage.rotation
m0 = spage.transformation_matrix
spage.set_rotation(0)
doc = fitz.open()  # empty output PDF
r = spage.rect  # input page rectangle
d = fitz.Rect(
    spage.cropbox_position,  # CropBox displacement if not
    spage.cropbox_position  # starting at (0, 0)
)
# using transformation matrix
rect1 = fitz.Rect(y0, x0, y0+h, x0+w)
m1 = spage.transformation_matrix
rect2 = rect1*m1
# knowing how the page is rotated
x0b = r.width - y1
x1b = x0b + h
y0b = r.height - x1
y1b = y0b + w
rect3 = fitz.Rect(x0b, y0b, x1b, y1b)
rects = [rect2, rect3]
for rect in rects:
    page = doc.new_page(
        -1,
        width=w,
        height=h,
    )
    page.show_pdf_page(
        page.rect,  # fill all new page with the image
        src,  # input document
        spage.number,  # input page number
        clip=rect,  # which part to use of input page
        rotate=-oldrot,
    )
doc.save(
    'test.pdf',
    garbage=3,
    deflate=True,
)

I would much prefer to use the transformation matrix, but I am not sure what I am doing wrong here.

Also, is there a method that deals with the page rotation automatically, so that I do not have to rotate the page back and forth?

table test.pdf

JorjMcKie · 2021-03-29T15:37:20Z

JorjMcKie
Mar 29, 2021
Maintainer

Allow me to convert this to a discussion.
There is no issue involved here.

JorjMcKie · 2021-03-29T15:57:32Z

JorjMcKie
Mar 29, 2021
Maintainer

You do not actually need the transformation_matrix here. You only ever need it if you want to convert (Py)MuPDF coordinates back to PDF coordinates, which is not the case here.

When you use clip rectangles for inserting parts of a source page, you of course need to know whether they are relative to the rotated or the unrotated source page.
Setting the source page rotation to 0 always works, whether or not the original rotation was zero.
If you have a clip in rotated coordinates, you can derotate or rotate it as you like: srcpage.rotation_matrix and srcpage.derotation_matrix exist for this.
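
To make the rotated versus unrotated distinction concrete, here is a minimal pure-Python sketch of what derotating a clip rectangle means for a page rotated by 90 degrees. It only models the geometry (top-left origin, y growing downward, clockwise rotation) and is not PyMuPDF's implementation. The function names and the page height of 842 are made up for illustration. In PyMuPDF itself, the corresponding step should be multiplying the rectangle by srcpage.derotation_matrix.

```python
def rotate_point_90(x, y, unrotated_height):
    """Map a point on the unrotated page to the page rotated by 90 degrees."""
    return (unrotated_height - y, x)


def derotate_point_90(xr, yr, unrotated_height):
    """Inverse mapping: point on the rotated page back to the unrotated page."""
    return (yr, unrotated_height - xr)


def derotate_rect_90(rect, unrotated_height):
    """Derotate a rectangle (x0, y0, x1, y1) and return it normalized."""
    xr0, yr0, xr1, yr1 = rect
    ax, ay = derotate_point_90(xr0, yr0, unrotated_height)
    bx, by = derotate_point_90(xr1, yr1, unrotated_height)
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))


# Hypothetical page: 595 x 842 points unrotated, displayed rotated by 90 degrees.
H = 842
clip_rotated = (100, 50, 300, 200)  # measured on the rotated page
clip_unrotated = derotate_rect_90(clip_rotated, H)
print(clip_unrotated)
```

Note that width and height of the clip swap roles after derotation, which is why the output page dimensions have to be chosen with the rotation in mind.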

The following script uses two clip rectangles that are known to be in derotated coordinates and that split the source page into top and bottom halves. The two resulting output pages are then rotated the same way the source page was (this also works if there was no original rotation):

import fitz

src = fitz.open("table.test.pdf")
srcpage = src[0]
old_rot = srcpage.rotation  # remember the original rotation
srcpage.set_rotation(0)  # work in unrotated coordinates
srcrect = srcpage.rect
# two clip rectangles, each covering one half of the unrotated page
top = fitz.Rect(0, srcrect.height/2, srcrect.width, srcrect.height)
btm = fitz.Rect(0, 0, srcrect.width, srcrect.height/2)
out = fitz.open()  # empty output PDF
# first output page: sized like its clip, then rotated like the source
outp = out.new_page(width=top.width, height=top.height)
outp.show_pdf_page(outp.rect, src, 0, clip=top)
outp.set_rotation(old_rot)
# second output page: same procedure for the other half
outp = out.new_page(width=btm.width, height=btm.height)
outp.show_pdf_page(outp.rect, src, 0, clip=btm)
outp.set_rotation(old_rot)
out.save("x.pdf", garbage=4, deflate=True)

1 reply

zorzigio Mar 29, 2021
Author

Hi @JorjMcKie, thanks for the reply and sorry for posting in the wrong section.

The origin of the PDF coordinates is located in the bottom-left corner of the page (same as my reference points in the example code above), so I thought I could use the transformation matrix to go from that coordinate system to the (Py)MuPDF one. But I guess I am missing something here?

JorjMcKie · 2021-03-29T21:09:32Z

JorjMcKie
Mar 29, 2021
Maintainer

Hi @zorzigio

> The origin of the pdf coordinates is located in the bottom-left corner of the page

No, that is only the case according to the PDF standard. (Py)MuPDF counts from the top-left corner. This is the reason why we have that transformation matrix.
But as I wrote, you normally do not need to deal with this. It does play a role if you, for example, directly read a /Rect via methods like doc.xref_object().
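
The difference between the two origins can be sketched in a few lines of plain Python. This assumes the simplest case, an unrotated page whose CropBox starts at (0, 0), where the conversion reduces to flipping the y axis. The helper name and the page height are hypothetical, and this is a model of the idea rather than the library's code.

```python
def pdf_rect_to_mupdf(rect, page_height):
    """Convert (x0, y0, x1, y1) from a bottom-left origin to a top-left origin."""
    x0, y0, x1, y1 = rect
    # Flipping y swaps the roles of the lower and the upper edge.
    return (x0, page_height - y1, x1, page_height - y0)


page_height = 842                # hypothetical page height in points
pdf_rect = (100, 700, 300, 800)  # e.g. a /Rect as stored in the PDF source
mupdf_rect = pdf_rect_to_mupdf(pdf_rect, page_height)
print(mupdf_rect)
```

Applying the same flip a second time returns the original rectangle, so under these assumptions one helper covers both directions.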

zorzigio · 2021-03-29T21:28:12Z

zorzigio
Mar 29, 2021
Author

Hi @JorjMcKie

Yes, I understand that the bottom-left corner is the origin according to the PDF standard.

And if the transformation matrix was created for transforming between these two coordinate systems, I think I was on the right track.

You see, in the code I originally posted, rect1 is defined with a bottom-left origin. I was using the transformation matrix to translate from that coordinate system to the (Py)MuPDF one so that I could crop the correct area (which results in the table on the left in the PDF).
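
For what it is worth, the manual derivation in the original post can be reproduced as two separate steps: first flip the y axis on the page as displayed, then undo the 90 degree rotation. The sketch below is plain Python, not PyMuPDF. It assumes the page rotation is 90 degrees and that the table coordinates are measured from the bottom-left corner of the page as displayed. The page dimensions are made up because the real ones are not stated in this thread, and the function name is hypothetical.

```python
def displayed_bottom_left_to_unrotated(rect, unrot_w, unrot_h):
    """Rect (x0, y0, x1, y1) with bottom-left origin on a page displayed with
    rotation 90 -> rect in unrotated, top-left-origin coordinates."""
    x0, y0, x1, y1 = rect
    # Step 1: flip y on the displayed page, whose height equals unrot_w.
    top, bottom = unrot_w - y1, unrot_w - y0
    # Step 2: undo the 90 degree rotation: (xr, yr) -> (yr, unrot_h - xr).
    return (top, unrot_h - x1, bottom, unrot_h - x0)


# Table position from the original post.
x0, y0, w, h = 480, 470, 741, 823
x1, y1 = x0 + w, y0 + h

# Hypothetical unrotated page size; the thread does not give the real one.
W, H = 1500, 2000

clip = displayed_bottom_left_to_unrotated((x0, y0, x1, y1), W, H)

# Same formulas as rect3 in the original post, with r.width = W and r.height = H.
manual = (W - y1, H - x1, W - y1 + h, H - x1 + w)
print(clip, clip == manual)
```

Under these assumptions the two-step result coincides with rect3, the rectangle the original post reports as cropping correctly.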

1 reply

JorjMcKie Mar 29, 2021
Maintainer

ok, I understand now.
