Making a compressed grayscale PDF version #769

bserg66 · 2020-12-15T14:12:25Z

bserg66
Dec 15, 2020

Hello.
I am trying to extract all images and drawings from P48_29-30.pdf and insert them into another file, P48_29-30_imgall.pdf.
But not all borders are inserted into P48_29-30_imgall.pdf (the one around a spoon on the first page and 3 borders on the second page).
Also, the color of the book on the first page and the look of the rectangles around the page numbers do not match the source.

P48_29-30.pdf
P48_29-30_imgall.pdf

My code is:

import fitz
import os

PATH = 'C:/PyWorks/PDF_Python/PDF/'

if __name__ == "__main__":
    FileRead = PATH + 'P48_29-30.pdf'
    DocRd = fitz.open(FileRead)
    DocWr = fitz.open()
    PCounts =  DocRd.pageCount
    for PageNumber in range(0,PCounts):
        PageRd = DocRd[PageNumber]
        PageWr = DocWr.newPage(width=PageRd.rect.width, height=PageRd.rect.height)

        # Insert text images 
        TextDict = PageRd.getText("dict")
        for block in TextDict['blocks']:
            if block["type"] == 1:
                PageWr.insertImage(block['bbox'], stream=block['image'])
        
        # Insert drawings 
        paths = PageRd.getDrawings()
        shape = PageWr.newShape()  
        for path in paths:
            for item in path["items"]: 
                if item[0] == "l":  
                    shape.drawLine(item[1], item[2])
                elif item[0] == "re": 
                    shape.drawRect(item[1])
                elif item[0] == "c":  
                    shape.drawBezier(item[1], item[2], item[3], item[4])
                else:
                    raise ValueError("unhandled drawing", item)
            shape.finish(
                fill=path["fill"],  
                color=path["color"],  
                dashes=path["dashes"],  
                even_odd=path["even_odd"],
                closePath=path["closePath"],  
                lineJoin=path["lineJoin"],  
                lineCap=max(path["lineCap"]),  
                width=path["width"],  
                stroke_opacity=path["opacity"],  
                fill_opacity=path["opacity"] 
                )
        shape.commit()

        # Insert images
        image_list = DocRd.getPageImageList(PageNumber, full=True)
        for i, item in enumerate(image_list):
            rect = PageRd.getImageBbox(item)
            xref = item[0]  # first entry of an image list item is the image xref
            img = DocRd.extractImage(xref)
            # image = Image.open(io.BytesIO(img['image']))
            # image.show()
            if i == 0:
                PageWr.insertImage(rect, stream=img['image'], overlay=False)
            else:    
                PageWr.insertImage(rect, stream=img['image'])         
        
    # Save to file
    FileExtDel = os.path.splitext(FileRead)[0]
    FileOut = FileExtDel + '_imgall.pdf' 
    DocWr.save(FileOut, deflate=True, garbage=3, clean=True) 
    DocRd.close()
    DocWr.close()

JorjMcKie · 2020-12-15T15:15:45Z

JorjMcKie
Dec 15, 2020
Maintainer

There actually is no error!
You have a complex PDF with pages containing drawings and 2 types of images: normal ones and inline images - and you copied all of these over correctly (and demonstrated great knowledge of this repo by the way - chapeau!).

Your page copy process takes these 3 types of content and regenerates each on new pages correctly. And indeed you do not get what you obviously want: the same visual appearance, but just without text.

I experimented a little and commented out each of these copy blocks separately and in addition also changed their sequence.
It then became obvious that the page appearance critically depends on the sequence of every single command contained in the source page!
What I mean is that the source page's appearance relies on things like

display inline image X
draw path Y
display external image Z

If you change the sequence of the above, the page will look different!
And exactly this inevitably happens if you execute all commands of category 1, then 2, and then 3 ... or 3, 2, 1 or some other permutation.
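
To see this ordering for yourself, here is a minimal sketch that lists the order in which a content stream paints images and paths. It is pure Python and works on a hand-written sample stream. The function name `paint_sequence` and the sample are made up for illustration, and the whitespace tokenizer is a simplification that could misread binary inline image data in a real file.

```python
# Path painting operators of the PDF content stream syntax (stroke / fill variants).
PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}


def paint_sequence(stream: bytes) -> list:
    """Return the order in which images and paths are painted.

    Simplification: tokens are split on whitespace, so binary inline image
    data or string contents could produce false hits in a real file.
    """
    order = []
    for token in stream.decode("latin-1").split():
        if token == "BI":  # start of an inline image
            order.append("inline image")
        elif token == "Do":  # paints an external object, typically an image
            order.append("external image")
        elif token in PAINT_OPS:  # a path gets stroked and / or filled
            order.append("path")
    return order


# Hypothetical page content: inline image, then a rectangle, then an external image.
sample = (
    b"q BI /W 1 /H 1 /CS /G /BPC 8 ID x EI Q "
    b"0 0 10 10 re S "
    b"q 10 0 0 10 0 0 cm /Im1 Do Q"
)
print(paint_sequence(sample))
```

Copying "all inline images, then all paths, then all external images" would turn any interleaved sequence like this one into a grouped one, which is why the copied page can look different.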

JorjMcKie · 2020-12-15T15:17:43Z

JorjMcKie
Dec 15, 2020
Maintainer

I do not know whether you are just experimenting or have a specific need.
For example, if you want to get rid of the text, there are other ways to achieve this ...

bserg66 · 2020-12-15T17:26:57Z

bserg66
Dec 15, 2020
Author

My first question is how to get the missing borders?
I want to get the exact look of a page without text.

JorjMcKie · 2020-12-15T21:08:16Z

JorjMcKie
Dec 15, 2020
Maintainer

My first question is how to get the missing borders?

The "border" around the spoons is not caused by any of the drawings. It may not even be a border, but the result of several images overlapping each other in such a manner that an impression like a border is generated. Hard to tell after all drawing and image-show commands are pulled out of their original context - see my previous comments.

The following script removes all text of the pages - including the link at the bottom of the page, which has technically been enveloped in a watermark artifact.

import fitz

doc = fitz.open("P48_29-30.pdf")
for page in doc:
    page.cleanContents()  # clean and unify page command syntax
    xref = page.getContents()[0]  # get the commands - now in a single source
    cont = bytearray(doc.xrefStream(xref))  # read it as a modifiable binary object
    while True:
        s = cont.find(b"BT")  # search start / end of text object
        e = cont.find(b"ET", s)
        if min(s, e) < 0:  # no complete text object left
            break
        cont[s : e + 2] = b""  # remove text object
    while True:
        s = cont.find(b"/Artifact")  # search start / end of watermark
        e = cont.find(b"EMC", s)
        if min(s, e) < 0:  # no complete watermark object left
            break
        cont[s : e + 3] = b""  # remove watermark object

    doc.updateStream(xref, cont)  # write back updated stream
    page.cleanContents()  # clean again to remove now obsolete objects from PDF

doc.save(doc.name.replace(".pdf", "-no-text.pdf"), deflate=False, garbage=3)

bserg66 · 2020-12-16T07:13:20Z

bserg66
Dec 16, 2020
Author

Super, thanks. I only changed to deflate=True, and the size decreased from 440 KB to 170 KB.
My aim is to compress the PDF from 48 MB (225 pages) to the minimum size, in grayscale with 4-bit color depth (the PocketBook displays 16 shades of gray). I found this PDF in color after pdf-tools processing (the producer field shows poppler), and its size is 15 MB.
First, I wanted to figure out how to reduce the color PDF from 48 MB to 15 MB.
Second, I use Ghostscript (grayscale, 4-bit color depth) and pdfsizeopt (garbage cleanup) to reduce the 15 MB color file to a 9 MB grayscale file with 4-bit color depth, which is a good size for the PocketBook.
Can I do all of this in PyMuPDF? I change the fonts to Helvetica. There are also some smaller questions.

JorjMcKie · 2020-12-16T10:26:37Z

JorjMcKie
Dec 16, 2020
Maintainer

I suggest moving your issue(s) to the new "Discussions" category - this better addresses what we are doing here.
Agree?

Independently from that:
You can do a few, but not all of the above with PyMuPDF:

you can replace fonts, there is this folder with appropriate scripts. Best use a new font that is subsettable (not Helvetica)

But here is another approach:

extract page text for later re-insertion
remove text from page
make a grayscale image of the emptied page
insert image to new output page
re-insert saved text

This has a few advantages:

there are no longer multiple images per page - just one
the page image can be further compressed, e.g. by using JPEG instead of PNG with the support of Pillow - need to check this.
freely choose appropriate font

Resulting PDF size for the 2 pages: 124 KB of which 26 KB is the Helvetica font. So we have about 50 KB per page on average. The final file size would therefore be 225 * 50 + 26 KB, which is about 11 MB.
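
The extrapolation above can be written as a small helper. This is only a sketch of the arithmetic; the function name `estimate_total_kb` is made up for illustration. It assumes the font is embedded once and that every page costs about as much as the sample pages.

```python
def estimate_total_kb(sample_kb, font_kb, sample_pages, total_pages):
    """Extrapolate the output size from a small sample file.

    The font is counted once; the remainder is spread evenly over the pages.
    """
    per_page_kb = (sample_kb - font_kb) / sample_pages
    return total_pages * per_page_kb + font_kb


# Figures from the 2-page sample: 124 KB in total, 26 KB of that is the font.
print(estimate_total_kb(124, 26, 2, 225))
```

With the unrounded per-page value of 49 KB the result comes out slightly below the rounded figure, but still at about 11 MB.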

JorjMcKie · 2020-12-16T10:37:46Z

JorjMcKie
Dec 16, 2020
Maintainer

New results using Pillow:

file size = 70.8 KB
average page size = (70.8 - 26) / 2 = 22.4 KB
expected size for 225 pages: 225 * 22.4 + 26 KB, a little over 5 MB

Here is the script:

import fitz
from PIL import Image
import io

doc = fitz.open("P48_29-30.pdf")
newdoc = fitz.open()
font = fitz.Font("helv")


def remove_text(page):
    doc = page.parent
    page.cleanContents()
    xref = page.getContents()[0]
    cont = bytearray(doc.xrefStream(xref))
    while True:
        s = cont.find(b"BT")  # search start / end of text object
        e = cont.find(b"ET", s)
        if min(s, e) < 0:  # no complete text object left
            break
        cont[s : e + 2] = b""  # remove text object
    while True:
        s = cont.find(b"/Artifact")  # search start / end of watermark
        e = cont.find(b"EMC", s)
        if min(s, e) < 0:  # no complete watermark object left
            break
        cont[s : e + 3] = b""  # remove watermark object

    doc.updateStream(xref, cont)
    page.cleanContents()


for page in doc:
    blocks = page.getText("dict", flags=0)["blocks"]
    remove_text(page)
    newpage = newdoc.newPage(width=page.rect.width, height=page.rect.height)
    tw = fitz.TextWriter(newpage.rect)
    pix = page.getPixmap(colorspace=fitz.csGRAY)
    img = Image.frombytes("L", [pix.width, pix.height], pix.samples)
    bio = io.BytesIO()
    img.save(bio, format="JPEG")
    newpage.insertImage(newpage.rect, stream=bio.getvalue())
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                tw.append(s["origin"], s["text"], font=font, fontsize=s["size"])
    tw.writeText(newpage)

newdoc.save("x.pdf", garbage=4, deflate=True)

And the resulting PDF:
x.pdf

JorjMcKie · 2020-12-16T11:09:15Z

JorjMcKie
Dec 16, 2020
Maintainer

Pillow also offers some more optimization options - there is an optimize parameter in save(), which further reduces the x.pdf file size to 67 KB.
When creating the page's pixmap, you can also play with the resolution by using a matrix: pix = page.getPixmap(colorspace=fitz.csGRAY, matrix=fitz.Matrix(zoom, zoom)). zoom < 1 will decrease image quality and size (quadratic impact!).
Depending on how exactly you want to recreate the text, you can differentiate between Helvetica normal, italic, etc. This will increase the final size, but with a one-time impact per font weight only. You should still remain under 5 MB.
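
To illustrate the quadratic impact of zoom, here is a minimal sketch that estimates the pixel dimensions of a rendered page. It assumes that zoom 1 maps one point to one pixel and ignores the exact rounding the library applies. The function name and the 400 x 600 point page are made up for illustration.

```python
def pixmap_size(width_pt, height_pt, zoom):
    """Approximate pixel size of a page rendered with Matrix(zoom, zoom)."""
    return int(round(width_pt * zoom)), int(round(height_pt * zoom))


full_w, full_h = pixmap_size(400, 600, 1.0)  # hypothetical page size in points
half_w, half_h = pixmap_size(400, 600, 0.5)

# Halving the zoom halves both dimensions, so only a quarter of the pixels remain.
print(full_w * full_h, half_w * half_h)
print((half_w * half_h) / (full_w * full_h))
```

The JPEG size does not follow the pixel count exactly, but the pixel count sets the amount of data the compressor starts from.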

bserg66 · 2020-12-16T12:54:05Z

bserg66
Dec 16, 2020
Author

Many thanks! This is my son's textbook, and he will be able to turn pages much faster on the PocketBook.
I agree to moving my issue(s) to the new "Discussions" category. Maybe call it "Pdf compress for pocketbook".
I hope this can be useful to someone else.
I use all the Helvetica variants - "helv", "heit", "hebo", "hebi".
I will work through your answer, and if I still have questions about working with the text, I will write to you.

1 reply

JorjMcKie Dec 16, 2020
Maintainer

ok - done. See next post which uses multiple font weights of Helvetica.

JorjMcKie · 2020-12-16T13:21:02Z

JorjMcKie
Dec 16, 2020
Maintainer

here is a version with all 4 Helvetica font weights:
bw-maker.zip

10 replies

bserg66 Dec 24, 2020
Author

Yes, PragmaticaC helped in the first case, but not in the others.

JorjMcKie Dec 24, 2020
Maintainer

I found that out by checking which fonts are actually used ... in cases where the Helvetica result did not look good.
Probably you just need to do the same.
Interestingly, not only Helvetica has these problems, but also other fonts I usually use: Fira Go, Noto Sans, etc.
Maybe the Ukrainian Cyrillic is a bit different from the standard (Russian) Cyrillic? I certainly cannot tell.

bserg66 Dec 26, 2020
Author

In the second file, ZR-P5.pdf, fonts = ['AAAAJV+SCHOOLBOOKC', 'AAHLXF+SCHOOLBOOKC-BOLD', 'BKSOJN+SCHOOLBOOKC-ITALIC', 'ARIAL', 'JNAANV+ARIAL']. I tried all of these fonts from https://ukrfonts.com/index.php?v=20 (whole fonts, without subsetting), but with no results:
ZR_P5_Arial.pdf
ZR_P5_SchoolBookC.pdf

JorjMcKie Dec 26, 2020
Maintainer

Replacing the fonts obviously is a never ending story and mostly leads to unsatisfactory results.
I have a new script. This one leaves the text as it is and only changes non-text to gray. Your two examples look good and the compression rate is between 66% and 90%:
bw-maker2.zip

Note: colored text is no file size problem - as opposed to colored images ...
UkrLit_P4_OCR-new.pdf
ZR_P5-new.pdf

bserg66 Dec 27, 2020
Author

PDFs that have not been through OCR look ideal thanks to your clever idea. I finally realized that for correct OCR I need to install language packs for Tesseract :-). I took a simple text PDF file, commented out everything related to image extraction in your first script and got this:
Uz_P10_OCR_Gray.pdf - 37.7 KB
Input file:
Uz_P10_OCR.pdf - 232 KB
Questions:

Can I extract font information from this pdf (bold for digits)?
Why font size not correct for all letters?
How can I determine that the page contains only text without images (there is an image - a white background, but it is not needed)

JorjMcKie · 2020-12-27T12:57:00Z

JorjMcKie
Dec 27, 2020
Maintainer

Can I extract font information from this pdf (bold for digits)?

Use page.getText("dict", flags=0)["blocks"]. This is a list of text (only, because of the flags value) block dictionaries. Each such dict contains a list of line dictionaries, which in turn contain a list of text "span" dictionaries. Consult the TextPage section of the documentation to see the details.
The important point is that a span contains text with completely identical font properties: name, fontsize, color, font characteristics (bold, italic, mono, ...) are all identical. So you should receive a span containing "46)" followed by a span with text "Велосипедист...".
If this is not the case (like here), then the creator coded the PDF text in such a way that not all text-relevant information was included inside a pair of strings b"BT" / b"ET". So my script will delete these parts of the information.
You can simulate bold text by using the text rendering command 2 Tr (= "fill AND stroke text"). If 2 Tr comes before b"BT" my script will delete it.
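
As a minimal sketch of walking this structure: the helper below runs on a hand-made list shaped like the "blocks" value, so it needs no PDF. The function name `span_styles`, the font names and the flag values in the sample are assumptions for illustration. The bold bit (value 16) follows the span "flags" description in the PyMuPDF documentation.

```python
BOLD = 16  # bit 4 of a span's "flags" value marks a bold font


def span_styles(blocks):
    """Yield (text, font, size, is_bold) for every span of every text block."""
    for block in blocks:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                is_bold = bool(span["flags"] & BOLD)
                yield span["text"], span["font"], span["size"], is_bold


# Hypothetical extraction result: a bold number followed by regular text.
sample_blocks = [
    {"type": 0, "lines": [{"spans": [
        {"text": "46)", "font": "SchoolBookC-Bold", "size": 10.0, "flags": 20},
        {"text": "Велосипедист", "font": "SchoolBookC", "size": 10.0, "flags": 4},
    ]}]},
]

for style in span_styles(sample_blocks):
    print(style)
```

With a real page, the input would be page.getText("dict", flags=0)["blocks"]. If bold is simulated with 2 Tr as described above, the flag will not show it.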

Why font size not correct for all letters?

There may be "geometry" changes outside text blocks, that change the scaling. This then has an effect like a fontsize change. You could change the only_text() method by also accepting these command lines:

equals q or Q (stack / unstack "graphics state")
endswith cm = matrix in PDF format - may change the geometry.
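
The two conditions above could be expressed as a small predicate. This is only a sketch: only_text() itself is not shown in this thread, so the function name `is_geometry_line` and the way it would be called are hypothetical. It assumes a cleaned content stream with one command per line.

```python
def is_geometry_line(line: bytes) -> bool:
    """Return True for command lines that save / restore or change the geometry."""
    line = line.strip()
    if line in (b"q", b"Q"):  # stack / unstack the graphics state
        return True
    return line.endswith(b" cm")  # matrix command, may change the scaling


# Hypothetical lines as they might appear in a cleaned content stream.
lines = [b"q", b"1 0 0 1 50 700 cm", b"0 0 10 10 re", b"/Im1 Do", b"Q"]
print([line for line in lines if is_geometry_line(line)])
```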

How can I determine that the page contains only text without images (there is an image - a white background, but it is not needed)

With the same method, but using the default flags value, any images on the page will also be extracted - detectable by block["type"] == 1. Text blocks have block type 0.
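
A minimal sketch of that check, run on a hand-made list shaped like the "blocks" value. The function name `block_type_counts` and the sample data are made up for illustration. Note that it only counts image blocks; it cannot tell whether an image is just an unneeded white background.

```python
def block_type_counts(blocks):
    """Return (number of text blocks, number of image blocks)."""
    n_text = sum(1 for block in blocks if block["type"] == 0)
    n_image = sum(1 for block in blocks if block["type"] == 1)
    return n_text, n_image


# Hypothetical page: two text blocks and one (background) image block.
sample_blocks = [{"type": 0}, {"type": 1}, {"type": 0}]
n_text, n_image = block_type_counts(sample_blocks)
print("text only" if n_image == 0 else f"{n_image} image block(s) found")
```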

However!

I just realized: your last examples are all scanned, OCR-ed PDFs. They simply cannot be treated with what we have been discussing here.
Tesseract specifically uses a specially defined font without glyphs (i.e. undefined visual appearance) and puts the recognized text invisibly (render mode 3) at places where it found the corresponding text image. The purpose is to make a PDF searchable.
There is just no hope to get what you want in these cases.
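
One rough way to spot such pages is to look for the render mode command 3 Tr in the decompressed page content. The sketch below is pure Python and runs on sample byte strings. The function name `has_invisible_text` and the samples are made up for illustration, and a plain pattern search can give false hits if binary data happens to contain the same bytes.

```python
import re

# "3 Tr" sets text render mode 3 (invisible); reject longer numbers like "13" or "0.3".
INVISIBLE_TR = re.compile(rb"(?<![0-9.])3\s+Tr\b")


def has_invisible_text(content: bytes) -> bool:
    """Return True if the content stream switches to invisible text rendering."""
    return INVISIBLE_TR.search(content) is not None


# Hypothetical content snippets: OCR-style invisible text versus simulated bold.
ocr_like = b"BT 3 Tr /F1 10 Tf 50 700 Td (text) Tj ET"
bold_like = b"BT 2 Tr /F1 10 Tf 50 700 Td (text) Tj ET"
print(has_invisible_text(ocr_like), has_invisible_text(bold_like))
```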

2 replies

JorjMcKie Dec 27, 2020
Maintainer

Script bw-maker.py makes the text visible, because it can extract invisible text ... but may not reflect all the formatting ... Because the text was scanned, there are zillions of reasons why the Tesseract parser may determine a different fontsize, ignore apparent boldness or italics and what not.
Script bw-maker2.py shows the original scanned image only, because the recognized text is invisible.

bserg66 Dec 28, 2020
Author

Aaa! Thanks for the answer, I now understand what invisible text after OCR is and why it is there.
I had read about invisible text, but only a practical example and your answer gave me an understanding of its essence.
Thanks again, and you can end the discussion.

dgrunspan · 2021-12-06T09:58:55Z

dgrunspan
Dec 6, 2021

bw-maker2.py does not work for all PDFs.

Try this big one and see:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5766240/pdf/pone.0191194.pdf

I don't need a 100% guarantee, but at least a sign that it did not work, so that I can leave this file alone and not compress it.

Thx
