-
Hello. The file is P48_29-30.pdf. My code is:

```python
import fitz
import os

PATH = 'C:/PyWorks/PDF_Python/PDF/'

if __name__ == "__main__":
    FileRead = PATH + 'P48_29-30.pdf'
    DocRd = fitz.open(FileRead)
    DocWr = fitz.open()  # new, empty output PDF
    PCounts = DocRd.pageCount
    for PageNumber in range(0, PCounts):
        PageRd = DocRd[PageNumber]
        PageWr = DocWr.newPage(width=PageRd.rect.width, height=PageRd.rect.height)

        # Insert text images (image blocks reported by the text dictionary)
        TextDict = PageRd.getText("dict")
        for block in TextDict['blocks']:
            if block["type"] == 1:
                PageWr.insertImage(block['bbox'], stream=block['image'])

        # Insert drawings
        paths = PageRd.getDrawings()
        shape = PageWr.newShape()
        for path in paths:
            for item in path["items"]:
                if item[0] == "l":  # line
                    shape.drawLine(item[1], item[2])
                elif item[0] == "re":  # rectangle
                    shape.drawRect(item[1])
                elif item[0] == "c":  # Bezier curve
                    shape.drawBezier(item[1], item[2], item[3], item[4])
                else:
                    raise ValueError("unhandled drawing", item)
            shape.finish(
                fill=path["fill"],
                color=path["color"],
                dashes=path["dashes"],
                even_odd=path["even_odd"],
                closePath=path["closePath"],
                lineJoin=path["lineJoin"],
                lineCap=max(path["lineCap"]),
                width=path["width"],
                stroke_opacity=path["opacity"],
                fill_opacity=path["opacity"]
            )
        shape.commit()

        # Insert images
        ImageList = DocRd.getPageImageList(PageNumber, full=True)
        for i, x in enumerate(ImageList):
            rect = PageRd.getImageBbox(x)
            xref = x[0]
            img = DocRd.extractImage(xref)
            # image = Image.open(io.BytesIO(img['image']))
            # image.show()
            if i == 0:
                # put the first image in the background
                PageWr.insertImage(rect, stream=img['image'], overlay=False)
            else:
                PageWr.insertImage(rect, stream=img['image'])

    # Save to file
    FileExtDel = os.path.splitext(FileRead)[0]
    FileOut = FileExtDel + '_imgall.pdf'
    DocWr.save(FileOut, deflate=True, garbage=3, clean=True)
    DocRd.close()
    DocWr.close()
```
-
There actually is no error! Your page copy process takes three types of content (image blocks from the text dictionary, drawings, and images) and regenerates each of them correctly on the new pages. Still, you do not get what you obviously want: the same visual appearance, just without the text. I experimented a little: I commented out each of these copy blocks separately and also changed their sequence.
If you change the sequence of these copy blocks, the page looks different!
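Why the sequence matters can be illustrated with a toy model. This is not PyMuPDF code, only a minimal sketch under the assumption that page content is painted in order, so that whatever is painted later covers what was painted earlier. The function `paint` and the one-dimensional "canvas" are invented for the illustration.

```python
def paint(width, layers):
    """Paint layers onto a 1-D canvas in the given order.

    Each layer is a (start, end, label) tuple; later layers overwrite
    earlier ones wherever they overlap.
    """
    canvas = ["."] * width
    for start, end, label in layers:
        for x in range(start, end):
            canvas[x] = label
    return "".join(canvas)


image = (0, 6, "I")     # hypothetical image covering columns 0..5
drawing = (4, 10, "D")  # hypothetical drawing covering columns 4..9

print(paint(10, [image, drawing]))  # IIIIDDDDDD  (drawing on top)
print(paint(10, [drawing, image]))  # IIIIIIDDDD  (image on top)
```

The same two objects produce a different result depending only on their order, which is the effect seen when the copy blocks of the script are rearranged.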
-
I do not know whether you are just experimenting or have a specific need.
-
My first question is how to get the missing borders?
-
The "border" around the spoons is not caused by any of the drawings. It may not even be a border, but the result of several images overlapping in such a way that they give the impression of one. This is hard to tell once all drawing and image-show commands have been pulled out of their original context (see my previous comments).

The following script removes all text from the pages, including the link at the bottom of the page, which has technically been enveloped in a watermark artifact.

```python
import fitz

doc = fitz.open("P48_29-30.pdf")
for page in doc:
    page.cleanContents()  # clean and unify page command syntax
    xref = page.getContents()[0]  # get the commands - now in a single source
    cont = bytearray(doc.xrefStream(xref))  # read it as a modifiable binary object
    s = 0  # position counter
    while s >= 0:
        s = cont.find(b"BT")  # search start / end of text object
        e = cont.find(b"ET", s)
        if min(s, e) >= 0:  # found one!
            cont[s : e + 2] = b""  # remove text object
        else:
            break  # no complete text object left
    s = 0  # reset position counter
    while s >= 0:
        s = cont.find(b"/Artifact")  # search start / end of watermark
        e = cont.find(b"EMC", s)
        if min(s, e) >= 0:
            cont[s : e + 3] = b""  # remove watermark object
        else:
            break  # no complete watermark object left
    doc.updateStream(xref, cont)  # write back updated stream
    page.cleanContents()  # clean again to remove now obsolete objects from PDF

doc.save(doc.name.replace(".pdf", "-no-text.pdf"), deflate=False, garbage=3)
```
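The byte-level core of the script above can be tried without any PDF at hand. The following is only a sketch: the helper name `strip_between` and the sample content stream are invented for illustration, and the sample is far simpler than what a real page contains. Note that such naive matching would also hit the bytes `BT` or `ET` if they happened to occur inside string or image data, so it should be treated as a heuristic.

```python
def strip_between(stream: bytes, start: bytes, end: bytes) -> bytes:
    """Remove every span from `start` through the next `end`, inclusive."""
    out = bytearray(stream)
    pos = 0
    while True:
        s = out.find(start, pos)
        if s < 0:
            break  # no further start marker
        e = out.find(end, s + len(start))
        if e < 0:
            break  # unbalanced marker: leave the remainder untouched
        del out[s : e + len(end)]
        pos = s  # continue searching where the removed span began
    return bytes(out)


# Hypothetical, hand-written content stream: an image, a text object,
# a rectangle, and a watermark artifact that wraps another text object.
sample = (
    b"q\n1 0 0 1 0 0 cm\n/Im0 Do\nQ\n"
    b"BT\n/F1 12 Tf\n72 700 Td\n(Hello) Tj\nET\n"
    b"0 0 100 50 re\nS\n"
    b"/Artifact <</Type /Watermark>> BDC\nBT\n(link) Tj\nET\nEMC\n"
)

no_text = strip_between(sample, b"BT", b"ET")
cleaned = strip_between(no_text, b"/Artifact", b"EMC")
print(cleaned.decode())
```

The image and rectangle commands survive, while both text objects and the artifact wrapper are gone. Unlike the loops in the script, this variant stops when a start marker has no matching end marker.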
-
Super, thanks. I only changed deflate to True, and the file size decreased from 440 KB to 170 KB.
-
I suggest moving your issue(s) to the new "Discussions" category - this better addresses what we are doing here.
-
New results using Pillow. Here is the script:

```python
import io

import fitz
from PIL import Image

doc = fitz.open("P48_29-30.pdf")
newdoc = fitz.open()
font = fitz.Font("helv")


def remove_text(page):
    """Strip text objects and watermark artifacts from the page."""
    doc = page.parent
    page.cleanContents()
    xref = page.getContents()[0]
    cont = bytearray(doc.xrefStream(xref))
    s = 0
    while s >= 0:
        s = cont.find(b"BT")  # search start / end of text object
        e = cont.find(b"ET", s)
        if min(s, e) >= 0:
            cont[s : e + 2] = b""  # remove text object
        else:
            break  # no complete text object left
    s = 0
    while s >= 0:
        s = cont.find(b"/Artifact")  # search start / end of watermark
        e = cont.find(b"EMC", s)
        if min(s, e) >= 0:
            cont[s : e + 3] = b""  # remove watermark object
        else:
            break  # no complete watermark object left
    doc.updateStream(xref, cont)
    page.cleanContents()


for page in doc:
    # extract the text first (flags=0: text blocks only), then remove it
    blocks = page.getText("dict", flags=0)["blocks"]
    remove_text(page)
    newpage = newdoc.newPage(width=page.rect.width, height=page.rect.height)
    tw = fitz.TextWriter(newpage.rect)
    # render the text-free page in grayscale and re-encode it as JPEG
    pix = page.getPixmap(colorspace=fitz.csGRAY)
    img = Image.frombytes("L", (pix.width, pix.height), pix.samples)
    bio = io.BytesIO()
    img.save(bio, format="JPEG")
    newpage.insertImage(newpage.rect, stream=bio.getvalue())
    # write the extracted text back on top of the image
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                tw.append(s["origin"], s["text"], font=font, fontsize=s["size"])
    tw.writeText(newpage)

newdoc.save("x.pdf", garbage=4, deflate=True)
```

And the resulting PDF:
-
Many thanks! This is my son's tutorial, and he will be able to turn pages much faster on the PocketBook.
-
Here is a version with all 4 Helvetica font weights:
-
Use …
There may be "geometry" changes outside text blocks that change the scaling. This then has an effect like a font size change. You could change the …
With the same method, but using the default …
However! I just realized: your last examples are all scanned, OCR-ed PDFs. They simply cannot be treated with what we have been discussing here.
-
bw-marker2.py does not work for all PDFs. Try this big one and see. I don't need a 100% guarantee, but at least a sign that it did not work, so that I can leave the file alone and not compress it. Thanks.
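One possible warning sign, offered only as a heuristic and not as part of bw-marker2.py: a scanned page typically consists of a single image that covers nearly the whole page. The sketch below works on plain `(x0, y0, x1, y1)` tuples; the function names and the 0.9 threshold are assumptions made for illustration. In a real script the rectangles would presumably come from `page.rect` and from `getImageBbox`, as used in the first script of this thread.

```python
def covered_fraction(page_rect, image_rect):
    """Fraction of the page area covered by one image rectangle."""
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_rect
    # Clip the image rectangle to the page before measuring it.
    w = max(0.0, min(px1, ix1) - max(px0, ix0))
    h = max(0.0, min(py1, iy1) - max(py0, iy0))
    page_area = (px1 - px0) * (py1 - py0)
    return (w * h) / page_area if page_area > 0 else 0.0


def looks_scanned(page_rect, image_rects, threshold=0.9):
    """True if any single image covers at least `threshold` of the page."""
    return any(covered_fraction(page_rect, r) >= threshold for r in image_rects)


a4 = (0, 0, 595, 842)  # hypothetical page size in points
print(looks_scanned(a4, [(0, 0, 595, 842)]))                         # True
print(looks_scanned(a4, [(50, 50, 250, 200), (300, 400, 500, 600)])) # False
```

A script could skip every file for which most pages are flagged this way and leave it uncompressed. Pages with a large background picture would be flagged as well, so this gives a sign, not a guarantee.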
Use page.getText("dict", flags=0)["blocks"]. This is a list of text (only, because of the flags value) block dictionaries. Each such dict contains a list of line dictionaries, each of which in turn contains a list of text "span" dictionaries. Consult the TextPage section of the documentation to see the details.
The important point is that a span contains text with completely identical font properties: name, fontsize, color, font characteristics (bold, italic, mono, ...) are all identical. So you should receive a span containing "46)" followed by a span with text "Велосипедист...".
If this is not the case (like here), then the creator coded the …
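The nesting described above (blocks, then lines, then spans) can be walked as in the following sketch. The sample dictionary is hand-written and heavily simplified; the font names and sizes in it are invented, and a real result of page.getText("dict", flags=0) carries more keys per span (for example the origin used in the script earlier in this thread).

```python
def iter_spans(page_dict):
    """Yield every span dictionary in reading order."""
    for block in page_dict["blocks"]:
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                yield span


# Hypothetical, simplified stand-in for the extracted page dictionary.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "46)", "font": "Helvetica-Bold", "size": 10.0},
                        {"text": "Велосипедист...", "font": "Helvetica", "size": 10.0},
                    ]
                }
            ],
        }
    ]
}

for span in iter_spans(sample):
    print(span["font"], span["size"], repr(span["text"]))
```

Printing font and size per span makes it easy to see where the font properties change, and whether the number and the following sentence really arrive as separate spans.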