° display as � + some other scrambled chars #869
Hello all, I seem to have a character encoding problem. I extract the text with PyMuPDF like this:

    import fitz  # this is PyMuPDF

    with fitz.open('456_PDFsam_Paulus_THE book of biocides (1).pdf') as doc:
        text = ""
        for page in doc:
            text += page.get_text()
    print(text)

Some characters are not displayed correctly. I also looked at the fonts used on the first page:

    import fitz

    doc = fitz.open('456_PDFsam_Paulus_THE book of biocides (1).pdf')
    pages = [p for p in doc]
    page0 = pages[0]
    doc.get_page_fonts(page0.number)

Is there a way to get the right encoding? Any ideas?
Replies: 2 comments 1 reply
Coming back to this after quite some time.
I have been experimenting with using Tesseract OCR together with PyMuPDF and tried your document with it again. Remember that all the °C were incorrectly encoded in it?
Well, here is a script that extracts the text, detects whether a line contains uninterpreted characters and, if so, invokes OCR to make that line readable again.
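To give a rough idea of that flow, here is a minimal sketch. It is not the attached script. It assumes the uninterpreted characters show up as U+FFFD (the "�" from the title), and the names `iter_lines`, `extract_with_fallback` and the `ocr` callback are made up for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: unmapped glyphs are extracted as U+FFFD


def iter_lines(page):
    """Yield (text, bbox) for every text line of a PyMuPDF page."""
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            yield text, line["bbox"]


def extract_with_fallback(page, ocr):
    """Return the page text, re-reading damaged lines via `ocr`.

    `ocr` is a hypothetical callback taking (page, bbox) and returning
    the recognized text for that rectangle.
    """
    lines = []
    for text, bbox in iter_lines(page):
        if REPLACEMENT_CHAR in text:
            # Only lines with uninterpreted characters pay the OCR cost.
            text = ocr(page, bbox)
        lines.append(text)
    return "\n".join(lines)
```

Running OCR only on the damaged lines keeps the normally extracted text for everything else, which is both faster and usually more accurate than running OCR on whole pages.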
A dependency is that Tesseract OCR is installed and can be invoked via Python's subprocess module. Here is the material. Because of the OCR invocations (about 80 across all pages), the total run time on my machine is about 30 seconds.
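For readers who want to see what such a subprocess call might look like, here is a hedged sketch. It is not taken from the attached script. The helper names, the zoom factor of 3, the `eng` language and page segmentation mode 7 (single text line) are all assumptions for illustration.

```python
import subprocess


def tesseract_command(lang="eng", psm=7):
    """Build a Tesseract call that reads an image from stdin and writes text to stdout."""
    return ["tesseract", "stdin", "stdout", "--psm", str(psm), "-l", lang]


def ocr_image_bytes(png_bytes, lang="eng"):
    """Run Tesseract on PNG data and return the recognized text."""
    result = subprocess.run(
        tesseract_command(lang),
        input=png_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=True,  # raise if Tesseract is missing from PATH or fails
    )
    return result.stdout.decode("utf-8").strip()


def ocr_clip(page, bbox, zoom=3):
    """Render one rectangle of a PyMuPDF page and run OCR on it."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    # Render at a higher resolution; small glyphs are otherwise hard to recognize.
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(bbox))
    return ocr_image_bytes(pix.tobytes("png"))
```

Each call starts a new Tesseract process, so the run time grows with the number of damaged lines, which fits the roughly 80 invocations mentioned above.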
Maybe it helps.
issue-869.zip