° display as � + some other scrambled chars #869

gbrault · 2021-01-28T18:36:29Z


Hello All,

I seem to have a character encoding problem.

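import fitz  # this is pymupdf

# Concatenate the plain text of every page in the document.
with fitz.open('456_PDFsam_Paulus_THE book of biocides (1).pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()  # getText() is the older, deprecated name

print(text)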

Some characters are not displayed correctly:

Microbicide group (substance class)
1. ALCOHOLS
Chemical name
1.17. 3-(4-Chlorophenoxy)-1,2-propanediol
Chemical formula
C9H11ClO3
Structural formula
Molecular mass
202.64
CAS-No.
104-29-0
EEC-No.
50
Synonym/common name
p-chlorophenyl-a-glycerolether, Chlorphenesin
Supplier
LENTIA
Chemical and physical properties
Appearance
white crystals with faint phenolic odour
Content (%)
approx.100
Boiling point/range �C (2.6 kPa)
214–215
Melting point �C
80

import fitz
doc = fitz.open('456_PDFsam_Paulus_THE book of biocides (1).pdf')
page0 = doc[0]  # first page
doc.get_page_fonts(page0.number)  # fonts referenced by that page

Result

[(103, 'cff', 'Type1', 'DHGEIO+AdvSTP_PSTimB', 'F1', ''),
 (104, 'cff', 'Type1', 'DHGENN+AdvSTP_PSTimR', 'F2', ''),
 (105, 'cff', 'Type1', 'DHGFEM+AdvSTP_PSTimI', 'F3', ''),
 (106, 'cff', 'Type1', 'DHGHAD+AdvSTP_TIMSC', 'F5', ''),
 (107, 'cff', 'Type1', 'DHGHCD+AdvP4C4E74', 'F6', ''),
 (108, 'cff', 'Type1', 'DHGIIM+AdvPSMP10', 'F7', '')]

Is there a way to get the right encoding? Any ideas?
456_PDFsam_Paulus_THE book of biocides (1).pdf

Answered by JorjMcKie

Mar 20, 2021

JorjMcKie · 2021-01-28T20:43:53Z

Maintainer

This is one of those cases where the PDF creator did not provide a valid Unicode value for the character. The graphical appearance (glyph) may look fine on the page, but it does not have to correspond to a valid Unicode character.
This just happens sometimes, and nothing can be done about it during text extraction. In this case it is also not due to a lack of capabilities in MuPDF (which does occur under rare conditions); it is a general problem that also shows up with:

mutool draw -o text.txt file.pdf 1 (base library of PyMuPDF)

pdftotext file.pdf (i.e. base library of Poppler)
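
To check how widespread the problem is in an extracted text, one can look for the Unicode replacement character. The following is only a minimal sketch: it assumes the unmapped glyphs come out as U+FFFD, as in the output shown in the question, and the function name `lines_with_unmapped_chars` is made up for this illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: unmapped glyphs are extracted as U+FFFD


def lines_with_unmapped_chars(text):
    """Return (line_number, line) pairs for every line containing U+FFFD."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if REPLACEMENT_CHAR in line
    ]


# Sample taken from the extraction result quoted in the question.
sample = (
    "Boiling point/range \ufffdC (2.6 kPa)\n"
    "214\u2013215\n"
    "Melting point \ufffdC\n"
    "80"
)

for number, line in lines_with_unmapped_chars(sample):
    print(number, repr(line))
```

The same function can be applied to the string returned by `page.get_text()` to list the affected lines per page.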

gbrault Jan 29, 2021
Author

Many thanks @JorjMcKie

JorjMcKie · 2021-03-20T18:22:33Z

Maintainer

Coming back to this after quite some time.
I have been experimenting with using Tesseract OCR together with PyMuPDF and tried your document with it again. Remember that all the °C were incorrectly coded in it?
Well, here is a script that extracts the text, detects whether a line contains uninterpreted characters and, if so, invokes OCR to make that line readable again.
It requires that Tesseract OCR is installed and can be invoked via Python's subprocess module.
The material is attached below. Because of the OCR invocations (about 80 across all pages), the total run time on my machine is about 30 seconds.
Maybe it helps.
issue-869.zip
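
For readers without the attachment, the approach described above could look roughly like this. This is a sketch under stated assumptions, not the script from issue-869.zip. It assumes a recent PyMuPDF (the `dpi` argument of `get_pixmap` and `Pixmap.tobytes`), that `clip` accepts the line's bbox tuple directly, a `tesseract` executable on the PATH, and that uninterpreted characters appear as U+FFFD. The names `ocr_png` and `extract_page_text` are made up for this illustration.

```python
import subprocess

REPLACEMENT_CHAR = "\ufffd"  # assumption: unmapped glyphs are extracted as U+FFFD


def ocr_png(png_bytes, lang="eng"):
    """Run the Tesseract CLI on one PNG image and return the recognized text.

    "stdin" / "stdout" make Tesseract read the image from standard input and
    write the text to standard output; "--psm 7" treats the image as a single
    text line.
    """
    result = subprocess.run(
        ["tesseract", "stdin", "stdout", "--psm", "7", "-l", lang],
        input=png_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return result.stdout.decode("utf-8").strip()


def extract_page_text(page, ocr=ocr_png, dpi=300):
    """Return the text of a PyMuPDF page, using OCR for lines with U+FFFD."""
    out = []
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            if REPLACEMENT_CHAR in text:
                # Render only this line's rectangle and let OCR read it.
                pix = page.get_pixmap(dpi=dpi, clip=line["bbox"])
                text = ocr(pix.tobytes("png"))
            out.append(text)
    return "\n".join(out)


# Usage (requires PyMuPDF and a Tesseract installation):
#   import fitz
#   with fitz.open("input.pdf") as doc:
#       print("\n".join(extract_page_text(page) for page in doc))
```

Only lines that actually contain an uninterpreted character are sent to OCR, which keeps the number of Tesseract invocations small. The `ocr` parameter is there so that a different OCR backend can be swapped in.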
