Getting Unicode Block after the pdf conversion #1466

YashMistry349 · 2021-12-16T11:51:53Z

YashMistry349
Dec 16, 2021

Hey,

I want to extract the text from a PDF, and I am using the code below to do so.

Code Snippet:

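import fitz  # PyMuPDF

pdf_file = 'old_pdf.pdf'

# Open the original document and convert all of its pages to a new PDF in memory.
old_pdf = fitz.open(pdf_file, filetype='pdf')
# to_page is the 0-based number of the last page, hence page_count - 1.
pdf_bytes = old_pdf.convert_to_pdf(from_page=0, to_page=old_pdf.page_count - 1)

# Re-open the converted bytes and extract the text blocks of the first page.
new_pdf = fitz.open(stream=pdf_bytes, filetype='pdf')
blocks = new_pdf[0].get_text('dict')['blocks']
print(blocks)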

What is actually happening?
If I try to extract the text from the PDF after conversion, the text is not extracted properly (I am getting Unicode blocks instead of the actual text).

NOTE: I have to convert the pdf into a new pdf because of this issue.

For reference, I shared the document in a personal email. [email protected]

Can you please do the needful?

Thank you.

System Specification:

Ubuntu 20.04.3 LTS
Python 3.8.10
PyMuPDF 1.19.0

JorjMcKie · 2021-12-16T12:33:50Z

JorjMcKie
Dec 16, 2021
Maintainer

This is a MuPDF issue:
The file contains fonts that are not supported by PDF-to-PDF conversion. Reproduce this by:

mutool draw -o test.pdf old_pdf.pdf
page old_pdf.pdf 1
warning: cannot create ToUnicode mapping for DOTHRM+ArialMT
warning: cannot create ToUnicode mapping for ODHCTE+Ubuntu-Regular

page old_pdf.pdf 2


YashMistry349 · 2021-12-16T12:38:37Z

YashMistry349
Dec 16, 2021
Author

How can we use the alternate font if some font is not supported?


JorjMcKie · 2021-12-16T12:58:43Z

JorjMcKie
Dec 16, 2021
Maintainer

> How can we use the alternate font if some font is not supported?

Hm, you have to:

1. create a new PDF replacing the old fonts
2. do the planned conversion

There are two scripts here: repl-fontnames.py and repl-font.py.

First run python repl-fontnames.py old_pdf.pdf.
This will produce a JSON file old_pdf.pdf-fontnames.json like this:

[
  {
    "oldfont": [
      "ArialMT"
    ],
    "newfont": "keep",
    "info": "92 glyphs, size 31580, serifed, subset font"
  },
  {
    "oldfont": [
      "Ubuntu-Regular"
    ],
    "newfont": "keep",
    "info": "261 glyphs, size 22044, serifed, subset font"
  }
]

Edit this file and replace each "keep" with the font name you would like instead; in this case, "helv" is the best choice:

[
  {
    "oldfont": [
      "ArialMT"
    ],
    "newfont": "helv",
    "info": "92 glyphs, size 31580, serifed, subset font"
  },
  {
    "oldfont": [
      "Ubuntu-Regular"
    ],
    "newfont": "helv",
    "info": "261 glyphs, size 22044, serifed, subset font"
  }
]
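
If editing the file by hand is inconvenient (for example, when many PDFs need the same treatment), the change could probably be scripted. The following is only a sketch: the helper names choose_replacement and rewrite_fontnames_file are hypothetical and not part of the repl-font scripts, and it assumes the JSON layout shown above (a list of objects with "oldfont", "newfont" and "info" keys).

```python
import json


def choose_replacement(entries, newfont="helv", only=None):
    """Return a copy of the font list with "keep" replaced by `newfont`.

    `entries` is the parsed JSON list shown above. If `only` is given, just the
    entries mentioning one of those old font names are changed.
    """
    updated = []
    for entry in entries:
        entry = dict(entry)  # shallow copy, so the caller's list is left untouched
        wanted = only is None or any(name in only for name in entry["oldfont"])
        if entry["newfont"] == "keep" and wanted:
            entry["newfont"] = newfont
        updated.append(entry)
    return updated


def rewrite_fontnames_file(json_path, newfont="helv"):
    """Hypothetical helper: apply choose_replacement() to the JSON file in place."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(choose_replacement(entries, newfont), f, indent=2)


# Possible usage, with the file name from this thread:
# rewrite_fontnames_file("old_pdf.pdf-fontnames.json", newfont="helv")
```

Entries that already name a replacement font are left alone, so running it twice does no harm.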

Then run python repl-font.py old_pdf.pdf. The output looks like this:

py repl-font.py old-pdf.pdf
Processing PDF 'old-pdf.pdf' with 2 pages.

Phase 1: Analyze use of fonts.
Font replacement overview:
        ArialMT replaced by: Helvetica.
 Ubuntu-Regular replaced by: Helvetica.

Phase 2: Rebuild document with new fonts.
PHase 3: Build font subsets.
No fonts to subset.

Timings
          Analyzing: 0.018 seconds
         Rebuilding: 1.009 seconds
    Font subsetting: 0.001 seconds
             Saving: 0.023 seconds
         Total time: 1.051 seconds

The resulting PDF old_pdf-new.pdf can then be converted successfully.
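
To check whether the text of a converted file is readable, one option is to scan the output of get_text('dict') for unreadable characters. The sketch below assumes that the "Unicode blocks" from the original question are the Unicode replacement character U+FFFD, which is a guess and not confirmed in this thread. The helper names span_texts and unreadable_ratio are made up for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: this is the "Unicode block" being seen


def span_texts(blocks):
    """Yield (font, text) for every text span in a get_text('dict') block list."""
    for block in blocks:
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span.get("font", ""), span.get("text", "")


def unreadable_ratio(blocks):
    """Return the fraction of extracted characters that are U+FFFD."""
    total = bad = 0
    for _font, text in span_texts(blocks):
        total += len(text)
        bad += text.count(REPLACEMENT_CHAR)
    return bad / total if total else 0.0


# Possible usage with PyMuPDF, using the file name from this thread:
# import fitz
# doc = fitz.open("old_pdf-new.pdf")
# new_pdf = fitz.open(stream=doc.convert_to_pdf(), filetype="pdf")
# print(unreadable_ratio(new_pdf[0].get_text("dict")["blocks"]))
```

A ratio close to 0 suggests the conversion kept the text intact, while a high ratio points to the font problem described above.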


JorjMcKie · 2021-12-16T13:04:09Z

JorjMcKie
Dec 16, 2021
Maintainer

In the next version, I will issue a warning if convert_to_pdf runs into problems. MuPDF already generates those warnings today, but they land in fitz.TOOLS.mupdf_warnings() and may go unnoticed unless someone actively looks there.
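
Until then, the warning store can be inspected by hand after a conversion. This is a rough sketch: fonts_without_tounicode and convert_and_check are hypothetical helper names, and the regular expression assumes the warning text has the same wording as the mutool output quoted earlier in this thread.

```python
import re

# Assumed wording, taken from the mutool output quoted earlier in this thread.
_TOUNICODE_WARNING = re.compile(r"cannot create ToUnicode mapping for (\S+)")


def fonts_without_tounicode(warnings):
    """Return the font names named in ToUnicode warnings, in order, without duplicates."""
    fonts = []
    for name in _TOUNICODE_WARNING.findall(warnings):
        base = name.split("+", 1)[-1]  # drop a subset prefix such as "DOTHRM+"
        if base not in fonts:
            fonts.append(base)
    return fonts


def convert_and_check(path):
    """Convert `path` to PDF bytes and report fonts MuPDF warned about."""
    import fitz  # PyMuPDF, imported here so the parser above works without it

    fitz.TOOLS.reset_mupdf_warnings()  # start with an empty warning store
    doc = fitz.open(path)
    pdf_bytes = doc.convert_to_pdf()
    return pdf_bytes, fonts_without_tounicode(fitz.TOOLS.mupdf_warnings())


# Possible usage:
# pdf_bytes, problem_fonts = convert_and_check("old_pdf.pdf")
# if problem_fonts:
#     print("Text extraction may fail for:", problem_fonts)
```

The returned names are the ones to look for in the JSON file produced by repl-fontnames.py.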
