First version to support MuPDF v1.19.* #1325
-
Hi! Thanks for the release! Can you please add a bit more detail on how Tesseract works with PyMuPDF?
-
@victor-ab - just made a little test with overlapping images:
-
But you can make the effort to extract the page's images yourself one by one, OCR each of them, extract the text from the resulting intermediate 1-page PDFs, adjust the text coordinates based on the image's rectangle on the original page, and continue from there. In some cases this may even be the better way, because each image can be OCR-ed at its original size, which gives a better recognition rate.
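The coordinate adjustment step described above can be sketched in plain Python. This is only an illustration under stated assumptions: `map_bbox_to_page` is a hypothetical helper (not part of PyMuPDF), rectangles are plain `(x0, y0, x1, y1)` tuples, and the image is assumed to sit on the original page without rotation, so a linear scale plus offset is enough.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def map_bbox_to_page(word_bbox: BBox, ocr_page_rect: BBox, image_rect: BBox) -> BBox:
    """Map a word bbox from the intermediate OCR page onto the original page.

    word_bbox:     bbox of a word on the 1-page PDF produced by OCR
    ocr_page_rect: rectangle of that intermediate page
    image_rect:    rectangle the image occupies on the original page
    Assumption: the image is not rotated on the original page.
    """
    ox0, oy0, ox1, oy1 = ocr_page_rect
    ix0, iy0, ix1, iy1 = image_rect
    # scale factors between the OCR page and the image's display rectangle
    sx = (ix1 - ix0) / (ox1 - ox0)
    sy = (iy1 - iy0) / (oy1 - oy0)
    x0, y0, x1, y1 = word_bbox
    return (
        ix0 + (x0 - ox0) * sx,
        iy0 + (y0 - oy0) * sy,
        ix0 + (x1 - ox0) * sx,
        iy0 + (y1 - oy0) * sy,
    )
```

With PyMuPDF, the word bboxes would typically come from `get_text("words")` on the intermediate page, and the image rectangle from the original page.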
-
Got this error:
>>> import os
>>> os.environ['TESSDATA_PREFIX'] = "C:\\Program Files\\Tesseract-OCR"
>>> print(os.path.exists(os.environ['TESSDATA_PREFIX']+'/tessdata'))
True
>>> page = doc[0]
>>> tp = page.get_textpage_ocr(flags=3, language="eng")
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8528/1466852534.py in <module>
1 page = doc[0]
----> 2 tp = page.get_textpage_ocr(flags=3, language="eng")
~\mambaforge\envs\layoutlmv2\lib\site-packages\fitz\fitz.py in get_textpage_ocr(self, clip, flags, language)
5639 if not clip:
5640 clip = self.rect
-> 5641 textpage = self._get_text_page_ocr(clip, flags=flags, language=language)
5642 finally:
5643 if old_rotation != 0:
~\mambaforge\envs\layoutlmv2\lib\site-packages\fitz\fitz.py in _get_text_page_ocr(self, clip, flags, language)
5617 language: OptStr = None,
5618 ) -> "TextPage":
-> 5619 val = _fitz.Page__get_text_page_ocr(self, clip, flags, language)
5620 val.thisown = True
5621
RuntimeError: Tesseract initialisation failed
There's also a typo at
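The `os.path.exists` check in the session above only confirms that a `tessdata` folder exists below the prefix. It does not confirm that the language file is there, or that `TESSDATA_PREFIX` points at the folder that actually holds it. The following is a minimal diagnostic sketch. It assumes, without confirmation from this thread, that the failure is caused by the language data not being found. `find_tessdata_dir` is a hypothetical helper, not part of PyMuPDF. Tesseract language files are named `<language>.traineddata`.

```python
import os
from typing import Optional


def find_tessdata_dir(prefix: str, language: str = "eng") -> Optional[str]:
    """Return the folder (prefix itself or prefix/tessdata) that contains
    <language>.traineddata, or None if neither folder has it."""
    for candidate in (prefix, os.path.join(prefix, "tessdata")):
        if os.path.isfile(os.path.join(candidate, f"{language}.traineddata")):
            return candidate
    return None


if __name__ == "__main__":
    # Show which folder, if any, holds the English language data.
    print(find_tessdata_dir(os.environ.get("TESSDATA_PREFIX", "")))
```

If the helper returns the `tessdata` subfolder rather than the prefix itself, trying that folder as the value of `TESSDATA_PREFIX` is a reasonable next experiment.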
-
Just tested it. It generated font details for the OCR'd data, like
Besides that, what's the DPI? It seems too low; 300 would be better. It'd be nice to have the ability to set it.
-
It is MuPDF's argument - not mine. Here is a quote from said C file:
-
@victor-ab - I spent a few hours on a deep dive into the MuPDF code to find a way to improve the resolution for page OCR. Nothing seems to work. Their
There will be a
tp = page.get_textpage_ocr(clip=clip, language="eng", flags=flags, dpi=72)
# with this textpage, all text-oriented methods work at their usual speed:
rects = page.search_for("needle", textpage=tp)
xxx = page.get_text(option, textpage=tp)
# etc.
My new approach will OCR the full document page. So the font will be "GlyphLessFont" for everything.
-
Here is a script that performs "mixed" text extraction:
It has the following advantages:
Here is the script:
I will also publish this as a Jupyter notebook soon.
-
There is progress in supporting OCR directly for document pages. This will be available in v1.19.1.
textpage = page.get_textpage_ocr(
    clip=None,
    flags=...,
    dpi=72,
    language="eng",
    full=False,
)
All subsequent text searches and text extractions must use this textpage to access the OCR results.
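The rule that every later search and extraction must go through the OCR textpage can be wrapped in one small function. This is a sketch, not the author's code: `ocr_search_and_extract` is a hypothetical helper, `page` is expected to be a PyMuPDF `Page`, and the defaults `dpi=300` and `full=True` are illustrative choices only. It deliberately passes only `dpi`, `language` and `full`, since the final argument list may differ from the announcement above.

```python
def ocr_search_and_extract(page, needle, dpi=300, language="eng", full=True):
    """OCR the page once, then reuse the resulting textpage for both
    searching and text extraction.

    `page` is assumed to offer get_textpage_ocr(), search_for() and
    get_text(), as a PyMuPDF Page does.
    """
    # The expensive step: run OCR once and keep the textpage.
    tp = page.get_textpage_ocr(dpi=dpi, language=language, full=full)
    # Both calls read the OCR results from the same textpage.
    hits = page.search_for(needle, textpage=tp)
    text = page.get_text("text", textpage=tp)
    return hits, text
```

A call could look like `hits, text = ocr_search_and_extract(doc[0], "needle")`. Creating the textpage once and passing it on avoids repeating the OCR step for each search or extraction.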
-
@victor-ab - the new v1.19.1 is out. It contains the announced improvements for document page OCR - with one minor change:
-
Yes, that's why I decided to ignore the dpi argument in that case.
-
Ah, interesting case!
-
@victor-ab: The solution was easy - thanks for the file. I am modifying the OCR textpage method
Full page OCR
Stays as it is. All text will be in
Partial page OCR
Functionality to date stays the same. The following changes will be implemented:
The question I have for you:
-
Hi! Thanks for your work!
-
@RomaKoks - thanks for the feedback 👍. Please look at the documentation for method
If you have a scanned PDF and don't want to use the specialized OCRmyPDF, try this:
import fitz  # PyMuPDF

doc = fitz.open("scanned.pdf")  # hypothetical input file name
for page in doc:
    tp = page.get_textpage_ocr(...)  # OCR the page; fill in the arguments you need
    words = page.get_text("words", textpage=tp)
    for word in words:
        bbox = fitz.Rect(word[:4])  # rectangle of the recognized word
        # write the word as invisible text (render_mode=3) at the bottom-left corner
        page.insert_text(bbox.bl, word[4], fontname="cour", render_mode=3)
doc.ez_save("scanned-ocr.pdf")
Tesseract OCRs with its own font "GlyphLessFont", which is mono-spaced. So it is best to output the text as Courier - or some other mono font of your liking.
-
Introduces major new features like PDF journalling and OCR support by directly invoking Tesseract-OCR.
In addition, it is possible to detect whether objects are covered (hidden) by other objects.
As part of the new version, the following issues have been resolved:
#1313, #1311, #1290, #1286, #1287, #1284.
This discussion was created from the release First version to support MuPDF v1.19.*.