A user-friendly example for a scanned multipage PDF needed #67
Comments
Thank you for the report. I agree with all three points. I had long planned to build more user-friendly tooling around this technology but haven't gotten to it yet. This is integrated with the Archive.org stack, where I also wrote the entire OCR module (which is FOSS); I'd have to somehow port that and then tie it in to the PDF compression too. The PDF can be compressed without hOCR, but not yet with the current tooling. Regarding your suggestions:
There are a few more things to say on this:
I processed my scanned document with But an attempt at retrieving hOCR file fails:
Okay, the tooling is really underdocumented. :-( You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The So perhaps you could just try to call
Thank you for this interesting project, which seems to fit my needs exactly, but so far I have not been able to make it work. In the README.md there is an example command line, but its use is far from straightforward.
Running just
recode_pdf --from-pdf scan.pdf --out-pdf TEST.pdf
without any hOCR file throws a confusing AttributeError: 'NoneType' object has no attribute 'seek'. I actually tried reinstalling three different versions and came here to report a bug. Then I found another line in the README.md saying that "It is not possible to recode/compress a PDF without hOCR files". This is a crucial piece of information, but it is somewhat hidden. It is also not easy to find out how to generate such a file.
A google search suggested that I can use
tesseract scan.tif scan hocr
to generate an hOCR file from a TIF. This would help for a single TIF file, but apparently tesseract does not accept PDF format.
I suggest that
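For the multipage case described above, here is a minimal sketch of one possible workaround. It is not from the README or the maintainer; it assumes that pdftoppm (from poppler-utils) and tesseract are both installed and on the PATH. It renders every page of the scanned PDF to its own TIFF and then runs the same tesseract command as above on each page. The function names, the "page" prefix and the 300 dpi default are made up for illustration. The thread does not say how the resulting per-page hOCR files should be handed to recode_pdf, so the sketch stops once they are generated.

```python
import subprocess
from pathlib import Path


def split_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Build the pdftoppm call that renders each PDF page to a TIFF.

    Assumption: pdftoppm comes from poppler-utils; 300 dpi is an arbitrary default.
    """
    return ["pdftoppm", "-r", str(dpi), "-tiff", str(pdf), str(prefix)]


def ocr_command(image: Path) -> list[str]:
    """Build the tesseract call from the issue: tesseract page.tif page hocr."""
    # The second argument is the output base name; tesseract appends ".hocr".
    return ["tesseract", str(image), str(image.with_suffix("")), "hocr"]


def pdf_to_hocr_pages(pdf: Path, workdir: Path) -> list[Path]:
    """Render all pages of `pdf` into `workdir` and OCR each one to hOCR."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(split_command(pdf, workdir / "page"), check=True)

    hocr_files = []
    # pdftoppm names its output page-<number>.tif; glob instead of guessing
    # the exact zero padding of the page number.
    for image in sorted(workdir.glob("page-*.tif")):
        subprocess.run(ocr_command(image), check=True)
        hocr_files.append(image.with_suffix(".hocr"))
    return hocr_files
```

Calling pdf_to_hocr_pages(Path("scan.pdf"), Path("work")) should leave one .hocr file per page in the work directory, provided both tools run without errors.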