OCR-PDF Project

Overview

This project extracts text from PDF files using Tesseract optical character recognition (OCR). It downloads a PDF from a given URL, converts each page into an image, extracts the text from each image with Tesseract, and then generates a summary of the extracted text. The project is deployed on Replicate.
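The download, page rendering, and OCR steps described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual predict.py: the function names download_pdf, ocr_pages, and join_pages are hypothetical, the dpi value is an arbitrary choice, and the summarization step is left out.

```python
from typing import Iterable, List


def download_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the PDF at `url` and return its raw bytes."""
    import requests  # imported lazily; only needed when downloading

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on HTTP errors
    return response.content


def ocr_pages(pdf_bytes: bytes, dpi: int = 200) -> List[str]:
    """Render each PDF page to an image and run Tesseract on it."""
    from pdf2image import convert_from_bytes  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    images = convert_from_bytes(pdf_bytes, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(image) for image in images]


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR output, dropping pages that came back empty."""
    cleaned = (text.strip() for text in page_texts)
    return "\n\n".join(text for text in cleaned if text)
```

With these helpers, join_pages(ocr_pages(download_pdf(url))) would yield the full extracted text for the PDF at url.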

Usage

You can use this model directly on Replicate:

Try it on Replicate

Local Setup (Optional)

If you want to run the project locally:

Install Dependencies:

pip install requests pdf2image pytesseract transformers cog
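Note that pdf2image and pytesseract are Python wrappers: they call the Poppler utilities and the Tesseract binary, which pip does not install. A quick way to check whether those executables are on your PATH might look like this (missing_tools is a hypothetical helper, not part of this project):

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("tesseract", "pdftoppm")) -> List[str]:
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```

If anything is reported missing, install it with your system package manager before running the predictor.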

Run the Predictor:

from predict import Predictor
import json

# Instantiate the Predictor class
predictor = Predictor()

# Replace with the actual URL of your PDF
json_output = predictor.predict(url="https://example.com/your-pdf.pdf")

# Output the full JSON result
print(json_output)

# Optionally, parse the JSON to access specific fields
result = json.loads(json_output)
print("Summary:", result["summary"])

This will download the PDF, convert it to images, extract the text, generate a summary, and return the results in JSON format.
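How the summary and the JSON output are built is not shown in this README. One plausible sketch follows, assuming a transformers version that provides the "summarization" pipeline task with its default model. The helpers chunk_text, hf_summarize, and build_result are hypothetical, as is the "text" field; only "summary" appears in the usage example above.

```python
import json
from functools import lru_cache
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 2000) -> List[str]:
    """Split text on whitespace into chunks of roughly max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


@lru_cache(maxsize=1)
def _load_summarizer():
    """Load the summarization pipeline once; model loading is slow."""
    from transformers import pipeline  # imported lazily

    return pipeline("summarization")


def hf_summarize(chunk: str) -> str:
    """Summarize one chunk with the default summarization model."""
    return _load_summarizer()(chunk, truncation=True)[0]["summary_text"]


def build_result(text: str, summarize: Callable[[str], str]) -> str:
    """Summarize text chunk by chunk and return the result as a JSON string."""
    summary = " ".join(summarize(chunk) for chunk in chunk_text(text))
    return json.dumps({"text": text, "summary": summary})
```

Calling build_result(extracted_text, hf_summarize) would return a JSON string that can be parsed with json.loads, as in the example above. Chunking is used here because summarization models accept only a limited input length; passing the summarizer in as an argument also makes it easy to swap in a different one.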

License

This project is licensed under the MIT License.
