OCR-PDF Project

Overview

This project extracts text from PDF files using Tesseract Optical Character Recognition (OCR). It downloads a PDF from a given URL, converts each page into an image, extracts the text from each image with Tesseract, and then generates a summary of the extracted text. The project is deployed on Replicate.
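The four stages described above can be sketched roughly as follows. This is a minimal illustration, not the code in predict.py: the function names, the `"text"` key in the result, and the caller-supplied `summarize` callable are all assumptions made for the example. Only `requests.get`, `pdf2image.convert_from_bytes`, and `pytesseract.image_to_string` are real library calls.

```python
from typing import Callable, Dict, List


def download_pdf(url: str) -> bytes:
    """Fetch the PDF and return its raw bytes."""
    import requests  # imported lazily so the module loads without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.content


def pdf_to_images(pdf_bytes: bytes) -> List:
    """Render every page to a PIL image (pdf2image needs Poppler installed)."""
    from pdf2image import convert_from_bytes

    return convert_from_bytes(pdf_bytes)


def ocr_image(image) -> str:
    """Run Tesseract on a single page image."""
    import pytesseract

    return pytesseract.image_to_string(image)


def run_pipeline(
    url: str,
    summarize: Callable[[str], str],  # hypothetical: any text -> summary function
    download: Callable[[str], bytes] = download_pdf,
    to_images: Callable[[bytes], List] = pdf_to_images,
    ocr: Callable[[object], str] = ocr_image,
) -> Dict[str, str]:
    """Download, rasterize, OCR page by page, then summarize the joined text."""
    pages = [ocr(image).strip() for image in to_images(download(url))]
    text = "\n\n".join(pages)
    # The "text" key is an assumption; the README only documents "summary".
    return {"text": text, "summary": summarize(text)}
```

Passing each stage in as a parameter keeps the sketch easy to try without network access or OCR binaries: any stage can be swapped for a stub while the others stay real.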

Usage

You can use this model directly on Replicate:

Try it on Replicate

Local Setup (Optional)

If you want to run the project locally:

Install Dependencies:

pip install requests pdf2image pytesseract transformers cog
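The pip packages above are Python wrappers: pytesseract calls the `tesseract` executable, and pdf2image calls Poppler tools such as `pdftoppm`. Both must be installed separately through your system package manager. A small check like the one below can confirm they are on your PATH before you run anything. It is a convenience sketch, and the helper name `missing_binaries` is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_binaries(
    names: Iterable[str] = ("tesseract", "pdftoppm"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```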

Run the Predictor:

from predict import Predictor
import json

# Instantiate the Predictor class
predictor = Predictor()

# Replace with the actual URL of your PDF
json_output = predictor.predict(url="https://example.com/your-pdf.pdf")

# Output the full JSON result
print(json_output)

# Optionally, parse the JSON to access specific fields
result = json.loads(json_output)
print("Summary:", result["summary"])

This will download the PDF, convert it to images, extract the text, generate a summary, and return the results in JSON format.
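If you consume the JSON output from other code, it can help to validate it before indexing into it. The sketch below assumes only what this README shows: that the output is a JSON string containing an object with a `"summary"` field. The helper name `get_summary` is hypothetical, and any other fields in the real output are not documented here.

```python
import json


def get_summary(json_output: str) -> str:
    """Parse the predictor's JSON string and return its "summary" field.

    Raises ValueError if the string is not valid JSON (json.JSONDecodeError
    is a ValueError subclass) or if the "summary" field is absent.
    """
    result = json.loads(json_output)
    if not isinstance(result, dict) or "summary" not in result:
        raise ValueError('expected a JSON object with a "summary" field')
    return result["summary"]
```

For example, `get_summary(json_output)` can replace the `json.loads` and indexing steps shown earlier when you want a clear error instead of a `KeyError`.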

License

This project is licensed under the MIT License.

vwtyler/ocr-pdf
