This project is using Tesseract OCR to convert images to text - then removing PII informatio

CatchTheTornado/llm-pdf-ocr-anonimizer

Safely send PDF documents to LLMs

This tool uses in-browser Tesseract OCR to extract text from PDF files and images.

It then anonymizes the extracted text by removing PII (Personally Identifiable Information) so you can safely send it to ChatGPT. A nice side effect is that you can also use it to scan PDF documents before passing them to non-multimodal LLMs (Ollama ...).
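
To illustrate the anonymization step, here is a minimal sketch of regex-based PII masking applied to OCR output. It is not taken from this repository: the `anonymize` function, the `PiiRule` type, the three patterns, and the `[LABEL]` placeholders are all hypothetical and chosen only for illustration. Real PII detection usually needs broader rules than this.

```typescript
// Hypothetical sketch: mask a few common PII shapes in plain text.
// The rule set and placeholder labels are assumptions, not the project's own.
type PiiRule = { label: string; pattern: RegExp };

// Order matters: the SSN rule runs before the looser PHONE rule,
// otherwise the phone pattern would swallow SSN-shaped numbers.
const PII_RULES: PiiRule[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "PHONE", pattern: /\+?\d[\d\s().-]{7,}\d/g },
];

// Apply each rule in turn, replacing every match with a "[LABEL]" placeholder.
function anonymize(text: string, rules: PiiRule[] = PII_RULES): string {
  return rules.reduce(
    (acc, { label, pattern }) => acc.replace(pattern, `[${label}]`),
    text,
  );
}

const ocrText = "Contact john.doe@example.com or +1 (555) 123-4567, SSN 123-45-6789.";
console.log(anonymize(ocrText));
```

Replacing matches with labeled placeholders rather than deleting them keeps the sentence structure intact, which tends to help an LLM that later has to read or repair the text.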

In this example, ChatGPT is also used to clean up and fix Tesseract's OCR errors. This is a proof-of-concept (PoC) project intended for privacy-critical LLM use cases, such as those involving health data.

Getting Started

First, run the development server:

npm run dev
# or
yarn dev
# or
pnpm dev
# or
bun dev

Open http://localhost:3000 with your browser to see the result.
