A simple web application to extract text from various document formats using MarkItDown.
- Extract text from multiple document formats including PDF, Office documents, images, and more
- Simple web interface using Gradio
- Optional authentication
- Optional API visibility
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
You can run EZ_Txt using the pre-built Docker container:
docker pull ghcr.io/usnavy13/ez-txt:latest
docker run -p 7860:7860 ghcr.io/usnavy13/ez-txt:latest
Then access the application at http://localhost:7860
Create a .env file in the root directory with the following optional settings:
# Optional authentication (remove or leave empty to disable)
user=your_username
password=your_password
# Optional API visibility (default: false)
show_api=false
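As a rough illustration of how these optional settings could be interpreted, here is a minimal sketch. The function name `load_settings` is hypothetical, and the rules it applies (authentication only when both `user` and `password` are non-empty, `show_api` true only for the literal value `true`) are assumptions based on the comments above, not a copy of the application's code.

```python
import os


def load_settings(env=None):
    """Interpret the optional settings from an environment-style mapping.

    Hypothetical helper: the real application may parse these differently.
    """
    env = os.environ if env is None else env

    user = env.get("user", "").strip()
    password = env.get("password", "").strip()
    # Assumption: authentication is enabled only when both values are set.
    auth = (user, password) if user and password else None

    # Assumption: anything other than "true" (case-insensitive) hides the API.
    show_api = env.get("show_api", "false").strip().lower() == "true"

    return {"auth": auth, "show_api": show_api}


if __name__ == "__main__":
    print(load_settings({"user": "alice", "password": "s3cret", "show_api": "true"}))
```

In a Gradio app, values like these would typically be handed to `launch()` (for example through its `auth` argument); how EZ_Txt wires them up is not shown here.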
To use Azure Document Intelligence for text extraction, ensure you have an active Azure subscription and a Document Intelligence resource. Then add the following entries to your .env file:
AZURE_ENDPOINT=https://<your-azure-endpoint>
AZURE_API_KEY=<your-azure-api-key>
The application will automatically enable the "Azure Document Intelligence" extraction method when both variables are present. Make sure the environment variable names match those expected by the code.
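The "both variables present" rule can be sketched as below. This is an assumption about the behavior described above, not the application's actual code: `azure_enabled` and `available_methods` are hypothetical names, and the label "MarkItDown" for the default method is a placeholder.

```python
import os


def azure_enabled(env=None):
    """Return True only when both Azure variables are set and non-empty."""
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_ENDPOINT", "").strip()
    api_key = env.get("AZURE_API_KEY", "").strip()
    return bool(endpoint) and bool(api_key)


def available_methods(env=None):
    """List extraction methods; the default label here is a placeholder."""
    methods = ["MarkItDown"]
    if azure_enabled(env):
        methods.append("Azure Document Intelligence")
    return methods


if __name__ == "__main__":
    print(available_methods())
```

With only one of the two variables set, this sketch leaves the Azure method disabled, which makes a misspelled variable name easy to notice.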
- Run the application:
python main.py
- Open your browser and navigate to http://localhost:7860
- Upload a document and click "Extract text" to get the text content
- Documents: PDF, PPTX, DOCX, XLSX
- Images: PNG, JPG, JPEG
- Text: TXT, CSV, JSON, XML, HTML
- Archives: ZIP (will process contained files)
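To make the format list concrete, the sketch below classifies file names by extension and lists which members of a ZIP archive would be candidates for extraction. It uses only the Python standard library. The names `SUPPORTED`, `classify`, and `supported_members` are hypothetical, and skipping nested archives is an assumption; the application may handle them differently.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions taken from the list above, grouped by category.
SUPPORTED = {
    "Documents": {".pdf", ".pptx", ".docx", ".xlsx"},
    "Images": {".png", ".jpg", ".jpeg"},
    "Text": {".txt", ".csv", ".json", ".xml", ".html"},
    "Archives": {".zip"},
}


def classify(filename):
    """Return the category for a file name, or None if it is unsupported."""
    suffix = PurePosixPath(filename).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None


def supported_members(zip_bytes):
    """List archive members with a supported, non-archive extension."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            info.filename
            for info in archive.infolist()
            # Assumption: directories and nested archives are skipped.
            if not info.is_dir() and classify(info.filename) not in (None, "Archives")
        ]
```

Matching on the lowercased suffix means a file named `REPORT.PDF` is treated the same as `report.pdf`.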