Ferrules is an opinionated high-performance document parsing library designed to generate LLM-ready documents efficiently.
Unlike alternatives such as unstructured, which are slow and Python-based, ferrules is written in Rust and aims to provide a seamless experience with robust deployment across various platforms.
| NOTE A ferrule (a corruption of Latin viriola; on a pencil, known as a shoe) is any of a number of types of objects, generally used for fastening, joining, sealing, or reinforcement.
- 📄 PDF Parsing and Layout Extraction:
  - Utilizes pdfium2 to parse documents.
  - Supports OCR using Apple's Vision on macOS (using objc2 Rust bindings and VNRecognizeTextRequest functionality).
  - Extracts and analyzes page layouts with advanced preprocessing and postprocessing techniques.
  - Accelerates model inference on Apple Neural Engine (ANE)/GPU (using the ort library).
  - Merges layout with PDF text lines for comprehensive document understanding.
- 🔄 Document Transformation:
  - Groups captions, footers, and other elements intelligently.
  - Structures lists and merges blocks into cohesive sections.
  - Detects headings and titles using machine learning for logical document structuring.
- 🖨️ Rendering: Provides HTML, Markdown, and JSON rendering options for versatile use cases.
- ⚡ High Performance & Easy Deployment:
  - Built with Rust for maximum speed and efficiency
  - Zero-dependency deployment (no Python runtime required!)
  - Hardware-accelerated ML inference (Apple Neural Engine, GPU)
  - Designed for production environments with minimal setup
- ⚙️ Advanced Functionalities:
  - Offers configurable inference parameters for optimized processing.
  - Batch inference on document pages (coming soon).
- 🛠️ API and CLI:
  - Provides both a CLI and an API interface
  - Supports tracing
⚠️ Note: Currently, Ferrules only works on macOS. Linux support (with NVIDIA GPU acceleration) is coming soon!
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/aminediro/ferrules/releases/download/v0.1.0/ferrules-installer.sh | sh
Once installed, you can verify the installation by running:
ferrules --version
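Since Ferrules is currently macOS-only, a small preflight check can save a confusing failure later. The following is a minimal sketch, not part of Ferrules itself: it only assumes that `uname -s` reports `Darwin` on macOS and that the installer placed `ferrules` on your `PATH`.

```shell
#!/bin/sh
# Preflight sketch: warn on non-macOS systems and confirm the binary is reachable.
set -eu

os="$(uname -s)"
if [ "$os" != "Darwin" ]; then
  # The README states that only macOS is supported for now.
  echo "warning: ferrules currently supports macOS only (detected: $os)"
fi

if command -v ferrules >/dev/null 2>&1; then
  ferrules --version
else
  echo "ferrules was not found on PATH; re-run the installer or check your shell profile"
fi
```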
Ferrules can be used via command line with various options to control the parsing process.
ferrules path/to/your.pdf
This will parse the PDF and save the results in the current directory:
ferrules file.pdf
[00:00:02] [########################################] Parsed document in 108ms
✓ Results saved in: ./file-results.json
To get detailed processing information and debug outputs:
ferrules path/to/your.pdf --debug
[00:00:02] [########################################] Parsed document in 257ms
ℹ Debug output saved in: /var/folders/x1/1fktcq215tl73kk60bllw9rc0000gn/T/ferrules-XXXX
✓ Results saved in: ./megatrends-results.json
Debug mode generates visual output showing the parsing results for each page.
Each color represents a different type of element detected in the document:
- 🟦 Layout detection
- 🟩 OCR parsed lines
- 🟥 Pdfium parsed lines
Options:
  --n-page <N_PAGE>
      Limit parsing to the N first pages
  --output-dir <OUTPUT_DIR>
      Specify the directory to store parsing result [env: FERRULES_OUTPUT_DIR=]
  --layout-model-path <LAYOUT_MODEL_PATH>
      Specify the path to the layout model for document parsing [env: FERRULES_LAYOUT_MODEL_PATH=]
  --coreml
      Enable or disable the use of CoreML for layout inference
  --cuda
      Enable or disable the use of CUDA for layout inference
  --debug
      Activate debug mode for detailed processing information [env: FERRULES_DEBUG=]
  --debug-dir <DEBUG_DIR>
      Specify the directory to store debug output files [env: FERRULES_DEBUG_PATH=]
  -h, --help
      Print help (see more with '--help')
  -V, --version
      Print version
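The options above can be combined in a single invocation. The sketch below limits parsing to the first five pages and sends results and debug files to separate directories. The input path, the directory names, and the page count are placeholders chosen for illustration. The `mkdir -p` call is a precaution, since this README does not say whether Ferrules creates missing directories.

```shell
#!/bin/sh
# Sketch: combine several documented flags in one ferrules run.
set -eu

pdf="path/to/your.pdf"        # placeholder input file
out_dir="./parsed"            # placeholder results directory
debug_dir="./parsed-debug"    # placeholder debug directory

# Store the command in the positional parameters so paths with spaces survive.
set -- ferrules "$pdf" --n-page 5 --output-dir "$out_dir" --debug --debug-dir "$debug_dir"

if command -v ferrules >/dev/null 2>&1 && [ -f "$pdf" ]; then
  mkdir -p "$out_dir" "$debug_dir"
  "$@"
else
  # Without the binary or the input file, only show what would run.
  printf 'would run: %s\n' "$*"
fi
```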
You can also configure some options through environment variables:
- FERRULES_OUTPUT_DIR: Set the output directory
- FERRULES_LAYOUT_MODEL_PATH: Set the layout model path
- FERRULES_DEBUG: Enable debug mode
- FERRULES_DEBUG_PATH: Set the debug output directory
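As a rough illustration, the same directories can be set once per shell session through these variables instead of being repeated as flags. The directory names and input path are placeholders. Debug mode is still switched on with the `--debug` flag here, because the value that `FERRULES_DEBUG` expects is not stated in this README.

```shell
#!/bin/sh
# Sketch: configure output locations through environment variables.
set -eu

export FERRULES_OUTPUT_DIR="./parsed"          # placeholder results directory
export FERRULES_DEBUG_PATH="./parsed-debug"    # placeholder debug directory

pdf="path/to/your.pdf"                         # placeholder input file

if command -v ferrules >/dev/null 2>&1 && [ -f "$pdf" ]; then
  mkdir -p "$FERRULES_OUTPUT_DIR" "$FERRULES_DEBUG_PATH"
  # No --output-dir or --debug-dir needed: both come from the environment.
  ferrules "$pdf" --debug
else
  echo "would run: ferrules $pdf --debug (output: $FERRULES_OUTPUT_DIR, debug: $FERRULES_DEBUG_PATH)"
fi
```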
- Apple Vision text detection
- ort: https://ort.pyke.io/
This project uses models from the yolo-doclaynet repository. We are grateful to the contributors of that project.