Ferrules: Modern, fast, document parser written in 🦀

Ferrules is an opinionated, high-performance document parsing library designed to generate LLM-ready documents efficiently. Unlike alternatives such as unstructured, which are Python-based and slow, ferrules is written in Rust and aims to provide a seamless experience with robust deployment across various platforms.

Note: A ferrule (a corruption of Latin viriola), known on a pencil as a shoe, is any of a number of types of objects, generally used for fastening, joining, sealing, or reinforcement.

Features

📄 PDF Parsing and Layout Extraction:
- Utilizes pdfium2 to parse documents.
- Supports OCR using Apple's Vision on macOS (using objc2 Rust bindings and VNRecognizeTextRequest functionality).
- Extracts and analyzes page layouts with advanced preprocessing and postprocessing techniques.
- Accelerates model inference on the Apple Neural Engine (ANE)/GPU (using the ort library).
- Merges layout with PDF text lines for comprehensive document understanding.
🔄 Document Transformation:
- Groups captions, footers, and other elements intelligently.
- Structures lists and merges blocks into cohesive sections.
- Detects headings and titles using machine learning for logical document structuring.
🖨️ Rendering: Provides HTML, Markdown, and JSON rendering options for versatile use cases.
⚡ High Performance & Easy Deployment:
- Built with Rust for maximum speed and efficiency
- Zero-dependency deployment (no Python runtime required!)
- Hardware-accelerated ML inference (Apple Neural Engine, GPU)
- Designed for production environments with minimal setup
⚙️ Advanced Functionalities:
- Offers configurable inference parameters for optimized processing.
- Batch inference on document pages. (COMING SOON)
🛠️ API and CLI:
- Provides both a CLI and API interface
- Supports tracing

Installation

⚠️ Note: Currently, Ferrules only works on macOS. Linux support (with NVIDIA GPU acceleration) is coming soon.

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/aminediro/ferrules/releases/download/v0.1.0/ferrules-installer.sh | sh

Once installed, you can verify the installation by running:

ferrules --version

Usage

Ferrules can be used via command line with various options to control the parsing process.

Basic Usage

ferrules path/to/your.pdf

This will parse the PDF and save the results in the current directory:

ferrules file.pdf
[00:00:02] [########################################] Parsed document in 108ms
✓ Results saved in: ./file-results.json

Debug Mode

To get detailed processing information and debug outputs:

ferrules path/to/your.pdf --debug
[00:00:02] [########################################] Parsed document in 257ms
ℹ Debug output saved in: /var/folders/x1/1fktcq215tl73kk60bllw9rc0000gn/T/ferrules-XXXX
✓ Results saved in: ./megatrends-results.json

Debug mode generates visual output showing the parsing results for each page. Each color marks a different kind of element detected in the document:

🟦 Layout detection
🟩 OCR parsed lines
🟥 Pdfium parsed lines

Available Options

Options:
      --n-page <N_PAGE>
          Limit parsing to the N first pages
      --output-dir <OUTPUT_DIR>
          Specify the directory to store parsing result [env: FERRULES_OUTPUT_DIR=]
      --layout-model-path <LAYOUT_MODEL_PATH>
          Specify the path to the layout model for document parsing [env: FERRULES_LAYOUT_MODEL_PATH=]
      --coreml
          Enable or disable the use of CoreML for layout inference
      --cuda
          Enable or disable the use of CUDA for layout inference
      --debug
          Activate debug mode for detailed processing information [env: FERRULES_DEBUG=]
      --debug-dir <DEBUG_DIR>
          Specify the directory to store debug output files [env: FERRULES_DEBUG_PATH=]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
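
As an illustration of how the options above combine, here is a minimal sketch that assembles a ferrules invocation from another Rust program using std::process::Command. The helper name ferrules_command and its parameters are hypothetical and not part of ferrules; the sketch only uses flags listed in the help text above, and it prints the command line instead of spawning the process, so it runs even where ferrules is not installed.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: builds (but does not run) a `ferrules` invocation.
/// Only flags shown in the help output above are used.
fn ferrules_command(
    pdf: &Path,
    n_page: Option<u32>,
    output_dir: Option<&Path>,
    debug: bool,
) -> Command {
    let mut cmd = Command::new("ferrules");
    cmd.arg(pdf);
    if let Some(n) = n_page {
        // Limit parsing to the first N pages.
        cmd.arg("--n-page").arg(n.to_string());
    }
    if let Some(dir) = output_dir {
        // Directory where the parsing result is stored.
        cmd.arg("--output-dir").arg(dir);
    }
    if debug {
        cmd.arg("--debug");
    }
    cmd
}

fn main() {
    let cmd = ferrules_command(
        Path::new("path/to/your.pdf"),
        Some(5),
        Some(Path::new("./out")),
        true,
    );
    // Print the assembled command line rather than executing it.
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("ferrules {}", args.join(" "));
}
```

To actually run the parser, call status() or output() on the returned Command; this assumes the ferrules binary is on the PATH.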

You can also configure some options through environment variables:

FERRULES_OUTPUT_DIR: Set the output directory
FERRULES_LAYOUT_MODEL_PATH: Set the layout model path
FERRULES_DEBUG: Enable debug mode
FERRULES_DEBUG_PATH: Set the debug output directory
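
The same configuration can be passed through the environment when launching ferrules from another process. The sketch below is an assumption-laden example, not part of ferrules: the helper name ferrules_with_env is hypothetical, and it sets only the two directory variables listed above, since the values accepted by FERRULES_DEBUG are not documented here. Debug mode is therefore enabled with the --debug flag instead.

```rust
use std::process::Command;

/// Hypothetical helper: prepares a `ferrules` run whose output and debug
/// directories are supplied through environment variables.
fn ferrules_with_env(pdf: &str, output_dir: &str, debug_dir: &str) -> Command {
    let mut cmd = Command::new("ferrules");
    cmd.arg(pdf)
        .arg("--debug")
        // Variable names are taken from the list above.
        .env("FERRULES_OUTPUT_DIR", output_dir)
        .env("FERRULES_DEBUG_PATH", debug_dir);
    cmd
}

fn main() {
    let cmd = ferrules_with_env("path/to/your.pdf", "./out", "./debug");
    // Show the variables that were explicitly set on the command
    // (the process is not spawned here).
    for (key, value) in cmd.get_envs() {
        if let Some(value) = value {
            println!("{}={}", key.to_string_lossy(), value.to_string_lossy());
        }
    }
}
```

Variables set this way apply only to the child process and leave the parent environment untouched.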

Resources:

Apple vision text detection:
ort: https://ort.pyke.io/

Credits

This project uses models from the yolo-doclaynet repository. We are grateful to the contributors of that project.
