AmineDiro/ferrules

Ferrules: Modern, fast, document parser written in 🦀


Ferrules is an opinionated, high-performance document parsing library designed to generate LLM-ready documents efficiently. Unlike alternatives such as unstructured, which are slow and Python-based, ferrules is written in Rust and aims to provide a seamless experience with robust deployment across platforms.

NOTE: A ferrule (a corruption of Latin viriola; on a pencil, known as a shoe) is any of a number of types of objects, generally used for fastening, joining, sealing, or reinforcement.

Features

  • 📄 PDF Parsing and Layout Extraction:

    • Utilizes pdfium2 to parse documents.
    • Supports OCR using Apple's Vision on macOS (using objc2 Rust bindings and VNRecognizeTextRequest functionality).
    • Extracts and analyzes page layouts with advanced preprocessing and postprocessing techniques.
    • Accelerates model inference on the Apple Neural Engine (ANE) or GPU (using the ort library).
    • Merges layout with PDF text lines for comprehensive document understanding.
  • 🔄 Document Transformation:

    • Groups captions, footers, and other elements intelligently.
    • Structures lists and merges blocks into cohesive sections.
    • Detects headings and titles using machine learning for logical document structuring.
  • 🖨️ Rendering: Provides HTML, Markdown, and JSON rendering options for versatile use cases.

  • ⚡ High Performance & Easy Deployment:

    • Built with Rust for maximum speed and efficiency
    • Zero-dependency deployment (no Python runtime required!)
    • Hardware-accelerated ML inference (Apple Neural Engine, GPU)
    • Designed for production environments with minimal setup
  • ⚙️ Advanced Functionalities:

    • Offers configurable inference parameters for optimized processing.
    • Batch inference on document pages (coming soon).
  • 🛠️ API and CLI:

    • Provides both a CLI and API interface
    • Supports tracing

Installation

⚠️ Note: Ferrules currently works only on macOS. Linux support (with NVIDIA GPU acceleration) is coming soon.

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/aminediro/ferrules/releases/download/v0.1.0/ferrules-installer.sh | sh

Once installed, you can verify the installation by running:

ferrules --version

Usage

Ferrules can be used via command line with various options to control the parsing process.

Basic Usage

ferrules path/to/your.pdf

This will parse the PDF and save the results in the current directory:

ferrules file.pdf
[00:00:02] [########################################] Parsed document in 108ms
✓ Results saved in: ./file-results.json
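
Since each invocation parses a single file, a whole folder can be handled with a small wrapper. The following is a minimal sketch, not part of ferrules itself: `parse_all_pdfs` and the `FERRULES_BIN` override are hypothetical names introduced here for illustration, and the only ferrules feature it relies on is the `--output-dir` option listed under Available Options.

```shell
# Parse every PDF in a directory, one ferrules invocation per file.
# parse_all_pdfs and FERRULES_BIN are hypothetical names for this sketch;
# FERRULES_BIN defaults to the installed `ferrules` binary.
parse_all_pdfs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded glob when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        "${FERRULES_BIN:-ferrules}" "$pdf" --output-dir "$out_dir" \
            || echo "failed: $pdf" >&2
    done
}
```

Calling `parse_all_pdfs ./papers ./parsed` would then run ferrules once per PDF in `./papers`, reporting any file that fails on stderr and continuing with the rest.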

Debug Mode

To get detailed processing information and debug outputs:

ferrules megatrends.pdf --debug
[00:00:02] [########################################] Parsed document in 257ms
ℹ Debug output saved in: /var/folders/x1/1fktcq215tl73kk60bllw9rc0000gn/T/ferrules-XXXX
✓ Results saved in: ./megatrends-results.json

Debug mode generates visual output showing the parsing results for each page:

(Figures: "Debug Page 1" and "Wizard of Oz, Scanned", two example debug renderings.)

Each color represents different elements detected in the document:

  • 🟦 Layout detection
  • 🟩 OCR parsed lines
  • 🟥 Pdfium parsed lines

Available Options

Options:
      --n-page <N_PAGE>
          Limit parsing to the N first pages
      --output-dir <OUTPUT_DIR>
          Specify the directory to store parsing result [env: FERRULES_OUTPUT_DIR=]
      --layout-model-path <LAYOUT_MODEL_PATH>
          Specify the path to the layout model for document parsing [env: FERRULES_LAYOUT_MODEL_PATH=]
      --coreml
          Enable or disable the use of CoreML for layout inference
      --cuda
          Enable or disable the use of CUDA for layout inference
      --debug
          Activate debug mode for detailed processing information [env: FERRULES_DEBUG=]
      --debug-dir <DEBUG_DIR>
          Specify the directory to store debug output files [env: FERRULES_DEBUG_PATH=]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
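
Several of these options are often useful together, for example when checking how a long document parses before committing to a full run. The sketch below combines `--n-page`, `--output-dir`, `--debug`, and `--debug-dir` from the list above. `preview_pdf`, its default values, and the `FERRULES_BIN` override are hypothetical and exist only for illustration; the directories are created up front because the help text does not say whether ferrules creates them.

```shell
# Preview run: parse only the first pages of a PDF with debug output enabled.
# preview_pdf and FERRULES_BIN are hypothetical names for this sketch.
preview_pdf() {
    pdf=$1
    pages=${2:-3}            # number of leading pages to parse (assumed default)
    out_dir=${3:-./preview}  # where results and debug files go (assumed default)
    mkdir -p "$out_dir/debug" || return 1
    "${FERRULES_BIN:-ferrules}" "$pdf" \
        --n-page "$pages" \
        --output-dir "$out_dir" \
        --debug \
        --debug-dir "$out_dir/debug"
}
```

For instance, `preview_pdf report.pdf 2 ./check` would parse the first two pages of `report.pdf` and keep the debug files under `./check/debug`.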

You can also configure some options through environment variables:

  • FERRULES_OUTPUT_DIR: Set the output directory
  • FERRULES_LAYOUT_MODEL_PATH: Set the layout model path
  • FERRULES_DEBUG: Enable debug mode
  • FERRULES_DEBUG_PATH: Set the debug output directory
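
As a rough illustration of the environment-variable route, the sketch below sets two of these variables for a single invocation instead of passing `--output-dir` and `--debug-dir`. `run_with_env` and `FERRULES_BIN` are hypothetical names used only here. `FERRULES_DEBUG` is left out on purpose, because the value it expects is not stated above.

```shell
# Run ferrules with settings supplied through the environment.
# run_with_env and FERRULES_BIN are hypothetical names for this sketch.
run_with_env() {
    pdf=$1
    out_dir=$2
    mkdir -p "$out_dir/debug" || return 1
    # The two assignments below apply to this one command only.
    FERRULES_OUTPUT_DIR="$out_dir" \
    FERRULES_DEBUG_PATH="$out_dir/debug" \
        "${FERRULES_BIN:-ferrules}" "$pdf"
}
```

Exporting the same variables in a shell profile or a service definition would apply them to every run, which can be more convenient than repeating flags in deployment scripts.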

Credits

This project uses models from the yolo-doclaynet repository. We are grateful to the contributors of that project.
