Ultra-fast RAG system with 6x compression and competitive performance
Hilbert Quantization is a high-performance Retrieval-Augmented Generation (RAG) system that combines Hilbert curve mapping with MPEG-AI compression to deliver both speed and storage efficiency. It's designed for building efficient RAG applications where both search performance and storage costs matter.
- Document Processing Pipeline: Comprehensive document chunking, metadata management, and IPFS integration
- Advanced Embedding Generation: Hierarchical index embedding with compression and reconstruction capabilities
- Dual Video Storage: Synchronized embedding and document storage with frame-based retrieval
- Progressive Search Engine: Multi-stage search with frame caching and similarity calculation
- Batch Document Processing: High-performance parallel processing with progress tracking
- Document Validation: Comprehensive validation with metadata verification and content analysis
- End-to-End Pipeline: Complete workflow from document ingestion to search results
- Complete RAG Pipeline: Document processing, embedding generation, and similarity search
- Ultra-fast retrieval: Sub-millisecond to few-millisecond search times
- 6x compression: Massive storage savings for document embeddings
- Video-enhanced storage: Advanced temporal coherence for better compression
- Document validation: Comprehensive quality checks and metadata management
- Easy to use: Simple RAG API with sensible defaults
- Pure Python: No external dependencies beyond NumPy
Method | Search Time | Storage Size | Compression | Use Case |
---|---|---|---|---|
Hilbert Quantization | 4.6ms | 0.02GB | 6x | Best overall |
Pinecone (Managed) | 2.1ms | 0.19GB | 1x | Speed-first |
FAISS (GPT-4 style) | 4.8ms | 0.16GB | 1x | Accuracy-first |
Brute Force | 5.9ms | 0.14GB | 1x | Simple baseline |
Benchmark on 25K embeddings (1536D, GPT-4 style)
```bash
pip install hilbert-quantization

# For benchmarking and visualization
pip install hilbert-quantization[benchmark]

# For GPU acceleration (experimental)
pip install hilbert-quantization[gpu]

# For development
pip install hilbert-quantization[dev]

# Complete installation with all features
pip install hilbert-quantization[dev,benchmark,gpu]
```
```python
from hilbert_quantization.rag import RAGSystem, RAGConfig

# Initialize RAG system with default settings
config = RAGConfig(
    chunk_size=512,
    overlap_size=50,
    embedding_dimension=1024
)
rag_system = RAGSystem(config)

# Add documents to your RAG system
documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms.",
    "Natural language processing enables computers to understand human language.",
    "Computer vision allows machines to interpret and understand visual information."
]

# Process and store documents
for i, doc in enumerate(documents):
    rag_system.add_document(f"doc_{i}", doc)

# Search for relevant information
query = "What is machine learning?"
results = rag_system.search(query, max_results=3)

# Display results
for result in results:
    print(f"Document: {result.document_id}")
    print(f"Similarity: {result.similarity_score:.3f}")
    print(f"Content: {result.content}")
    print("---")
```
Build a complete RAG system with document processing and similarity search:
```python
from hilbert_quantization.rag import RAGSystem, RAGConfig

# Initialize RAG system
config = RAGConfig(
    chunk_size=512,
    overlap_size=50,
    embedding_dimension=1024,
    max_frames_per_video=1000
)
rag_system = RAGSystem(config)

# Process documents
documents = [
    "This is the first document about machine learning.",
    "This document discusses natural language processing.",
    "Here we talk about computer vision and image recognition."
]

# Add documents to the system
for i, doc in enumerate(documents):
    document_id = f"doc_{i}"
    rag_system.add_document(document_id, doc)

# Search for similar content
query = "machine learning algorithms"
results = rag_system.search(query, max_results=5)

# Print results
for result in results:
    print(f"Document: {result.document_id}")
    print(f"Similarity: {result.similarity_score:.3f}")
    print(f"Content: {result.content[:100]}...")
```
Use advanced document processing and embedding generation:
```python
from hilbert_quantization.rag.document_processing import BatchDocumentProcessor
from hilbert_quantization.rag.embedding_generation import EmbeddingGenerator
from hilbert_quantization.rag.search import ProgressiveSearchEngine

# Initialize components
batch_processor = BatchDocumentProcessor(
    chunk_size=512,
    overlap_size=50,
    parallel_workers=4
)
embedding_generator = EmbeddingGenerator(
    dimension=1024,
    use_compression=True
)
search_engine = ProgressiveSearchEngine(
    use_frame_caching=True,
    cache_size=1000
)

# Process a large document collection
# (load_document_collection is a placeholder for your own document loader)
documents = load_document_collection("path/to/documents/")
processed_docs = batch_processor.process_documents(documents)

# Generate embeddings with hierarchical indices
for doc in processed_docs:
    doc.embedding = embedding_generator.generate_embedding(doc.content)

# Add to search engine
for doc in processed_docs:
    search_engine.add_document(doc)

# Perform similarity search
query = "What is machine learning?"
results = search_engine.search(query, max_results=10)

print(f"Found {len(results)} relevant documents")
for result in results:
    print(f"Document: {result.document_id}")
    print(f"Similarity: {result.similarity_score:.3f}")
```
For large datasets or memory-constrained environments:
```python
import numpy as np

from hilbert_quantization import QuantizationConfig, HilbertQuantizer

# Configure streaming optimization
config = QuantizationConfig(
    use_streaming_optimization=True,  # Enable streaming
    enable_integrated_mapping=True,   # Single-pass processing
    memory_efficient_mode=True        # Optimize for memory
)

# Create quantizer with streaming enabled
quantizer = HilbertQuantizer(config=config)

# Process a large dataset with constant memory usage
large_params = np.random.randn(1_000_000).astype(np.float32)  # 1M parameters
quantized = quantizer.quantize(large_params, model_id="large_model")

print(f"Processed {large_params.size:,} parameters with constant memory usage")
print(f"Compression ratio: {quantized.metadata.compression_ratio:.2f}x")
```
Ensure document quality and track performance:
```python
from hilbert_quantization.rag.validation import DocumentValidator, RAGValidator
from hilbert_quantization.rag.document_processing import MetadataManager

# Initialize validation components
doc_validator = DocumentValidator()
rag_validator = RAGValidator()
metadata_manager = MetadataManager()

# Validate documents before processing
for doc in documents:
    validation_result = doc_validator.validate_document(doc)
    if validation_result.is_valid:
        # Add metadata
        doc.metadata = metadata_manager.extract_metadata(doc)
        # Process document
        processed_doc = rag_system.add_document(doc.id, doc.content)
        print(f"Added document {doc.id} with {len(processed_doc.chunks)} chunks")
    else:
        print(f"Document {doc.id} failed validation: {validation_result.errors}")

# Validate RAG system performance
performance_metrics = rag_validator.validate_system_performance(rag_system)
print(f"Search accuracy: {performance_metrics.search_accuracy:.3f}")
print(f"Retrieval speed: {performance_metrics.avg_retrieval_time:.2f}ms")
print(f"Compression ratio: {performance_metrics.compression_ratio:.2f}x")
```
- **ORB Keypoint Detection**: Structural feature matching between model representations
- **Template Matching**: Direct pattern correlation for similar model architectures
- **Histogram Comparison**: Statistical distribution analysis of parameter values
- **SSIM Analysis**: Structural similarity assessment for fine-grained comparison
- **Temporal Coherence**: Neighboring frame analysis for context-aware similarity scoring
### Cache-Optimized Search (Recommended for Production)
```python
from hilbert_quantization import HilbertQuantizer
from hilbert_quantization.optimized import CacheOptimizedDatabase, CacheOptimizedSearch
# Setup
quantizer = HilbertQuantizer()
search_engine = CacheOptimizedSearch()
# Quantize your embeddings
quantized_models = [quantizer.quantize(emb, f"id_{i}") for i, emb in enumerate(embeddings)]
# Build cache-optimized database (one-time setup)
database = CacheOptimizedDatabase(quantized_models)
# Pre-quantize your query (for multiple searches)
query_quantized = quantizer.quantize(query_embedding, "query")
# Ultra-fast search
results = search_engine.cache_optimized_search(
    query_quantized.hierarchical_indices,
    database,
    max_results=10
)
```
Perfect For:
- AI Model Archives: Store thousands of model checkpoints with 8.2x compression
- Model Version Control: Track model evolution with temporal coherence analysis
- Research Datasets: Organize large collections of neural network models with video-based similarity search
- Model Marketplaces: Enable efficient browsing and discovery of similar models
- Distributed AI Systems: Minimize bandwidth usage with compressed video model transmission
Ideal For:
- Model Similarity Research: Find architecturally similar models across different domains
- Transfer Learning: Identify pre-trained models with similar parameter distributions
- Model Compression Studies: Analyze compression effectiveness across model architectures
- AI Model Cataloging: Build searchable databases of transformer models with metadata
- Cross-Architecture Analysis: Compare models regardless of specific implementation details
Essential For:
- Memory-Constrained Environments: Process models larger than available RAM (93% memory reduction)
- Edge Computing: Deploy model processing on resource-limited devices
- Cloud Cost Optimization: Reduce memory requirements and associated costs
- Large Model Analysis: Process multi-billion parameter models without infrastructure scaling
- Real-Time Model Processing: Stream and encode models as they're being trained
Excellent For:
- Large-scale RAG systems (>100K documents with 6x compression)
- Similarity Search Databases (sub-millisecond to few-millisecond search times)
- Cost-optimized Cloud Storage (massive storage savings with competitive performance)
- Bandwidth-limited Systems (efficient data transmission with maintained accuracy)
Consider Alternatives For:

Real-time Inference Applications:
- Need <1ms latency consistently
- Require immediate response without any processing overhead
- Critical path applications where every microsecond matters
Very Small Datasets:
- <10K embeddings where setup overhead exceeds benefits
- Simple applications with minimal storage or performance requirements
- Prototype systems where development speed is prioritized over optimization
Maximum Speed Priority:
- Applications where search speed is the only consideration
- Systems with unlimited memory and storage resources
- Use cases where compression and storage efficiency are not important
Metric | Traditional | Video Features | Hybrid | Temporal Coherence |
---|---|---|---|---|
Search Accuracy | Baseline | +25% | +35% | +45% |
Search Speed | Baseline | -40% | +15% | +20% |
Compression Ratio | 2.1:1 | 2.8:1 | 4.2:1 | 5.1:1 |
File Size Reduction | Baseline | 25% | 50% | 58% |
Storage Method | Compression Ratio | Search Speed | Memory Usage | Temporal Coherence |
---|---|---|---|---|
Video Storage | 8.2x | 3.1ms | Constant | 0.847 |
Individual Images | 6.1x | 4.6ms | Linear | N/A |
Raw Quantized | 1.0x | 2.8ms | High | N/A |
Model Size | Batch Method | Streaming Method | Memory Reduction | Speed Comparison |
---|---|---|---|---|
BERT-base (110M) | 2.1GB RAM | 0.5GB RAM | 76% reduction | +15% time |
GPT-2 (1.5B) | 6.8GB RAM | 0.5GB RAM | 93% reduction | +22% time |
T5-large (3B) | Memory Error | 0.5GB RAM | Enables processing | N/A |
Search Method | Speed | Accuracy | Use Case |
---|---|---|---|
Hierarchical | Fastest | Good | Initial filtering, large datasets |
Video Features | Medium | Highest | Detailed analysis, small datasets |
Hybrid | Balanced | Excellent | Production recommended |
Comprehensive Analysis: See Performance Benchmarks for detailed analysis, scaling characteristics, compression benefits, and optimization guidelines.
```bash
# Basic video encoding
python examples/huggingface_video_encoder.py

# Streaming large models
python examples/streaming_huggingface_encoder.py --model microsoft/DialoGPT-large --stream

# Hybrid search demonstration
python examples/hybrid_search_demo.py

# Video frame ordering optimization
python examples/video_frame_ordering_demo.py

# Performance comparison across methods
python examples/search_performance_comparison.py
```
Temporal Compression Optimization:
- 4-8% compression improvement through hierarchical index-based frame ordering
- Automatic frame insertion at optimal positions to maintain temporal coherence
- Real-time optimization of existing video files without quality loss
- Multiple ordering strategies with performance benchmarking
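The ordering idea above can be sketched as a greedy nearest-neighbor pass over per-frame hierarchical index vectors: each new frame is placed next to the most similar one already in the sequence. This is an illustrative simplification, not the library's actual implementation:

```python
import numpy as np

def greedy_frame_order(indices: np.ndarray) -> list:
    """Order frames so that neighboring frames have similar index vectors.

    `indices` is an (n_frames, index_dim) array of hierarchical index
    vectors. Starting from frame 0, repeatedly append the closest
    unvisited frame (Euclidean distance).
    """
    n = len(indices)
    remaining = set(range(1, n))
    order = [0]
    while remaining:
        last = indices[order[-1]]
        nearest = min(remaining,
                      key=lambda i: float(np.linalg.norm(indices[i] - last)))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Frames 0 and 2 are similar, as are 1 and 3; the greedy pass groups them
order = greedy_frame_order(np.array([[0.0], [10.0], [1.0], [11.0]]))
```

Smoother index transitions between neighboring frames mean smaller inter-frame deltas, which is where the compression gain comes from.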
Video-Enhanced Search:
- Computer vision algorithms: ORB features, template matching, histogram comparison
- Hybrid similarity scoring: Weighted combination of video features (60%) and hierarchical indices (40%)
- Temporal coherence analysis: Neighboring frame relationships for context-aware search
- Parallel processing: Multi-threaded search across video files for performance
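The 60/40 weighting described above can be illustrated with a minimal scoring helper (hypothetical names; the library's internal scoring may differ):

```python
def hybrid_similarity(video_score: float, hierarchical_score: float,
                      video_weight: float = 0.6, hier_weight: float = 0.4) -> float:
    """Weighted combination of video-feature and hierarchical-index similarity.

    Both scores are assumed to be normalized to [0, 1]; the default
    60/40 split mirrors the weighting described above.
    """
    if abs(video_weight + hier_weight - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return video_weight * video_score + hier_weight * hierarchical_score

# Example: strong visual match, moderate hierarchical match
combined = hybrid_similarity(0.8, 0.5)
```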
Model Parameter Extraction:
- Direct integration with HuggingFace Transformers library
- Stratified sampling for large models to maintain parameter representativeness
- Layer filtering by type (attention, MLP, embeddings) for targeted analysis
- Architecture detection and metadata storage for cross-model similarity search
Model Registry and Tracking:
- Comprehensive model registry with encoding statistics and performance metrics
- Cross-architecture similarity search to find related models regardless of structure
- Encoding performance tracking with compression ratios and processing times
- Model metadata persistence including architecture details and parameter counts
Memory-Efficient Processing:
- Constant O(1) memory usage regardless of model size
- Layer-by-layer parameter extraction without loading full models into memory
- Chunk-based encoding with configurable chunk sizes for optimal performance
- Progress tracking with real-time parameter counts and processing rates
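The chunk-based, layer-by-layer idea can be sketched as a generator that only ever holds one layer plus a small carry-over buffer in memory (an illustrative sketch, not the package's API):

```python
import numpy as np

def stream_parameter_chunks(layers, chunk_size: int = 2048):
    """Yield fixed-size parameter chunks one layer at a time.

    `layers` is any iterable of (name, array) pairs, e.g. a state dict's
    items. Only the current layer and a remainder buffer are resident,
    so peak memory stays roughly constant regardless of model size.
    """
    buffer = np.empty(0, dtype=np.float32)
    for name, array in layers:
        flat = np.asarray(array, dtype=np.float32).ravel()
        buffer = np.concatenate([buffer, flat])
        # Emit full chunks, keep the remainder for the next layer
        while buffer.size >= chunk_size:
            yield buffer[:chunk_size], name
            buffer = buffer[chunk_size:]
    if buffer.size:
        yield buffer, "tail"  # final partial chunk
```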
Advanced Streaming Features:
- Resume capability for interrupted encoding processes
- Target layer filtering to process specific model components
- Real-time encoding with immediate video frame generation
- Streaming validation to ensure accuracy matches batch processing results
Hierarchical Index-Based Frame Ordering:
```python
# Automatic frame ordering for optimal compression
video_storage = VideoModelStorage(storage_dir="models", max_frames_per_video=1000)

# Models are automatically ordered by hierarchical index similarity
for model_name in ["bert-base", "distilbert", "roberta", "gpt2"]:
    quantized = quantizer.encode_huggingface_model(model_name)
    frame_metadata = video_storage.add_model(quantized)  # Inserted at optimal position

# Analyze compression benefits
metrics = video_storage.get_frame_ordering_metrics("model_video.mp4")
print(f"Temporal coherence: {metrics['temporal_coherence']:.3f}")
print(f"Compression efficiency: {metrics['ordering_efficiency']:.3f}")
```
Key Benefits:
- 4-8% compression improvement over random frame ordering
- Automatic optimal insertion of new frames based on hierarchical similarity
- Real-time optimization of existing video files without quality loss
- Temporal coherence analysis for neighboring frame relationships
Multi-Modal Similarity Detection:
```python
# Hybrid search combining video features and hierarchical indices
search_engine = VideoEnhancedSearchEngine(video_storage)

# Compare different search methods
comparison = search_engine.compare_search_methods(
    query_model,
    methods=['hierarchical', 'video_features', 'hybrid']
)

# Analyze individual similarity components
for result in hybrid_results:
    print(f"Video features: {result.video_similarity_score:.3f}")
    print(f"Hierarchical: {result.hierarchical_similarity_score:.3f}")
    print(f"Combined: {result.similarity_score:.3f}")  # Weighted combination
```
Computer Vision Algorithms:
- ORB Keypoint Detection: Structural feature matching for architectural similarity
- Template Matching: Direct pattern correlation for parameter distribution analysis
- Histogram Comparison: Statistical similarity of parameter value distributions
- SSIM Analysis: Structural similarity index for fine-grained comparison
- Temporal Coherence: Context-aware scoring using neighboring frame relationships
Constant Memory Processing:
```python
# Process models larger than available RAM
encoder = StreamingHuggingFaceEncoder(chunk_size=2048)

# Stream model parameters without loading the full model;
# memory usage remains constant regardless of model size
for chunk, layer_info, progress in encoder.stream_model_parameters("gpt2-xl"):
    print(f"Processing {layer_info}: {progress.progress_percent:.1f}% complete")
```
Streaming Advantages:
- 93% memory reduction for large models (GPT-2 1.5B: 6.8GB → 0.5GB)
- Layer-by-layer processing without full model loading
- Real-time progress tracking with parameter counts and processing rates
- Resume capability for interrupted encoding processes
Method Selection Matrix:
Use Case | Recommended Method | Memory | Speed | Accuracy | Best For |
---|---|---|---|---|---|
Large Model Collections | Video Storage | Constant | Fast | Excellent | Model archives, version control |
Memory-Constrained | Streaming Processing | O(1) | Medium | Excellent | Edge computing, cloud cost optimization |
Production Search | Hybrid Search | Medium | Balanced | Highest | Similarity search, model discovery |
Fast Filtering | Hierarchical Search | Low | Fastest | Good | Initial candidate selection |
Small Models | Batch Processing | High | Fastest | Excellent | Development, prototyping |
Performance Scaling:
Model Size | Traditional Memory | Streaming Memory | Speed Impact | Recommendation |
---|---|---|---|---|
<100M params | 0.4GB | 0.5GB | +5% | Traditional |
100M-1B params | 2-8GB | 0.5GB | +15% | Streaming |
1B-10B params | 8-40GB | 0.5GB | +25% | Streaming |
>10B params | Memory Error | 0.5GB | N/A | Streaming Only |
Run the included benchmarks to evaluate performance on your hardware:
```bash
# Core quantization benchmarks
hilbert-benchmark --quick                    # Basic performance test
hilbert-benchmark --industry-comparison      # Compare with Pinecone, FAISS
hilbert-benchmark --large-scale --size 1GB   # Scalability testing

# Video storage benchmarks
python examples/video_frame_ordering_demo.py               # Frame ordering optimization
python examples/temporal_compression_optimization_demo.py  # Compression analysis

# HuggingFace integration benchmarks
python examples/huggingface_video_encoder.py --benchmark   # Model encoding performance
python examples/model_similarity_search_demo.py            # Cross-model similarity
python examples/search_performance_comparison.py           # Search method comparison

# Streaming processing benchmarks
python examples/streaming_huggingface_encoder.py --model bert-base-uncased --benchmark
python examples/streaming_vs_batch_comparison.py  # Memory usage analysis
python examples/streaming_memory_benchmark.py     # Large model processing

# Hybrid search benchmarks
python examples/hybrid_search_demo.py             # Multi-method comparison
python examples/parallel_video_search_demo.py     # Parallel processing performance
```
Core Performance:
- Quantization speed and compression ratios
- Search accuracy vs industry standards (Pinecone, FAISS)
- Memory usage and scalability limits
Video Storage:
- Temporal compression benefits (4-8% improvement)
- Frame ordering optimization impact
- Video codec performance comparison
HuggingFace Integration:
- Parameter extraction speed across model architectures
- Cross-model similarity accuracy
- Model registry and metadata performance
Streaming Processing:
- Memory efficiency for large models (93% reduction)
- Processing speed vs batch methods
- Chunk size optimization analysis
Search Methods:
- Hierarchical vs video features vs hybrid accuracy
- Parallel processing scalability
- Temporal coherence impact on results
```python
from hilbert_quantization import HilbertQuantizer, CompressionConfig

# Custom configuration
config = CompressionConfig(
    quality=0.8,              # Higher quality = better accuracy, larger size
    preserve_index_row=True,  # Preserve important structural information
)
quantizer = HilbertQuantizer(config=config)

# Performance tuning
quantizer.update_configuration(
    similarity_threshold=0.1,  # Lower = more results
    max_results=20,            # Maximum results to return
)
```
Hilbert Quantization combines multiple advanced techniques for optimal performance:
- Hilbert Curve Mapping: Maps high-dimensional parameters to 2D space while preserving spatial locality
- Hierarchical Indexing: Multi-level indices embedded directly in image representations for progressive filtering
- Video Compression: MPEG-AI compression with temporal coherence optimization for 4-8% additional compression
- Computer Vision Search: ORB features, template matching, and SSIM analysis for detailed similarity detection
- Streaming Processing: Layer-by-layer parameter extraction with constant memory usage
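The key property of Hilbert curve mapping is that consecutive 1D indices always land on adjacent 2D grid cells, so nearby parameters stay nearby in the image. A minimal standalone sketch of the standard index-to-coordinate conversion (independent of the package's internals):

```python
import numpy as np

def hilbert_d2xy(order: int, d: int):
    """Convert 1D Hilbert index d to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # Rotate the quadrant to keep the curve continuous
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def map_params_to_grid(params: np.ndarray, order: int) -> np.ndarray:
    """Lay a flat parameter vector onto a 2D grid along the Hilbert curve."""
    side = 1 << order
    grid = np.zeros((side, side), dtype=params.dtype)
    for d in range(min(params.size, side * side)):
        x, y = hilbert_d2xy(order, d)
        grid[y, x] = params[d]
    return grid
```

Because each step of the curve moves to a neighboring cell, locally correlated parameter runs form contiguous 2D patches, which is what makes the image representation compressible.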
```
HuggingFace Model → Streaming Parameter Extraction → Hilbert Curve Mapping
        ↓
Hierarchical Index Generation
        ↓
Video Frame Creation with Temporal Ordering
        ↓
MPEG Compression (8.2x smaller)
        ↓
Video Storage System
        ↓
Hybrid Search Engine (Video Features + Hierarchical Indices)
        ↓
Weighted Similarity Scoring with Temporal Coherence
        ↓
Ranked Results (3.1ms average)
```
Frame Ordering Optimization:
- Models stored as video frames ordered by hierarchical index similarity
- Temporal coherence analysis identifies optimal insertion points for new frames
- 4-8% compression improvement through intelligent frame sequencing
- Real-time optimization of existing video files without quality degradation
Multi-Modal Search:
- Video Features (60% weight): Computer vision algorithms for structural similarity
- Hierarchical Indices (40% weight): Fast spatial filtering for candidate selection
- Temporal Coherence: Neighboring frame analysis for context-aware scoring
- Parallel Processing: Multi-threaded search across video files for performance
Memory-Efficient Architecture:
- Layer-by-layer parameter extraction without loading full models
- Constant O(1) memory usage regardless of model size (93% memory reduction)
- Chunk-based encoding with configurable sizes for optimal performance
- Real-time progress tracking and resume capability for interrupted processes
- API Reference - Complete API documentation with examples
- Quick Start Guide - Get started in minutes
- Complete Usage Guide - Comprehensive feature overview
- Video Features Guide - Video storage and search capabilities
- Temporal Compression Guide - Frame ordering optimization
- Video Search Examples - Multi-modal similarity search
- HuggingFace Guide - Model integration and parameter extraction
- Model Registry Examples - Model tracking and similarity search
- Cross-Architecture Search - Find similar models across architectures
- Streaming Guide - Memory-efficient processing
- Streaming Examples - Real-world streaming scenarios
- Memory Optimization - Large model processing strategies
- Performance Monitoring - System performance analysis
- Parallel Processing - Multi-threaded search optimization
- Configuration Management - Advanced configuration options
We welcome contributions! Please see our Contributing Guide for details.
```bash
git clone https://github.com/tylerlhess/hilbert-quantization.git
cd hilbert-quantization
pip install -e ".[dev]"
pre-commit install

pytest                # Run all tests
pytest -m "not slow"  # Skip slow tests
pytest --cov          # Run with coverage
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Hilbert Curves & Space-Filling Curves: Foundational research in spatial locality preservation
- MPEG Video Compression: Advanced compression techniques adapted for parameter storage
- Computer Vision Algorithms: ORB, SSIM, and template matching for similarity detection
- HuggingFace Transformers: Model architecture and parameter extraction methodologies
- Streaming Processing: Memory-efficient algorithms for large-scale model processing
- Vector Database Community: Performance optimization and indexing techniques
- Temporal Coherence Research: Video frame ordering and compression optimization methods
- Bug Reports
- Feature Requests
- Email Support
Made with ❤️ for the AI/ML community