MemoryWeave Benchmarking Guide

This guide explains how to run benchmarks to evaluate and compare MemoryWeave's retrieval capabilities, with a special focus on comparing the contextual fabric approach with traditional retrieval methods.

Unified Benchmark System

MemoryWeave now uses a unified benchmark system that can be configured through YAML files. This makes it easier to run consistent, reproducible benchmarks.

# Run a benchmark with its configuration file
uv run python run_benchmark.py --config configs/contextual_fabric_benchmark.yaml

# Override number of memories
uv run python run_benchmark.py --config configs/memory_retrieval_benchmark.yaml --memories 200

# Specify custom output file
uv run python run_benchmark.py --config configs/baseline_comparison.yaml --output my_results.json

# Enable debug output
uv run python run_benchmark.py --config configs/contextual_fabric_benchmark.yaml --debug
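
Each YAML file describes one benchmark run. The sketch below is purely illustrative; the field names mirror the customization examples later in this guide, and real configs may include additional options:

name: "Contextual Fabric Quick Check"
type: "contextual_fabric"
memories: 100
output_file: "benchmark_results/quick_check.json"
visualize: true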

Available Benchmarks

MemoryWeave includes several benchmarking tools:

  1. Contextual Fabric Benchmark: Compares contextual fabric against hybrid BM25+vector retrieval
  2. Memory Retrieval Benchmark: Evaluates different retrieval configurations
  3. Baseline Comparison: Compares against industry-standard BM25 and vector search
  4. Synthetic Benchmark: Tests performance on generated datasets with controlled properties

Running the Contextual Fabric Benchmark

The contextual fabric benchmark is the most comprehensive evaluation, specifically designed to showcase the advantages of the contextual fabric architecture over traditional retrieval methods.

Quick Run

# Run using the unified benchmark system
uv run python run_benchmark.py --config configs/contextual_fabric_benchmark.yaml

Legacy Script (For Backward Compatibility)

To run the contextual fabric benchmark with all memory sizes (20, 100, and 500) and generate visualizations using the legacy script:

# Run the benchmark script (handles everything)
./run_contextual_fabric_benchmark.sh

This script will:

  1. Run benchmarks with 20, 100, and 500 memories
  2. Save results to benchmark_results/contextual_fabric_*.json
  3. Generate visualizations in evaluation_charts/contextual_fabric_*/
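
If you prefer to stay on the unified system, the loop below is a rough equivalent of what the legacy script automates (illustrative only; the actual script may differ in output naming and extra steps, and it uses only flags documented above):

for n in 20 100 500; do
  uv run python run_benchmark.py \
    --config configs/contextual_fabric_benchmark.yaml \
    --memories "$n" \
    --output "benchmark_results/contextual_fabric_${n}.json"
done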

Understanding the Results

The contextual fabric benchmark evaluates several retrieval capabilities:

  1. Conversation Context: Can the system use conversation history to improve retrieval?
  2. Temporal Context: Can the system find memories based on temporal references?
  3. Associative Links: Can the system find related memories through associative links?
  4. Activation Patterns: Does the system boost recently used or important memories?
  5. Episodic Memory: Can the system retrieve memories from the same episode?
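
To make these categories concrete, each test case pairs a query with the memories it is expected to surface. The sketch below is hypothetical; the field names and dataset format are illustrative, not the benchmark's actual schema:

# Hypothetical temporal-context test case (illustrative only; not the real schema)
temporal_case = {
    "query": "What did I say about the project deadline yesterday?",
    "expected_memory_ids": [12, 47],   # memories created roughly one day before the query
    "capability": "temporal_context",  # which retrieval capability this case exercises
}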

For each test case, the benchmark compares:

  • Contextual Fabric Strategy: Our advanced approach using the contextual fabric architecture
  • Hybrid BM25+Vector Strategy: A baseline representing current industry-standard approaches

The main metrics reported are:

  • Precision: Percentage of retrieved results that are relevant
  • Recall: Percentage of relevant results that were retrieved
  • F1 Score: Harmonic mean of precision and recall
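
Concretely, for a single test case these metrics can be computed from the set of retrieved memory IDs and the set of relevant (expected) IDs, along these lines (a minimal sketch, not the benchmark's internal code):

def precision_recall_f1(retrieved_ids: set, relevant_ids: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for one query."""
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 3 of 5 retrieved results are relevant, out of 4 relevant memories total
print(precision_recall_f1({1, 2, 3, 8, 9}, {1, 2, 3, 4}))  # -> (0.6, 0.75, ~0.667)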

Running Other Benchmarks

Memory Retrieval Benchmark

This benchmark evaluates different retrieval configurations:

# Run the memory retrieval benchmark with the unified system
uv run python run_benchmark.py --config configs/memory_retrieval_benchmark.yaml

# Legacy method (deprecated)
uv run python -m benchmarks.memory_retrieval_benchmark

Baseline Comparison

Compare against industry-standard BM25 and vector search:

# Run with the unified benchmark system
uv run python run_benchmark.py --config configs/baseline_comparison.yaml

# Legacy method (deprecated)
uv run python run_baseline_comparison.py --dataset sample_baseline_dataset.json --config baselines_config.yaml

Synthetic Benchmark

Test performance on generated datasets with controlled properties:

# Run with the unified benchmark system
uv run python run_benchmark.py --config configs/synthetic_benchmark.yaml

# Legacy method (deprecated)
uv run python run_synthetic_benchmark.py

Visualizing Benchmark Results

For all benchmarks, you can create visualizations:

# Visualizations are automatically generated when running benchmarks
# To only visualize existing results:
python examples/visualize_results.py results.json

# Create comparison charts
python examples/visualize_results.py --compare result1.json result2.json --output comparison.png

Customizing Benchmarks

You can customize benchmark parameters by creating or modifying configuration files:

Example: Customizing the Contextual Fabric Benchmark

Create a new configuration file configs/contextual_fabric_custom.yaml:

name: "Contextual Fabric Custom"
type: "contextual_fabric"
description: "Custom contextual fabric evaluation with tuned parameters"
memories: 200
embedding_dim: 384
output_file: "benchmark_results/contextual_fabric_custom.json"
visualize: true
parameters:
  contextual_fabric_strategy:
    confidence_threshold: 0.1
    similarity_weight: 0.7
    associative_weight: 0.1
    temporal_weight: 0.1
    activation_weight: 0.1
    max_associative_hops: 2
  baseline_strategy:
    confidence_threshold: 0.1
    vector_weight: 0.4
    bm25_weight: 0.6
    bm25_b: 0.5
    bm25_k1: 1.5

Then run it:

uv run python run_benchmark.py --config configs/contextual_fabric_custom.yaml

Example: Customizing Memory Retrieval Configurations

Create a new configuration file configs/memory_retrieval_custom.yaml:

name: "Memory Retrieval Custom Configurations"
type: "memory_retrieval"
description: "Tests optimized configurations"
memories: 300
queries: 30
output_file: "benchmark_results/memory_retrieval_custom.json"
visualize: true
configurations:
  - name: "Precision-Focused"
    retriever_type: "components"
    confidence_threshold: 0.5
    semantic_coherence_check: true
    adaptive_retrieval: true
    use_two_stage_retrieval: true
    query_type_adaptation: true
  
  - name: "Recall-Focused"
    retriever_type: "components"
    confidence_threshold: 0.1
    semantic_coherence_check: false
    adaptive_retrieval: true
    use_two_stage_retrieval: true
    query_type_adaptation: true
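
Then run it the same way:

uv run python run_benchmark.py --config configs/memory_retrieval_custom.yaml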

Using the DynamicContextAdapter in Benchmarks

If you want to evaluate the newly implemented DynamicContextAdapter:

from memoryweave.components.dynamic_context_adapter import DynamicContextAdapter

# Create and configure the adapter
dynamic_adapter = DynamicContextAdapter()
dynamic_adapter.initialize({
    "adaptation_strength": 1.0,
    "enable_memory_size_adaptation": True
})

# In your benchmark loop, add this step before retrieval:
adaptation_context = dynamic_adapter.process_query(
    query, 
    {
        "memory_store": memory_store,
        "primary_query_type": query_type
    }
)

# Then use the adapted parameters in retrieval
retrieval_context = {
    "query": query,
    "query_embedding": query_embedding,
    **adaptation_context
}
results = retrieval_strategy.retrieve(query_embedding, top_k=5, context=retrieval_context)
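
To gauge how much the adapter contributes, one option is to run the same retrieval call without the adaptation context and compare precision, recall, and F1 across the two runs (a hypothetical A/B setup, not an existing benchmark mode):

# Hypothetical baseline call: identical retrieval without the adapted parameters
baseline_context = {"query": query, "query_embedding": query_embedding}
baseline_results = retrieval_strategy.retrieve(query_embedding, top_k=5, context=baseline_context)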

Troubleshooting

BM25 Warnings

When running benchmarks with synthetic data, you may see warnings like:

WARNING:root:BM25 retrieval failed: minimal could not be calculated, returning default

These are expected with synthetic test data and don't indicate a problem. The hybrid strategy will fall back to vector retrieval when BM25 indexing fails.

Memory Usage

Benchmarks with large memory sizes (500+) may require significant memory. If you encounter memory issues:

  1. Reduce the number of memories:

uv run python run_benchmark.py --config configs/contextual_fabric_benchmark.yaml --memories 200

  2. Run with a smaller embedding dimension (if configurable in your benchmark):

# In your config file
embedding_dim: 128  # Reduced from default 384