RecursiveRules integration into OverlapRefinery #152

Open · wants to merge 4 commits into base: development

Sankgreall

As discussed in #150.

Key changes

  • Added a new "recursive" method alongside the existing "static" method for calculating overlap

  • The class now provides three methods for calculating overlap:

    • Exact (using tokenizer)
    • Approximate (using text length ratios)
    • Recursive (using hierarchical rules for natural boundaries)
  • min_tokens and rules have been added as new parameters; both are used only by the recursive method.

  • _find_boundary_with_rules(), _find_primary_boundary_context(), and _find_forward_boundary_context() are the key functions for applying RecursiveRules.

  • Split the refinement logic into separate methods for static and hierarchical approaches:

    • _refine_prefix_static() and _refine_suffix_static() for the original method
    • _refine_prefix_hierarchical() and _refine_suffix_hierarchical() for the new recursive method
  • Added a new _count_tokens() method
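
To make the idea of a hierarchical boundary search concrete, here is a minimal standalone sketch of how a prefix context could be chosen from the previous chunk. It is not the code in this PR: `find_prefix_context`, `DEFAULT_LEVELS`, and the whitespace-based `count_tokens` stand-in are hypothetical names, and the delimiter levels are an assumption rather than the actual RecursiveRules defaults.

```python
from typing import Callable, List, Sequence

# Hypothetical delimiter levels, coarsest first (an assumption, not RecursiveRules).
DEFAULT_LEVELS: List[Sequence[str]] = [
    ("\n\n",),           # paragraph breaks
    (". ", "! ", "? "),  # sentence ends
    (" ",),              # whitespace fallback
]


def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(text.split())


def find_prefix_context(
    prev_text: str,
    context_size: int,
    min_tokens: int = 1,
    levels: List[Sequence[str]] = DEFAULT_LEVELS,
    counter: Callable[[str], int] = count_tokens,
) -> str:
    """Return a tail of prev_text that starts on the coarsest usable boundary.

    Assumes context_size >= 1.
    """
    # A chunk smaller than min_tokens is returned whole.
    if counter(prev_text) < min_tokens:
        return prev_text

    for delimiters in levels:
        # Candidate start offsets: the position just after each delimiter.
        starts = set()
        for delim in delimiters:
            pos = prev_text.find(delim)
            while pos != -1:
                starts.add(pos + len(delim))
                pos = prev_text.find(delim, pos + 1)

        # Ascending order tries the longest tail first.
        for start in sorted(starts):
            tail = prev_text[start:]
            if tail.strip() and min_tokens <= counter(tail) <= context_size:
                return tail

    # No delimiter produced a fitting tail: keep the last context_size words.
    return " ".join(prev_text.split()[-context_size:])
```

With a previous chunk of two paragraphs, a generous `context_size` yields the whole last paragraph, a tighter one yields the last sentence, and a very small one falls through to a whitespace boundary.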

Tests

  • test_recursive_refinery_initialization verifies proper initialization with recursive mode parameters and validates error cases for invalid configurations.

  • test_recursive_refinery_with_rules checks that the refinery properly applies custom recursive rules to hierarchical chunks and maintains appropriate context sizes.

  • test_recursive_refinery_boundary_detection ensures the refinery correctly identifies paragraph breaks and other structural boundaries in hierarchical text.

  • test_recursive_refinery_whitespace_fallback validates that the refinery falls back to whitespace-based splitting when no explicit delimiters are found.

  • test_recursive_refinery_with_small_chunk confirms correct handling of chunks smaller than the minimum token size and verifies exact token counting. NOTE: default behaviour is to return the entire chunk in cases where the chunk is smaller than the specified min_tokens.

  • test_recursive_refinery_suffix_mode tests the refinery's suffix mode operation when adding context from subsequent chunks.

  • test_recursive_refinery_merge_context validates that context is properly merged into chunk text when the merge_context option is enabled.

  • test_recursive_refinery_empty_input verifies proper handling of empty input lists.

  • test_recursive_refinery_single_chunk ensures correct processing of single-chunk inputs without context.
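
The edge cases these tests cover (empty input, a single chunk, prefix versus suffix mode) can be illustrated with a small standalone sketch. It works on plain strings and uses a fixed word window as the context; `attach_context` and its helpers are hypothetical and are not the OverlapRefinery API or the test code in this PR.

```python
from typing import List, Optional, Tuple


def last_words(text: str, n: int) -> str:
    """Last n whitespace-separated words of text."""
    return " ".join(text.split()[-n:])


def first_words(text: str, n: int) -> str:
    """First n whitespace-separated words of text."""
    return " ".join(text.split()[:n])


def attach_context(
    texts: List[str], context_size: int = 3, mode: str = "prefix"
) -> List[Tuple[str, Optional[str]]]:
    """Pair each chunk text with context taken from its neighbour.

    prefix mode: context comes from the end of the previous chunk.
    suffix mode: context comes from the start of the next chunk.
    """
    if mode not in ("prefix", "suffix"):
        raise ValueError(f"unknown mode: {mode!r}")
    if context_size < 1:
        raise ValueError("context_size must be at least 1")

    result: List[Tuple[str, Optional[str]]] = []
    for i, text in enumerate(texts):
        context: Optional[str] = None
        if mode == "prefix" and i > 0:
            context = last_words(texts[i - 1], context_size)
        elif mode == "suffix" and i < len(texts) - 1:
            context = first_words(texts[i + 1], context_size)
        result.append((text, context))
    return result
```

An empty list comes back empty, and a single chunk has no neighbour to borrow from, so its context stays `None` in either mode.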

OverlapRefinery state diagram

The diagram below demonstrates the configuration options now available with OverlapRefinery.

stateDiagram-v2
    [*] --> Initialization
    
    state Initialization {
        [*] --> ValidateParams
        ValidateParams --> SetTokenizer: Has tokenizer
        ValidateParams --> ApproximateMode: No tokenizer
        
        SetTokenizer --> ExactCounting: approximate=False
        SetTokenizer --> ApproximateMode: approximate=True
        
        state ApproximateMode {
            UseCharRatio: Use char/token ratio
        }
        
        state ExactCounting {
            UseTokenizer: Use exact tokenizer
        }
    }
    
    state Processing {
        [*] --> ValidateChunks
        ValidateChunks --> MethodSelection
        
        state MethodSelection {
            [*] --> Static: method="static"
            [*] --> Recursive: method="recursive"
            
            state Static {
                [*] --> StaticPrefix: mode="prefix"
                [*] --> StaticSuffix: mode="suffix"
            }
            
            state Recursive {
                [*] --> RecursivePrefix: mode="prefix"
                [*] --> RecursiveSuffix: mode="suffix"
                
                state RecursivePrefix {
                    HierarchicalBackward: Find boundaries backwards
                }
                
                state RecursiveSuffix {
                    HierarchicalForward: Find boundaries forwards
                }
            }
        }
    }
    
    state ChunkHandling {
        [*] --> ChunkTypeCheck
        
        ChunkTypeCheck --> SemanticProcessing: SemanticChunk
        ChunkTypeCheck --> SentenceProcessing: SentenceChunk
        ChunkTypeCheck --> TokenProcessing: Basic Chunk
        
        state TokenProcessing {
            [*] --> ExactTokens: Has tokenizer
            [*] --> ApproximateTokens: No tokenizer
        }
    }
    
    Initialization --> Processing: Start refine()
    Processing --> ChunkHandling: For each chunk
    ChunkHandling --> [*]: Context added
    
    note right of ChunkHandling
        If merge_context=True:
        Update chunk text & indices
    end note
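
Two parts of the diagram, the choice between exact and approximate counting and the merge_context note, can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `make_counter`, `SimpleChunk`, and `merge_prefix_context` are hypothetical, the tokenizer is assumed to be any callable that returns a list of tokens, the 4-characters-per-token ratio is an arbitrary default, and the index update assumes the context is the text immediately preceding the chunk in the source document.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]


def make_counter(
    tokenizer: Optional[Tokenizer] = None,
    approximate: bool = False,
    chars_per_token: float = 4.0,
) -> Callable[[str], int]:
    """Pick a counting strategy the way the Initialization state does."""
    if tokenizer is not None and not approximate:
        # Exact counting: run the tokenizer and count its output.
        return lambda text: len(tokenizer(text))
    # Approximate counting: estimate from a character/token ratio.
    return lambda text: math.ceil(len(text) / chars_per_token)


@dataclass(frozen=True)
class SimpleChunk:
    text: str
    start_index: int
    end_index: int


def merge_prefix_context(chunk: SimpleChunk, context: str) -> SimpleChunk:
    """Prepend context to the chunk text and move start_index back to match."""
    new_start = max(0, chunk.start_index - len(context))
    return replace(chunk, text=context + chunk.text, start_index=new_start)
```

For suffix mode the mirror image would append the context and extend `end_index` instead.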

bhavnicksm and others added 4 commits January 8, 2025 23:03
[FEAT] Support `return_type` with `texts` output type
- Added private `_HierarchicalRefinery` class to `OverlapRefinery`
- Removed private class and merged with OverlapRefinery

- Defined tests for new recursive overlap functionality