RecursiveRules integration into OverlapRefinery #152
As discussed in #150.
Key changes
Added a new "recursive" method alongside the existing "static" method for calculating overlap
The class now provides three methods for calculating overlap:
min_tokens and rules have been added as new parameters, but they are only used for recursive refining.
_find_boundary_with_rules(), _find_primary_boundary_context(), and _find_forward_boundary_context() are the key functions for applying RecursiveRules.
Split the refinement logic into separate methods for static and hierarchical approaches:
A new _count_tokens() method has been added.
Tests
test_recursive_refinery_initialization verifies proper initialization with recursive mode parameters and validates error cases for invalid configurations.
test_recursive_refinery_with_rules checks that the refinery properly applies custom recursive rules to hierarchical chunks and maintains appropriate context sizes.
test_recursive_refinery_boundary_detection ensures the refinery correctly identifies paragraph breaks and other structural boundaries in hierarchical text.
test_recursive_refinery_whitespace_fallback validates that the refinery falls back to whitespace-based splitting when no explicit delimiters are found.
test_recursive_refinery_with_small_chunk confirms correct handling of chunks smaller than the minimum token size and verifies exact token counting. NOTE: default behaviour is to return the entire chunk in cases where the chunk is smaller than the specified min_tokens.
test_recursive_refinery_suffix_mode tests the refinery's suffix mode operation when adding context from subsequent chunks.
test_recursive_refinery_merge_context validates that context is properly merged into chunk text when the merge_context option is enabled.
test_recursive_refinery_empty_input verifies proper handling of empty input lists.
test_recursive_refinery_single_chunk ensures correct processing of single-chunk inputs without context.
OverlapRefinery state diagram
The diagram below demonstrates the configuration options now available with OverlapRefinery.
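Separately from the diagram, the behaviour the tests describe (structural boundary detection, whitespace fallback, and returning the whole chunk when it is smaller than min_tokens) can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the names count_tokens and tail_context, the whitespace-based token counter, and the delimiter list are all assumptions made for illustration only.

```python
from typing import Sequence


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def tail_context(
    text: str,
    min_tokens: int,
    delimiters: Sequence[str] = ("\n\n", ". "),  # assumed priority order
) -> str:
    """Return trailing context from `text`, starting on a boundary if possible."""
    # Small chunk: return it whole, matching the NOTE in the test list.
    if count_tokens(text) <= min_tokens:
        return text

    # Try each delimiter in priority order (paragraph break first).
    for delim in delimiters:
        pos = text.rfind(delim)
        while pos != -1:
            candidate = text[pos + len(delim):]
            if count_tokens(candidate) >= min_tokens:
                return candidate
            # Tail is too short: step back to the previous occurrence.
            pos = text.rfind(delim, 0, pos)

    # Whitespace fallback: no usable delimiter, keep the last `min_tokens` words.
    return " ".join(text.split()[-min_tokens:])
```

The sketch prefers the latest boundary that still yields at least min_tokens tokens, which is one plausible reading of "maintains appropriate context sizes"; the actual selection logic in _find_boundary_with_rules() may differ.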