Optimize _byte_pair_merge to O(m log n) using heap-based candidate selection #442
Summary

-> Replaced the `O(m·n)` sequential merge scan with a heap-driven algorithm that maintains candidate merges in a max-heap keyed by rank, updating only local neighbors on each merge.
-> This yields `O(m·log n)` behavior, where `m` is the number of merges and `n` is the number of initial symbols.

Key changes:

- `_byte_pair_merge`, maintaining a linked list of live nodes and per-position versions to avoid stale heap entries.
- `compute_rank_at` and updates only affected neighbors after each merge.
- `_byte_pair_merge` boundaries.

Complexity:

- Before: repeated linear scans → approximately `O(m·n)` in worst-case merges.
- After: heap operations per merge → `O(m·log n)`, with `O(n)` initialization.
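
For readers who want to see the idea in isolation, here is a minimal Python sketch of a heap-driven merge loop with a linked list of live nodes and per-position version stamps. It is not the code from this PR: the function name `byte_pair_merge_heap`, the helper `candidate`, and the `ranks` dictionary are hypothetical, and the sketch assumes that lower rank means higher merge priority with ties going to the leftmost pair. Python's `heapq` is a min-heap, so ranks are pushed directly; an implementation built on a max-heap would need to invert the key ordering.

```python
import heapq


def byte_pair_merge_heap(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Greedily merge adjacent tokens, always taking the lowest-rank pair."""
    n = len(piece)
    if n == 0:
        return []

    tokens = [piece[i:i + 1] for i in range(n)]  # token owned by node i
    prev = [i - 1 for i in range(n)]             # linked list of live nodes
    nxt = [i + 1 for i in range(n)]              # n means "no right neighbor"
    version = [0] * n                            # bumped when a node changes or dies

    def candidate(i):
        """Heap entry for the pair (i, nxt[i]), or None if it is not mergeable."""
        j = nxt[i]
        if j >= n:
            return None
        rank = ranks.get(tokens[i] + tokens[j])
        if rank is None:
            return None
        return (rank, i, version[i], version[j])

    # O(n) initialization: collect all initial candidates, then heapify once.
    heap = []
    for i in range(n - 1):
        entry = candidate(i)
        if entry is not None:
            heap.append(entry)
    heapq.heapify(heap)

    while heap:
        _rank, i, vi, vj = heapq.heappop(heap)
        j = nxt[i]
        # Skip stale entries: either side was merged after this entry was pushed.
        if version[i] != vi or j >= n or version[j] != vj:
            continue

        # Merge node j into node i and unlink j from the list.
        tokens[i] += tokens[j]
        k = nxt[j]
        nxt[i] = k
        if k < n:
            prev[k] = i
        version[i] += 1
        version[j] += 1  # marks j as dead for any entry that still mentions it

        # Only the two pairs touching the merged node need fresh ranks.
        for p in (prev[i], i):
            if p >= 0:
                entry = candidate(p)
                if entry is not None:
                    heapq.heappush(heap, entry)

    # Node 0 is never absorbed, so walking from it visits every live node.
    out = []
    i = 0
    while i < n:
        out.append(tokens[i])
        i = nxt[i]
    return out
```

Each merge pushes at most two new entries, so the heap holds at most `n - 1 + 2m` entries and every push or pop costs `O(log n)`. The sketch ignores the cost of concatenating and hashing byte strings, which a production implementation would typically avoid by working with index ranges into the original buffer.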