I benchmarked three branches (`main`, `experiments` and #4) with this patch on a file with 100k lines.

I got the best results with code from #4 using `AccurateSrxTextIterator`, clocked at 141.82s. `tokenize_text` from nlp_uk clocks at 54.58s. If I disable word segmentation (remove `-u -w`), nlp_uk clocks at 35.324s.

This code generates slightly fewer sentences on that dataset: 84452 from nlp_uk vs 84018 in choppa. See sentence-wise diff.
Here's the full log: https://gist.github.com/proger/2fa3ead52dc78b7d582cd356d9f423e9
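
For anyone who wants to reproduce the sentence-level comparison, here is a minimal sketch of how a sentence-wise diff between two segmenters' outputs could be produced with Python's standard `difflib`. It is not the script behind the numbers above: the `sentence_diff` helper and the toy sentences are made up for illustration, and it assumes each tool's output has already been loaded as a list with one sentence per element.

```python
import difflib


def sentence_diff(a, b, a_name="nlp_uk", b_name="choppa"):
    """Return unified-diff lines between two lists of sentences.

    Hypothetical helper: each list holds one sentence per element.
    n=0 drops context lines so only the differing sentences are shown.
    """
    return list(
        difflib.unified_diff(a, b, fromfile=a_name, tofile=b_name, lineterm="", n=0)
    )


# Toy data for illustration only, not taken from the benchmark.
nlp_uk_sentences = ["First sentence.", "Second sentence.", "Third one."]
choppa_sentences = ["First sentence.", "Second sentence. Third one."]

diff = sentence_diff(nlp_uk_sentences, choppa_sentences)

# Sentence counts per tool, followed by the places where the splits disagree.
print(len(nlp_uk_sentences), len(choppa_sentences))
for line in diff:
    print(line)
```

Lines prefixed with `-` are sentences that only the first segmenter produced, and lines prefixed with `+` are sentences that only the second one produced, so a missed split shows up as two `-` lines replaced by one `+` line.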