Benchmarking against nlp_uk #5

Open: wants to merge 1 commit into main

@proger (Contributor) commented Oct 15, 2022

I benchmarked three branches (main, experiments and #4) with this patch on a file with 100k lines.

The best result came from the code in #4 using AccurateSrxTextIterator, which clocked in at 141.82s. For comparison, tokenize_text from nlp_uk clocks in at 54.58s, or 35.324s if I disable word segmentation (remove -u -w).
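To put those timings side by side, here is a minimal sketch of the ratio arithmetic. The dictionary keys and the `slowdown` helper are illustrative names only and are not part of this patch; the numbers are the ones quoted above.

```python
# Wall-clock timings quoted above, in seconds.
# The labels are illustrative; they are not identifiers from the patch.
timings = {
    "choppa #4 (AccurateSrxTextIterator)": 141.82,
    "nlp_uk tokenize_text": 54.58,
    "nlp_uk tokenize_text without -u -w": 35.324,
}


def slowdown(candidate: float, baseline: float) -> float:
    """Return how many times longer `candidate` took than `baseline`."""
    return candidate / baseline


choppa = timings["choppa #4 (AccurateSrxTextIterator)"]
for name, seconds in timings.items():
    if seconds == choppa:
        continue  # skip comparing choppa with itself
    print(f"choppa takes {slowdown(choppa, seconds):.2f}x as long as {name}")
```

On these numbers choppa takes roughly 2.6 times as long as nlp_uk with word segmentation and roughly 4 times as long without it.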

This code generates slightly fewer sentences on that dataset: 84018 from choppa vs 84452 from nlp_uk, a difference of 434. See sentence-wise diff.
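For anyone who wants to reproduce a sentence-wise comparison, a minimal sketch using Python's standard `difflib` could look like the following. The `sentence_diff` helper and the toy sentences are hypothetical and are not taken from the dataset or from this patch; it only assumes each tool's output is available as a list with one sentence per item.

```python
import difflib


def sentence_diff(reference, candidate):
    """Return unified-diff lines between two lists of sentences."""
    return list(
        difflib.unified_diff(
            reference,
            candidate,
            fromfile="nlp_uk",
            tofile="choppa",
            lineterm="",  # inputs have no trailing newlines
        )
    )


# Toy data (invented for illustration): the second segmenter merges
# two sentences, so it produces one sentence fewer.
nlp_uk_sentences = ["Це перше речення.", "Це друге.", "А це третє."]
choppa_sentences = ["Це перше речення.", "Це друге. А це третє."]

diff_lines = sentence_diff(nlp_uk_sentences, choppa_sentences)
for line in diff_lines:
    print(line)

# Difference in sentence counts, analogous to 84452 - 84018 above.
count_gap = len(nlp_uk_sentences) - len(choppa_sentences)
print(f"sentence count difference: {count_gap}")
```

A merged pair shows up as two removed lines followed by one added line, which makes it easy to spot where the segmenters disagree.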

Here's the full log: https://gist.github.com/proger/2fa3ead52dc78b7d582cd356d9f423e9

- --line-by-line is now opt-in
- added options to choose iterators and their hyperparameters for benchmarking
@proger changed the title from "Benchmarking with more command line arguments" to "Benchmarking vs nlp_uk" on Oct 15, 2022
@proger changed the title from "Benchmarking vs nlp_uk" to "Benchmarking against nlp_uk" on Oct 15, 2022