Benchmarking against nlp_uk #5

proger · 2022-10-15T15:02:41Z

I benchmarked three branches (main, experiments and #4) with this patch on a file with 100k lines.

I got the best choppa results with the code from #4 using AccurateSrxTextIterator, which clocked in at 141.82s. tokenize_text from nlp_uk clocks in at 54.58s. If I disable word segmentation (remove -u -w), nlp_uk clocks in at 35.324s.
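To put those timings side by side, here is a small sketch that turns them into throughput and relative slowdown. It only uses the numbers quoted above; the helper names (`lines_per_second`, `slowdown`) are made up for this illustration, and it assumes the 100k-line file has exactly 100,000 lines.

```python
# Timings quoted above, in seconds, for the 100k-line input file.
LINES = 100_000  # assumption: "100k lines" taken as exactly 100,000

timings = {
    "choppa #4 (AccurateSrxTextIterator)": 141.82,
    "nlp_uk tokenize_text": 54.58,
    "nlp_uk tokenize_text without -u -w": 35.324,
}


def lines_per_second(seconds: float, lines: int = LINES) -> float:
    """Throughput in input lines per second."""
    return lines / seconds


def slowdown(seconds: float, baseline: float) -> float:
    """How many times longer a run took than the baseline run."""
    return seconds / baseline


baseline = timings["nlp_uk tokenize_text"]
for name, seconds in timings.items():
    print(
        f"{name}: {lines_per_second(seconds):.0f} lines/s, "
        f"{slowdown(seconds, baseline):.2f}x the nlp_uk time"
    )
```

By this arithmetic the #4 branch takes roughly 2.6 times as long as nlp_uk with word segmentation, and roughly 4 times as long as nlp_uk without it.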

This code generates slightly fewer sentences on that dataset: 84452 from nlp_uk vs 84018 in choppa (434 fewer). See the sentence-wise diff.
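For anyone who wants to reproduce a sentence-wise comparison locally, a minimal sketch with the standard library's `difflib` might look like this. It assumes each tool's output has already been loaded as a list with one sentence per item; the toy sentences and the `sentence_diff` helper are illustrative only and are not taken from the benchmark data or from this patch.

```python
import difflib


def sentence_diff(reference, candidate, ref_name="nlp_uk", cand_name="choppa"):
    """Return unified-diff lines between two lists of sentences."""
    return list(
        difflib.unified_diff(
            reference,
            candidate,
            fromfile=ref_name,
            tofile=cand_name,
            lineterm="",  # sentences carry no trailing newline
            n=0,          # no context lines, only the differences
        )
    )


# Toy data: the second segmenter merges two sentences into one.
nlp_uk_sentences = ["One.", "Two.", "Three."]
choppa_sentences = ["One.", "Two. Three."]

diff = sentence_diff(nlp_uk_sentences, choppa_sentences)
print("\n".join(diff))

# Gap in sentence counts reported above.
print(84452 - 84018)
```

A merged sentence shows up as two removed lines and one added line, which is the kind of pattern a lower sentence count would produce.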

Here's the full log: https://gist.github.com/proger/2fa3ead52dc78b7d582cd356d9f423e9

- --line-by-line is now opt-in
- options to choose iterators and their hyperparams for benchmarking
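As a rough illustration of what an opt-in flag plus iterator options can look like, here is an `argparse` sketch. Only `--line-by-line` comes from this patch; `--iterator`, its choices, and `--buffer-length` are hypothetical names invented for the example and may not match the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark CLI sketch (illustrative)")
    # Opt-in: without the flag, the whole text is segmented at once.
    parser.add_argument(
        "--line-by-line",
        action="store_true",
        help="segment each input line separately",
    )
    # Hypothetical option: which iterator implementation to benchmark.
    parser.add_argument(
        "--iterator",
        choices=["default", "accurate"],
        default="accurate",
        help="iterator to use (names are assumptions)",
    )
    # Hypothetical hyperparameter for the chosen iterator.
    parser.add_argument(
        "--buffer-length",
        type=int,
        default=None,
        help="iterator buffer size (assumed parameter)",
    )
    return parser


args = build_parser().parse_args(["--line-by-line", "--iterator", "accurate"])
print(args.line_by_line, args.iterator, args.buffer_length)
```

Using `action="store_true"` makes the flag default to off, which matches the opt-in behaviour described in the list above.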

main: run segmenter on all text at once, add arguments (794a444):

- --line-by-line is now opt-in
- options to choose iterators and their hyperparams for benchmarking

proger changed the title ~~Benchmarking with more command line arguments~~ Benchmarking vs nlp_uk Oct 15, 2022

proger changed the title ~~Benchmarking vs nlp_uk~~ Benchmarking against nlp_uk Oct 15, 2022
