
Enhancement: --jobs to process files in parallel #3261

Draft: wants to merge 6 commits into main
Conversation

yarikoptic (Contributor)

On some large repositories I felt codespell should be faster if it at least parallelized across files, since it seems to be CPU bound. So I decided to test that and came up with this PR. Unfortunately it does not scale as well as I was hoping, but it does provide a 2-3x saving on the limited set of examples I have tried.

E.g. on a portion of gitlab I decided to check today -- it can take nearly 3 times less time with --jobs 5:
for j in 0 1 2 3 4 5 10 15 20; do echo $j; (builtin cd /home/yoh/proj/misc/gitlab/gitlab; time codespell -J $j app >/dev/null ); done
0
codespell -J $j app > /dev/null  9.90s user 0.13s system 99% cpu 10.026 total
1
codespell -J $j app > /dev/null  10.33s user 0.14s system 101% cpu 10.352 total
2
codespell -J $j app > /dev/null  11.16s user 0.22s system 198% cpu 5.723 total
3
codespell -J $j app > /dev/null  12.75s user 0.29s system 279% cpu 4.670 total
4
codespell -J $j app > /dev/null  14.25s user 0.38s system 356% cpu 4.107 total
5
codespell -J $j app > /dev/null  15.15s user 0.45s system 441% cpu 3.532 total
10
codespell -J $j app > /dev/null  20.05s user 1.10s system 559% cpu 3.781 total
15
codespell -J $j app > /dev/null  22.55s user 1.43s system 536% cpu 4.466 total
20
codespell -J $j app > /dev/null  24.11s user 1.71s system 479% cpu 5.389 total

I have not tried to figure out why scalability is not better -- CPU-wise, codespell easily consumes 100% CPU in every subprocess, yet run time does not scale down proportionally. Likely the overhead of one multiprocessing call per file is too high.
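One way to reduce that per-file dispatch overhead would be to hand workers batches of files rather than one file per task. Below is a hypothetical sketch using multiprocessing with a chunksize; check_file and the batching parameters are illustrative stand-ins, not codespell's actual implementation.

```python
# Hypothetical sketch of a --jobs-style parallel runner; check_file is a
# stand-in for codespell's per-file parsing, NOT the real function.
import multiprocessing


def check_file(path):
    # Pretend "issues" are just the count of 'x' characters in the path.
    return path, sum(1 for ch in path if ch == "x")


def check_files(paths, jobs):
    if jobs <= 1:
        # Serial fallback, equivalent to jobs being disabled.
        return [check_file(p) for p in paths]
    with multiprocessing.Pool(jobs) as pool:
        # A larger chunksize amortizes the per-task IPC overhead that
        # likely limits scaling when each file is dispatched individually.
        return list(pool.imap_unordered(check_file, paths, chunksize=16))


if __name__ == "__main__":
    results = check_files([f"file_{i}.txt" for i in range(100)], jobs=4)
    print(len(results))
```

Note that imap_unordered returns results as workers finish, so output ordering would differ from the serial run unless re-sorted.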

I have tried to keep the git history of commits somewhat clean, so if desired, the earlier refactoring commits could be submitted independently first, to refactor main and centralize the invocation of parse_file.

If we decide to proceed, we need to:

  • decide how to test: it could be just a sample invocation with some non-zero --jobs, or a matrix run on CI that defaults it to some number for all runs (if we, e.g., allow the default to be set via an environment variable instead of 0)
  • at the moment (if I just make the default jobs 1), testing also seems incompatible, since it operates through introspection of capsys-captured stderr of the current process and seems to lack the output from subprocesses
  • check whether we need to lock on printing to stderr and ensure flushing, so separate processes do not interleave printouts (not sure whether that can happen, but I do not see why not)
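The locking point above could look something like the following sketch: a shared Lock handed to each worker at pool creation, held across the write-plus-flush. The names (init_worker, print_issue, _stderr_lock) are illustrative, not from the codespell code base.

```python
# Hypothetical sketch of serializing stderr output across worker processes.
import multiprocessing
import sys


def init_worker(lock):
    # Store the shared lock in a module-level global for this worker process;
    # Pool's initializer runs once in each worker.
    global _stderr_lock
    _stderr_lock = lock


def print_issue(message):
    # Hold the lock for the whole write + flush so lines from different
    # processes cannot interleave.
    with _stderr_lock:
        sys.stderr.write(message + "\n")
        sys.stderr.flush()


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(print_issue, [f"file{i}: typo found" for i in range(8)])
```

Whether interleaving can actually occur depends on write sizes (small single writes to a pipe are atomic on most platforms), so this may turn out to be unnecessary; the lock is a belt-and-braces option.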

I would be happy to see your timings on some sample projects.

It might also be a good reason to include in this PR:

  • Refactoring of main() -- I had to simply disable some complexity checks in ruff. E.g.:
    • some repetitive patterns (e.g. print("ERROR: error message..."); print_usage(); return EX_USAGE) could be condensed/centralized in a helper (e.g. error_usage("error message")), if the entire function changes how it operates a little (e.g. raising an exception to be handled above, rather than returning, with a return there if so desired)
    • possibly extracting some logic into helper functions (e.g. build_dictionaries)
  • Refactor parse_file to separate display from logic -- currently prints sit straight in the code instead of being, e.g., collected/rendered by something (MVC pattern) -- this might eventually help someone refactor the tests to operate on actual results rather than introspecting stderr
  • a "desire" which would be easier to enable in this PR: collect timings per file and report abnormal files (e.g. ones taking a few times more than the median), since currently codespell sometimes just hangs on me while processing some "binaries" until I figure out which ones. We could report those at the end and thus facilitate the addition of skips.

That could actually be done in a separate PR if so desired, but overall it would reduce the complexity estimates for main and thus avoid the noqa in this PR.
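As a concrete illustration of the error_usage() idea: the repeated print-usage-return pattern collapses into one raise, with a single handler at the top of main(). This is only a suggested shape; UsageError and the argument handling here are hypothetical, while EX_USAGE and print_usage follow the names used in the text above.

```python
# Hypothetical sketch of centralizing the usage-error pattern via an exception.
import sys

EX_USAGE = 64  # conventional sysexits.h value for command-line usage errors


class UsageError(Exception):
    """Raised for invalid command-line usage; carries the error message."""


def print_usage():
    print("usage: codespell [options] [file ...]", file=sys.stderr)


def error_usage(message):
    # Replaces the repeated print(...); print_usage(); return EX_USAGE pattern.
    raise UsageError(message)


def main(argv):
    try:
        if not argv:
            error_usage("no files given")
        # ... rest of main() ...
        return 0
    except UsageError as exc:
        # Single place where the message, usage text, and exit code meet.
        print(f"ERROR: {exc}", file=sys.stderr)
        print_usage()
        return EX_USAGE
```

The win is that deeply nested option-validation code no longer needs to thread return codes back up through every caller.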
