
Enhancement: --jobs to process files in parallel #3261

Draft: wants to merge 6 commits into main
Conversation

yarikoptic (Contributor)

On some large repositories I felt codespell should be faster if it at least parallelized across files, since it seems to be CPU bound. So I decided to test that and came up with this PR. Unfortunately it does not scale as well as I was hoping, but it does provide a 2-3x saving on the limited set of examples I have tried.

E.g. on a portion of gitlab I decided to check today -- it can take nearly 3 times less time with --jobs 5:
for j in 0 1 2 3 4 5 10 15 20; do echo $j; (builtin cd /home/yoh/proj/misc/gitlab/gitlab; time codespell -J $j app >/dev/null ); done
0
codespell -J $j app > /dev/null  9.90s user 0.13s system 99% cpu 10.026 total
1
codespell -J $j app > /dev/null  10.33s user 0.14s system 101% cpu 10.352 total
2
codespell -J $j app > /dev/null  11.16s user 0.22s system 198% cpu 5.723 total
3
codespell -J $j app > /dev/null  12.75s user 0.29s system 279% cpu 4.670 total
4
codespell -J $j app > /dev/null  14.25s user 0.38s system 356% cpu 4.107 total
5
codespell -J $j app > /dev/null  15.15s user 0.45s system 441% cpu 3.532 total
10
codespell -J $j app > /dev/null  20.05s user 1.10s system 559% cpu 3.781 total
15
codespell -J $j app > /dev/null  22.55s user 1.43s system 536% cpu 4.466 total
20
codespell -J $j app > /dev/null  24.11s user 1.71s system 479% cpu 5.389 total

I have not tried to figure out why scalability is not better -- CPU-wise, codespell easily consumes 100% CPU in every subprocess, yet run time does not scale down proportionally. Likely the overhead of one multiprocessing call per file is too high.
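One way to reduce that per-file dispatch overhead would be to hand workers batches of files rather than one file per task. Below is a hypothetical sketch using multiprocessing with a chunksize; check_file and the batching parameters are illustrative stand-ins, not codespell's actual implementation.

```python
# Hypothetical sketch of a --jobs-style parallel runner; check_file is a
# stand-in for codespell's per-file parsing, NOT the real function.
import multiprocessing


def check_file(path):
    # Pretend "issues" are just the count of 'x' characters in the path.
    return path, sum(1 for ch in path if ch == "x")


def check_files(paths, jobs):
    if jobs <= 1:
        # Serial fallback, equivalent to jobs being disabled.
        return [check_file(p) for p in paths]
    with multiprocessing.Pool(jobs) as pool:
        # A larger chunksize amortizes the per-task IPC overhead that
        # likely limits scaling when each file is dispatched individually.
        return list(pool.imap_unordered(check_file, paths, chunksize=16))


if __name__ == "__main__":
    results = check_files([f"file_{i}.txt" for i in range(100)], jobs=4)
    print(len(results))
```

Note that imap_unordered returns results as workers finish, so output ordering would differ from the serial run unless re-sorted.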

I have tried to keep the git history of commits somewhat clean, so if desired, the earlier refactoring commits could be submitted independently first, to refactor main and centralize the invocation of parse_file.

If we decide to proceed, we need to:

  • decide how to test: it could be just a sample invocation with some non-zero --jobs, or a matrix run on CI that defaults it to some number for all runs (if we, e.g., allow the default to be set via an environment variable instead of 0)
  • at the moment (if I just make the default jobs 1), testing also seems incompatible, since it operates through introspection of capsys-captured stderr of the current process and seems to lack the output from subprocesses
  • check whether we need to lock on printing to stderr and ensure flushing, so separate processes do not interleave printouts (not sure whether that can happen, but I do not see why not)
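The locking point above could look something like the following sketch: a shared Lock handed to each worker at pool creation, held across the write-plus-flush. The names (init_worker, print_issue, _stderr_lock) are illustrative, not from the codespell code base.

```python
# Hypothetical sketch of serializing stderr output across worker processes.
import multiprocessing
import sys


def init_worker(lock):
    # Store the shared lock in a module-level global for this worker process;
    # Pool's initializer runs once in each worker.
    global _stderr_lock
    _stderr_lock = lock


def print_issue(message):
    # Hold the lock for the whole write + flush so lines from different
    # processes cannot interleave.
    with _stderr_lock:
        sys.stderr.write(message + "\n")
        sys.stderr.flush()


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(print_issue, [f"file{i}: typo found" for i in range(8)])
```

Whether interleaving can actually occur depends on write sizes (small single writes to a pipe are atomic on most platforms), so this may turn out to be unnecessary; the lock is a belt-and-braces option.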

I would be happy to see your timings on some sample projects.

It might also be a good reason to include in this PR:

  • Refactoring of main() -- I had to simply disable some complexity checks in ruff. E.g.:
    • some repetitive patterns (e.g. print("ERROR: error message..."); print_usage(); return EX_USAGE) could be condensed/centralized in a helper (e.g. error_usage("error message")), if the entire function changes how it operates a little (e.g. raising an exception to be handled above, rather than returning, with a return there if so desired)
    • possibly extracting some logic into helper functions (e.g. build_dictionaries)
  • Refactor parse_file to separate display from logic -- currently prints sit straight in the code instead of being, e.g., collected/rendered by something (MVC pattern) -- this might eventually help someone refactor the tests to operate on actual results rather than introspecting stderr
  • a "desire" which would be easier to enable in this PR: collect timings per file and report abnormal files (e.g. ones taking a few times more than the median), since currently codespell sometimes just hangs on me while processing some "binaries" until I figure out which ones. We could report those at the end and thus facilitate the addition of skips.

That could actually be done in a separate PR if so desired, but overall it would reduce the complexity estimates for main and thus avoid the noqa in this PR.
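As a concrete illustration of the error_usage() idea: the repeated print-usage-return pattern collapses into one raise, with a single handler at the top of main(). This is only a suggested shape; UsageError and the argument handling here are hypothetical, while EX_USAGE and print_usage follow the names used in the text above.

```python
# Hypothetical sketch of centralizing the usage-error pattern via an exception.
import sys

EX_USAGE = 64  # conventional sysexits.h value for command-line usage errors


class UsageError(Exception):
    """Raised for invalid command-line usage; carries the error message."""


def print_usage():
    print("usage: codespell [options] [file ...]", file=sys.stderr)


def error_usage(message):
    # Replaces the repeated print(...); print_usage(); return EX_USAGE pattern.
    raise UsageError(message)


def main(argv):
    try:
        if not argv:
            error_usage("no files given")
        # ... rest of main() ...
        return 0
    except UsageError as exc:
        # Single place where the message, usage text, and exit code meet.
        print(f"ERROR: {exc}", file=sys.stderr)
        print_usage()
        return EX_USAGE
```

The win is that deeply nested option-validation code no longer needs to thread return codes back up through every caller.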
