Enhancement: --jobs to process files in parallel #3261
Draft
On some large repositories, I felt codespell could be faster if it at least parallelized across files, since it seems to be CPU bound. So I decided to test that and came up with this PR. Unfortunately it does not scale as well as I was hoping, but it does seem to provide 2-3x savings on the limited set of examples I have tried.
E.g. a portion of gitlab I decided to check today can take nearly 3 times less time with `--jobs 5`.
I have not tried to figure out why scalability is not greater -- after all, CPU-wise codespell easily consumes 100% CPU on all subprocesses, yet run time does not scale down proportionally -- likely the overhead of a multiprocess call per file is too high.
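One common way to reduce that per-file dispatch overhead is to hand each subprocess a *chunk* of files per call instead of one file per call. This is only a minimal sketch using the standard library's `multiprocessing.Pool`; `parse_file` here is a hypothetical stand-in for codespell's per-file check, not its actual signature.

```python
import multiprocessing


def parse_file(path):
    # Hypothetical stand-in for codespell's per-file check:
    # returns the number of issues found in `path`.
    return 0


def check_files(paths, jobs):
    # Dispatch chunks of files per subprocess call rather than one file
    # per call, amortizing the inter-process serialization overhead that
    # likely limits scalability.
    chunksize = max(1, len(paths) // (jobs * 4))
    with multiprocessing.Pool(processes=jobs) as pool:
        return sum(pool.imap_unordered(parse_file, paths, chunksize=chunksize))
```

`imap_unordered` also lets fast files return without waiting on slow ones, which matters when file sizes vary widely.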
I have tried to keep the git history of commits somewhat clean, so if desired the earlier refactoring commits could be submitted independently first, to RF `main` to centralize invocation of `parse_file`.
If decided to proceed, we would need to:

0) settle on a default for `--jobs` -- or it could be a matrix run on CI which defaults it to some number for all the runs (if we e.g. allow the default to be set via an environment variable, to be used instead of `0`);
1) address testing, which also seems to be incompatible, since it operates through introspection of `capsys.stderr` of the current process and seems to be lacking the output from subprocesses.

I would be happy to see your timings on some sample projects.
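The environment-variable idea in item 0 could be sketched as below. The variable name `CODESPELL_JOBS` and the meaning of `0` ("no parallelization") are assumptions for illustration, not anything codespell currently defines.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(prog="codespell")
    # Hypothetical: let an environment variable override the built-in
    # default of 0 (taken here to mean "no parallelization"), so e.g.
    # a CI matrix can vary the job count without editing the command line.
    parser.add_argument(
        "--jobs",
        type=int,
        default=int(os.environ.get("CODESPELL_JOBS", "0")),
        help="number of worker processes (0 disables parallelism)",
    )
    return parser
```

An explicit `--jobs N` on the command line would still win over the environment variable, since `argparse` only falls back to the default when the option is absent.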
It might be a good reason to include into this PR:

- RF `main()` -- I had to simply disable some complexity checks from ruff. E.g. the repeated `print("ERROR: error message..."); print_usage(); return EX_USAGE` pattern could be condensed/centralized within a helper (e.g. `error_usage("error message")`) if the entire function changes the way it operates a little (e.g. does not `return` but raises an exception to be handled above with a `return`, if so desired).
- RF `build_dictionaries` and `parse_file` to separate display from logic -- right now `print`s sit straight in the code instead of being e.g. collected/rendered by something (an MVC pattern). This might even eventually help someone RF the tests, if it becomes desirable to operate on actual results rather than introspect stderr.
- `codespell` just hangs on me while processing some "binaries" until I figure out which ones. Here we could just report them at the end and facilitate the addition of skips.

That could actually be done in a separate PR if so desired, but overall it would reduce the complexity estimates for `main` and thus avoid `noqa` in this PR.
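The `error_usage` helper from the first bullet could look like this minimal sketch. `UsageError` and the way `main` catches it are assumptions about how the refactor might go, not codespell's actual code.

```python
import sys

EX_USAGE = 64


class UsageError(Exception):
    """Raised instead of returning EX_USAGE from deep inside main()."""


def error_usage(message):
    # Hypothetical helper condensing the repeated
    # print("ERROR: ..."); print_usage(); return EX_USAGE pattern.
    raise UsageError(message)


def main(args):
    try:
        if not args:
            error_usage("no files specified")
        return 0
    except UsageError as exc:
        # Single place where usage errors are rendered and converted
        # into the conventional exit code.
        print(f"ERROR: {exc}", file=sys.stderr)
        return EX_USAGE
```

Centralizing the exit path this way is also what would let ruff's complexity checks pass without `noqa`, since each error site shrinks to a single call.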