
Python API support #3419

Open
nthykier opened this issue May 5, 2024 · 7 comments · May be fixed by #3425

@nthykier
Contributor

nthykier commented May 5, 2024

Hi,

Would you be open to supporting a public, stable Python API for codespell? Ideally, for me, one where I as a consumer can feed codespell the words I want spellchecked and get back a verdict of valid or invalid together with the available corrections, plus, for auto-correction, which candidate to use (or "not safe for auto-correcting" if that is the data available).

My use case is that I am working on a Language Server (LSP, https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification) and I want to provide spell checking. So far, I have relied on hunspell because it had Python bindings but its accuracy on technical documents with the dictionary I found leaves a bit to be wanted.

In my use case, being able to identify the exact range of the problem is an absolute necessity, as my code needs to provide the text range to the editor so that it can show the user exactly where in the open document the problem is. If word-by-word checking is not supported, then the API could instead be passed lines of text to be spellchecked, provided the result identifies exactly where in the line the problem is (that is, I need a start + end index); a hypothetical sketch follows below.
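
To make that concrete, here is a hypothetical sketch of the result shape I have in mind (all names are mine and purely illustrative; nothing like this exists in codespell today):

from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class SpellingIssue:
    start: int                      # offset of the misspelled word within the line
    end: int                        # offset one past its last character
    word: str                       # the misspelled word, as found
    candidates: Sequence[str]       # available corrections, possibly several
    auto_correction: Optional[str]  # None when not safe for auto-correcting

def spellcheck_line(line: str) -> Sequence[SpellingIssue]:
    """Return all spelling issues found in `line` (empty when clean)."""
    ...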

In this case, I would probably also need the API docs to say a bit about why line-by-line text is important, because I might need to extract the underlying text from formatting to create a synthetic line to be spellchecked. As an example, my current code tries to identify and hide common code/file references like usr/share/foo. In a word-by-word check, I just skip the spellcheck call for that word. But if I need to pass a line of text to codespell, I would need to remove the ignored part, and here it is relevant to know how to do that replacement such that the user does not get a false positive because I attempted to avoid another false positive.

Alternatively, a parser for the dictionaries plus the underlying dictionaries might also be an option as a "lightweight API", assuming they are easier to keep stable.

I have noted that codespell can do spell checking from stdin to stdout. However, that is a bit too heavy-handed for me to easily replace my hunspell integration.

That is my wishlist. :) Would this be something that you would be open to supporting?

Note: By stable API, I assumed nothing was stable, since __all__ = ["_script_main", "main", "__version__"] does not look like it contains reusable APIs.

@DimitriPapadopoulos
Collaborator

DimitriPapadopoulos commented May 5, 2024

Exposing a function to check words or lines, one that is actually used by codespell itself, might affect performance. On the other hand:

Do you agree refactoring is a necessary step to expose functions that could be added to a public API later on? If so, could you have a look at #3276, see if it is a step in your direction, and create a new pull request from it? A test to measure performance would be a great addition. I am not the maintainer, but I think these steps might convince @larsoner.

I think you would need to refactor near the code that starts line processing:

for i, line in enumerate(lines):

and near the code that splits lines into words:
check_matches = extract_words_iter(line, word_regex, ignore_word_regex)
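
Very roughly, the per-line work could be pulled out into a function that both parse_file and a future public API call. A sketch, meant to sit next to the existing code so extract_words_iter is in scope (the signature and the shape of misspellings are my guesses, not the actual codespell internals):

def check_line(line, misspellings, word_regex, ignore_word_regex):
    """Return (start, end, word, entry) tuples for each hit in `line`."""
    issues = []
    for match in extract_words_iter(line, word_regex, ignore_word_regex):
        word = match.group()
        # Assumed: `misspellings` maps lowercased typos to their entries.
        entry = misspellings.get(word.lower())
        if entry is not None:
            issues.append((match.start(), match.end(), word, entry))
    return issues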

The lightweight API to access the dictionaries sounds like a good alternative indeed.

@nthykier
Contributor Author

nthykier commented May 5, 2024

Thanks for the feedback and the suggestions.

I could be convinced to do a PR for this change, but I would prefer to have alignment up front on what the expectations would be, to determine whether it is within my capacity and matches the amount of time I am willing to invest. I am hoping we can meet in the middle, obviously, but I would rather know upfront so there is no disappointment on either side down the line. For example, if there is going to be a performance test, what are the expectations on "before vs. after" performance, and what kind of corpus would you consider a "valid performance test" (are we talking micro-benchmarks or real-life use cases or both; which versions of Python count; etc.)?

For clarity, I read phrases like "would be nice" or "a great addition" as "optional and would not prevent merge if omitted".

I guess that means I am waiting for @larsoner to respond on the expectations. :)

@larsoner
Member

I think as long as the changes are reviewable and there isn't much of a performance hit, we're okay. Something like a 10% performance hit would be kind of a bummer, but I would also be surprised if there were one, based on briefly thinking about the code.

One approach would be to make a quick attempt at this with the necessary refactoring and some new Python API, and some basic tests in a new codespell_lib/tests/test_api.py or similar. Hopefully that's not too much work based on what @DimitriPapadopoulos outlined above? And then we can check to make sure performance isn't too bad, then iterate on the API as proposed in the PR.

@DimitriPapadopoulos
Collaborator

@larsoner I wonder whether this will eventually require some file/directory renaming. All files currently live under codespell_lib. However, Python projects usually put their code under codespell or src/codespell. I wonder why this is not the case here, and whether we should change that.

@larsoner
Member

Can't remember why it was codespell_lib. It could be a reasonable time to mkdir src && git mv codespell_lib src/codespell and adjust pyproject.toml. I think to change this we'd probably want the release to be 3.0, but that would be okay.

It can/should be done in a separate PR, though.

@nthykier
Contributor Author

nthykier commented May 17, 2024

OK, thanks for the input so far. Up to a 10% performance hit is quite a lot of leeway indeed, and like larsoner, I doubt we will hit it. Nevertheless, I will establish a baseline that I will use.

I did not see any references to how we decide the baseline. For now, I will be using https://sherlock-holm.es/stories/plain-text/cano.txt as a base text. Since what we are concerned with is the cost of refactoring into having a spellcheck_word function that is called per word (consider the name a description more than a target at this point), I did not think it relevant to look at different file types. Let me know if you have reason to believe otherwise.

I am describing my method below. Ideally, none of this would be new to you if you have prior experience with performance measurements of this kind.

The setup:

mkdir performance-corpus/
cd performance-corpus
wget https://sherlock-holm.es/stories/plain-text/cano.txt
# Create some extra copies (a single file takes less than 0.5 second; adding a bit more makes it easier to test)
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

# Set up `_version.py`, so we can run directly from a git checkout.
echo '__version__ = "do not crash, please"' > codespell_lib/_version.py

The resulting corpus is 79329 bytes. The wc command-line tool reports that each file has 76764 lines (including many empty ones) and 657438 words (by wc's definition of a word). The codespell code flags about 175 words as possible typos in this text (that is, per file).

The Python version I used for these numbers:

$ python3 --version
Python 3.11.9

Running the test:

# Test, repeated at least 3 times; the fastest result is chosen.
# Slower times are usually caused by unrelated jitter - at least on
# laptops/desktops that also do other work.
time PYTHONPATH=. python3 -m codespell_lib performance-corpus/ >/dev/null

# One run with `-m profile` to see where the bottlenecks are.
# Takes considerably longer (x10-x20), so it is not done on the full corpus
PYTHONPATH=. python3 -m profile -o baseline.pstat  -m codespell_lib performance-corpus/cano.txt

Baseline results

On my hardware, the performance test on the baseline is recorded as 5.6 seconds (0m5.600s).

The bottleneck profiling has the following key numbers as far as I can tell:

  • Total time 3.030s as baseline (2196160 function calls (2195259 primitive calls) in 3.030 seconds)
  • The code spends 0.503s (~1/6) in build_dict (split into 2 calls)
  • The code spends 2.472s (~5/6) in parse_file in one call [only one file was examined].
  • About half of parse_file's runtime is spent in parse_file itself (1.178/2.472). The other half is then assumed to come from the functions it calls.
  • The time spent in "".lower() is 0.433s (~1/6), and another ~1/6 is spent in regex operations (group() + search() + finditer()).

These numbers were extracted by an interactive session with the following commands (as examples):

import pstats
s = pstats.Stats("baseline.pstat")
s.sort_stats(2).print_stats(20)  # legacy numeric key 2 = sort by cumulative time; show top 20
s.print_callees('parse_file')    # break down where parse_file's time goes

Types of corrections

Per file, we get 175 suggestions. These are split into the following groups:

  • Cased (capitalized or all upper): 41 (~23%)
  • Non-cased: 134 (~77%)
  • Corrections with multiple candidates: 77 (44%)
    • Cased with multiple candidate corrections: 37 (21% of the 175 corrections)
    • Non-cased with multiple candidate corrections: 40 (~23% of the 175 corrections)
  • Corrections with reason: 0

Some ratios:

  • Cased / non-cased: ~30% (41/134).
    • Strong bias towards non-cased (seems reasonable since most words are lower-cased)
  • Multi-candidate cased / multi-candidate non-cased corrections: 93% (37/40).
    • This one is probably too close to 1:1. Given the strong bias towards non-cased, one would expect a similar bias here. This is a potential weakness with the chosen corpus.
  • Multiple candidates / Single candidate corrections: 79% (77/98).
    • This one is close to 1:1, so we get decent coverage of both cases.

nthykier added a commit to nthykier/codespell that referenced this issue May 17, 2024
When the spelling dictionaries are loaded, the correction line was
previously stored in memory as plain text. Throughout the code,
callers would then have to deal with the `data` attribute and
correctly `split()` + `strip()` it. With this change, the dictionary
parsing code now encapsulates this problem.

The auto-correction works from the assumption that there is only one
candidate. This assumption is an invariant and seems to be properly
maintained in the code. Therefore, we can just pick the first
candidate word when doing a correction.

In the code, the following name changes are performed:

 * `Misspelling.data` -> `Misspelling.candidates`
 * `fixword` -> `candidates` when used for multiple candidates
   (`fixword` remains for when it is a correction)

On performance:

Performance-wise, this change moves computation from "checking" time
to "startup" time. The performance cost does not appear to be
noticeable in my baseline (codespell-project#3419). Though, keep in
mind the corpus weakness regarding the ratio of cased vs. non-cased
corrections with multiple candidates.

The all-lowercase typo is now slightly more expensive (it was passed
through `fix_case` and fed directly into the `print` in the original
code; in the new code, it always needs a `join`). There is still a
preponderance of lower-case-only corrections in general, so the
unconditional `.join` alone is not enough to affect the performance
noticeably.
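
A hypothetical reconstruction of the refactored type, for illustration only (the field names follow the commit message; everything else is a guess):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Misspelling:
    candidates: Tuple[str, ...]  # parsed once, at dictionary-load time
    fix: bool                    # is it safe to auto-correct?
    reason: str                  # optional explanation, "" if none

def correction_for(entry: Misspelling) -> str:
    # Auto-correction relies on the invariant that exactly one
    # candidate exists when `fix` is set, so pick the first one.
    assert entry.fix and len(entry.candidates) == 1
    return entry.candidates[0]
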
nthykier added a commit to nthykier/codespell that referenced this issue May 17, 2024
The changes to provide a public API had a performance-related cost of
about 1% runtime. There is no trivial way to offset this any further
without undermining the API we are building. However, we can pull
performance-related shenanigans to compensate for the cost introduced.

The codespell codebase unsurprisingly spends the vast majority of its
runtime in various regex-related code such as `search` and `finditer`.

The best way to optimize runtime spent in regexes is to not run the
regex in the first place, since the regex engine has a rather steep
overhead over plain string primitives (that is the cost of
flexibility). If the regex rarely matches and there is a simple static
substring that can be used to rule out a match, then you can speed up
the code by using `substring in string` as a conditional to skip the
regex. This assumes the regex is evaluated often enough for the
performance to matter.

An obvious choice here falls on the `codespell:ignore` regex, because
it has a very distinctive substring in the form of `codespell:ignore`,
which rules out almost all lines that cannot match.

With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus
mentioned in codespell-project#3419.
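
The trick itself fits in a few lines. This is an illustration of the pattern, not the actual codespell code (the real regex also captures the words listed after the marker):

import re

_IGNORE_RE = re.compile(r"codespell:ignore\b")

def has_ignore_marker(line: str) -> bool:
    # Cheap substring test first: the vast majority of lines do not
    # contain the marker, so they never reach the regex engine at all.
    if "codespell:ignore" not in line:
        return False
    return _IGNORE_RE.search(line) is not None
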
@nthykier nthykier linked a pull request May 17, 2024 that will close this issue
@DimitriPapadopoulos
Collaborator

DimitriPapadopoulos commented May 22, 2024

Perhaps it would be worth checking with more files, so that application startup time remains small compared to individual file checking.

I guess 20 files ought to be enough:

for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done
