Python API support #3419

nthykier · 2024-05-05T06:46:57Z

Hi,

Would you be open to supporting a public stable python API for codespell. Ideally for me, one where I as a consumer can feed codespell with words I want spellchecked and then is given back a valid or invalid, here are the corrections available. If you auto-correct, use this choice (or "Not safe for auto-correcting" if that is the data available).

My use case is that I am working on a Language Server (LSP, https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification) and I want to provide spell checking. So far, I have relied on hunspell because it had Python bindings but its accuracy on technical documents with the dictionary I found leaves a bit to be wanted.

In my use case, being able to identify the exact range of the problem is an absolute necessity as my code need to provide the text range for the editor, such that it can show the user exactly where in the open document the problem is. If word-by-word checking is not supported, then API can be pass lines of text to be spellchecked provided the result identifies exactly in the range where the problem is (that is, I need start + end index).

In this case, I would probably also need the API docs to state a bit about why the line by line text is important, because I might need to extract the underlying text from formatting to create the a synthetic line to be spellchecked. As an example, my current code tries to identify and hide common code/file references like usr/share/foo. In a word-by-word check, I just skip the the spellcheck call for that word. But if I need to pass a line of text to codespell, I would need to removed the ignored and here it is relevant to know how to do that replacement such that the user does not get a false positive because I attempted to avoid another false-positive.

Alternatively, a parser for the dictionaries plus the underlying dictionaries might also be an option a "light weight API" assuming they are easier to keep stable.

I have noted that codespell can do spell checking from stdin to stdout. However, that is a bit too heavy handed for me to easily replace my hunspell integration.

That is my wishlist. :) Would this be something that you would be open to supporting?

Note: By stable API, I assumed nothing was stable since __all__ = ["_script_main", "main", "__version__"] does not look it contains reusable APIs.

The text was updated successfully, but these errors were encountered:

DimitriPapadopoulos · 2024-05-05T09:14:01Z

Exposing a function to check words or lines, one that is actually used by codespell itself, might affect performance. On the other hand:

the performance hit is just speculation without actual tests to measure performance,
exposing functions would probably reduce the complexity of parse_file(), a long-overdue target (see Some refactoring in code_spell (extract method for long method code smell) #3276).

Do you agree refactoring is a necessary step to expose functions that could be added to a public API later on ? If so, could you have a look at #3276, see if it's a step in your direction, and create a new pull request from it? A test to measure performance would be a great addition. I am not the maintainer, but I think these steps might convince @larsoner.

I think you would need to refactor near the code that starts line processing:

codespell/codespell_lib/_codespell.py

Line 967 in 0b09d75

for i, line in enumerate(lines):

and near the code that splits lines into words:

codespell/codespell_lib/_codespell.py

Line 989 in 0b09d75

check_matches = extract_words_iter(line, word_regex, ignore_word_regex)

Alternatively, the lightweight API to access dictionaries sounds like a good alternative indeed.

nthykier · 2024-05-05T10:34:34Z

Thanks for the feedback and the suggestions.

I could be convinced to do a PR for this change, but I would prefer to have alignment up front what the expectations would be to determine if that is within my capacity to deal with it and matches the amount of time I am willing to invest. I am hoping we can meet in the middle obviously, but I would rather know upfront so there is no disappointment on either side down the line. Like, if there is going to be a performance test, what is expectations on "before vs. after" performance, what kind of corpus would you consider a "valid performance test" (are we talking micro-benchmark or real-life use-cases or both; which versions of python count, etc.)

For clarity, I read phrases like "would be nice" or "a great addition" as "optional and would not prevent merge if omitted".

I guess that means I am waiting for @larsoner to respond on the expectations. :)

larsoner · 2024-05-15T15:13:58Z

I think as long as the changes are reviewable and there isn't much performance hit we're okay. Something like a 10% performance hit would be kind of a bummer. But I would also be surprised if there were one based on briefly thinking about the code.

One approach would be to make a quick attempt at this with the necessary refactoring and some new Python API, and some basic tests in a new codespell_lib/tests/test_api.py or similar. Hopefully that's not too much work based on what @DimitriPapadopoulos outlined above? And then we can check to make sure performance isn't too bad, then iterate on the API as proposed in the PR.

DimitriPapadopoulos · 2024-05-15T16:58:08Z

@larsoner I wonder whether this will eventually require some file/directory renaming. All files currently live under codespell_lib. However, Python modules usually put Python code under codespell or src/codespell. I wonder why this is not the case here, and whether we should change that.

larsoner · 2024-05-15T17:17:35Z

Can't remember why it was codespell_lib. It could be a reasonable time to mkdir src && git mv codespell src/ and adjust pyproject.toml. I think to change this we'd probably want the release to be 3.0 but that would be okay.

It can/should be done in a separate PR though

nthykier · 2024-05-17T07:17:00Z

Ok, thanks for the input so far. Up to 10% performance is quite a leeway indeed and like larsoner, I doubt we will hit it. Nevertheless, I will establish a baseline that I will use.

I did not see any references to how we decide the baseline. For now, I will be using https://sherlock-holm.es/stories/plain-text/cano.txt as a base text. Since what we are concerned with is the cost of refactoring into having a spellcheck_word function that is called per word (consider the name a description more than a target at this point), I did not think it relevant to look at different file types. Let me know if you have reason to believe otherwise.

I am describing my method below. Ideally, none of this would be new to you if you have prior experience with performance measurements of this kind.

The setup:

mkdir performance-corpus/
cd performance-corpus
wget https://sherlock-holm.es/stories/plain-text/cano.txt
# Create some extra copies (a single file takes less than 0.5 second; adding a bit more makes it easier to test)
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

# Setup `_version.py`, so we can run directly from git checkout.
echo '__version__ = "do not crash, please"' > codespell_lib/_version.py

The resulting corpus is 79329 bytes. The wc command line tool estimates that each file has 76764 lines (including many empty ones) and 657438 word (in wc definition of a word). The codespell code flags about 175 words as possible typos in this text (that is, per file).

The python version I used for these numbers:

$ python3 --version
Python 3.11.9

Running the test:

# Test, repeated at least 3 times, fastest result is chosen;
# Usually slower times are caused by unrelated jitter - at least on laptops/desktops that
# also does other work
time PYTHONPATH=. python3 -m codespell_lib performance-corpus/ >/dev/null

# One run with `-m profile` to see where the bottlenecks are.
# Takes considerably longer (x10-x20), so it is not done on the full corpus
PYTHONPATH=. python3 -m profile -o baseline.pstat  -m codespell_lib performance-corpus/cano.txt

Baseline results

On my hardware, the performance test on the baseline is recorded as 5.6 seconds (0m5.600s).

The bottleneck profiling has the following key numbers as far as I can tell:

Total time 3.030s as baseline (2196160 function calls (2195259 primitive calls) in 3.030 seconds)
The code spend 0.503s (~1/6) in build_dict (split into 2 calls)
The code spend 2.472s (~5/6) in parse_file in one call [only one file was examined].
About of half of parse_file's runtime is spend in parse_file (1.178/.2.472). The other half is then assumed to come from callers.
The time spent in "".lower() is 0.433s (~1/6) and another ~1/6 is spend in regex operations (group() + search() + finditer()).

These numbers were extracted by an interactive session with the following commands (as examples):

import pstats
s = pstats.Stats("baseline.pstat")
s.sort_stats(2).print_stats(20)
s.print_callees('parse_file')

Types of corrections

Per file, we get 175 suggestions. These are split into the following groups:

Cased (capitalized or all upper): 41 (~23%)
Non-cased: 134 (~77%)
Corrections with multiple candidates: 77 (44%)
- Cased with multiple candidate corrections: 37 (21% of the 175 corrections)
- Non-cased with multiple candidate corrections: 40 (~23% of the 175 corrections)
Corrections with reason: 0

Some ratios:

Cased / non-cased: ~30% (40/134).
- Strong bias towards non-cased (seems reasonable since most words are lower-cased)
Multiple candidates cased / Multi candidate non-cased corrections: 93% (37/40).
- This one is probably too close to 1:1. Given the strong bias towards non-cased, one would expect a similar bias here. This is a potential weakness with the chosen corpus.
Multiple candidates / Single candidate corrections: 79% (77/98).
- This one is close to 1:1, so we get decent coverage of both cases.

When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.

The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.

When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.

The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.

DimitriPapadopoulos · 2024-05-22T15:25:37Z

~~Perhaps it would be worth checking with more files, so that application startup time remains small compared to individual file checking.~~

I guess 20 files ought to be enough;

for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.

The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.

When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.

The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.

DimitriPapadopoulos added the enhancement label May 5, 2024

nthykier linked a pull request May 17, 2024 that will close this issue

Add Python spellchecking API #3425

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python API support #3419

Python API support #3419

nthykier commented May 5, 2024

DimitriPapadopoulos commented May 5, 2024 •

edited

Loading

nthykier commented May 5, 2024

larsoner commented May 15, 2024

DimitriPapadopoulos commented May 15, 2024

larsoner commented May 15, 2024

nthykier commented May 17, 2024 •

edited

Loading

DimitriPapadopoulos commented May 22, 2024 •

edited

Loading

Python API support #3419

Python API support #3419

Comments

nthykier commented May 5, 2024

DimitriPapadopoulos commented May 5, 2024 • edited Loading

nthykier commented May 5, 2024

larsoner commented May 15, 2024

DimitriPapadopoulos commented May 15, 2024

larsoner commented May 15, 2024

nthykier commented May 17, 2024 • edited Loading

The setup:

Running the test:

Baseline results

Types of corrections

DimitriPapadopoulos commented May 22, 2024 • edited Loading

DimitriPapadopoulos commented May 5, 2024 •

edited

Loading

nthykier commented May 17, 2024 •

edited

Loading

DimitriPapadopoulos commented May 22, 2024 •

edited

Loading