Add Python spellchecking API #3425

nthykier · 2024-05-17T15:50:05Z

Adding a Spellchecker API

The proposed PR is a follow-up to #3419 that introduces a Spellchecker class. This PR deliberately does not close #3419, since the public package name should probably be changed first and an __all__ provided in its __init__.py. In its simplest form, a consumer would do something like this:

import re

# Per #3419, we may want to shuffle around the top-level name so leaving this one "undefined" at the moment
from ... import Spellchecker

# Loads "clear" + "rare" by default, use `Spellchecker(builtin_dictionaries=[...])` to choose.
s = Spellchecker()
text = """\
some
long text
with
tpyos
"""
# Provide a tokenizer (simple one for the sake of the example)
# The codespell tokenizer is a bit more involved, see `line_tokenizer_factory`
tokenizer = re.compile(r"[^ ]+").finditer
lines = text.splitlines()
for line in lines:
    s.spellcheck_line(line, tokenizer)
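To make the tokenizer contract concrete, here is a minimal sketch that only uses the standard library. It shows what the simple `finditer`-based tokenizer from the example yields for one line; it does not touch the `Spellchecker` itself, and `tokens_of` is a hypothetical helper name used purely for illustration.

```python
import re

# Same simple tokenizer as in the example: one token per run of non-space characters.
tokenizer = re.compile(r"[^ ]+").finditer


def tokens_of(line: str) -> list:
    """Return (word, start offset) for each token in the line (hypothetical helper)."""
    return [(match.group(), match.start()) for match in tokenizer(line)]


if __name__ == "__main__":
    for word, offset in tokens_of("long text with tpyos"):
        print(offset, word)
```

Each token is an `re.Match`, so the word and its position on the line are both available to whoever reports the typo.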

There is also a check_lower_cased_word method for single-word checks, which came out of the refactoring. I left it in because it has a simpler API if you have words you want checked rather than lines.

Reviewing this PR

I have optimized the PR for being reviewed one commit at a time, with the aim of making it clear how we got to the end result. It comes with the following commits:

$ git log origin/master..HEAD | git shortlog
Niels Thykier (11):
      Refactor: Move some code to new files for reuse
      Replace `data: str` with `candidates: Sequence[str]`
      Refactor dictionary into a new `Spellchecker` class
      Refactor line tokenization to simplify an outer loop
      Rewrite line spellchecking and move most of it into the `Spellchecker`
      De-indent loop body (whitespace-only change)
      Make `Spellchecker()` load builtin dictionaries by default
      Support non-regex based tokens for `spellcheck_line`
      Speed up spellchecking by ignoring whitespace-only lines
      Move `codespell:ignore` check into `Spellchecker`
      Speed up `codespell:ignore` check by skipping the regex in most cases

Each commit is written with the intent of being a small self-contained change. This also means that some commits just move code around or even seem a bit sub-optimal, with the following commits using them as stepping stones. If you see something in a commit that you want changed, please note that it might be improved in a later commit. I still kept this as one PR since some of the commits end up with a considerable performance regression in that commit, which is then resolved in later commits. This is all a trade-off between keeping commits as standalone and reviewable as possible vs. being able to cut the PR at any given point in time. I preferred the former.

Note that GitHub's PR review function (like Approve or Request Changes) operates on the entire PR even though you review it commit by commit.

How the API interfaces with codespell related activities

Dictionary loads

The API by default just loads the default builtin dictionaries. The constructor accepts a sequence/list of names such as clear and rare. After that, the caller can manually load custom dictionaries via load_dictionary_from_file. This is the default mode, aimed at letting API consumers get started quickly with the common case.

The codespell command is not the common case. Here, dictionaries can be loaded in any order. To facilitate that, the following flow can be used:

s = Spellchecker(builtin_dictionaries=None)
s.load_dictionary_from_file("my-early-loaded-custom-dictionary.txt")
s.load_builtin_dictionaries(["clear", "rare", "informal"])
s.load_dictionary_from_file("my-late-loaded-custom-dictionary.txt")
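Why the load order can matter: a minimal sketch, assuming that an entry loaded later replaces an earlier entry for the same typo. That override behaviour is an assumption made for illustration and is not stated in this PR; `ToyDictionary` and its `load` method are hypothetical stand-ins, not the `Spellchecker` class.

```python
from typing import Dict


class ToyDictionary:
    """Hypothetical stand-in for illustration; NOT the Spellchecker from this PR."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def load(self, mapping: Dict[str, str]) -> None:
        # Assumption: a later load overrides earlier entries for the same key.
        self.entries.update(mapping)


early_custom = {"teh": "tea"}
builtin = {"teh": "the", "tpyos": "typos"}
late_custom = {"tpyos": "typo's"}

toy = ToyDictionary()
toy.load(early_custom)  # loaded first, so the builtin entry for "teh" wins
toy.load(builtin)
toy.load(late_custom)   # loaded last, so it wins over the builtin "tpyos"
```

Under that assumption, an early-loaded custom dictionary provides defaults that the builtin dictionaries may replace, while a late-loaded one has the final say.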

Various exclusions or regex based matches/ignores

Exclusion features supported directly in the API:

Inline ignores (codespell:ignore). Fully handled by default, caller can opt-out.
Ignore words, ignore words cased. Partially supported. Caller must parse/provide these to the spellchecker either at initialization or the relevant API call.

Not supported directly by the API in its current form. Caller must facilitate these:

Exclude lines (-X). Caller should exclude the line instead.
All regex based matches / ignores. The caller should set these up and provide them in the tokenizer.
(The codespell tokenizer "factory" plus related default regexes could become part of the API if desired)
--uri-ignore-words-list: Hidden inside the tokenizer.
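As a rough illustration of "provide them in the tokenizer", the sketch below builds a tokenizer that drops every word falling inside a span matched by an ignore regex. The function name `make_tokenizer` and both regexes are assumptions chosen for the example; they are not the codespell defaults and not the `line_tokenizer_factory` mentioned earlier.

```python
import re
from typing import Callable, Iterator


def make_tokenizer(
    word_regex: "re.Pattern", ignore_regex: "re.Pattern"
) -> Callable[[str], Iterator["re.Match"]]:
    """Build a tokenizer that skips words inside ignored spans (hypothetical helper)."""

    def tokenizer(line: str) -> Iterator["re.Match"]:
        # Collect the (start, end) spans that must never be spellchecked.
        ignored = [match.span() for match in ignore_regex.finditer(line)]
        for match in word_regex.finditer(line):
            inside_ignored = any(
                start <= match.start() and match.end() <= end
                for start, end in ignored
            )
            if not inside_ignored:
                yield match

    return tokenizer


# Illustrative regexes only: plain words, and URLs as the thing to ignore.
tokenizer = make_tokenizer(re.compile(r"[A-Za-z']+"), re.compile(r"https?://\S+"))
words = [m.group() for m in tokenizer("see https://exmaple.com/tpyos for tpyos")]
```

The resulting callable still yields `re.Match` objects, so it has the same shape as the simple `finditer` tokenizer from the first example.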

What is to be included in the API

These are the items that I would see go into the __all__ of the codespell API in a later commit / PR.

Spellchecker (introduced in this PR). All its methods would become public API.
Misspelling (existing class). All its data fields would become public API (or at least, candidates + fix would be needed for me). Also, we should probably rename it to Correction or DictionaryResult, etc., since it is not a misspelling but the codespell correctional data.
- I would personally recommend that this class becomes read-only at an API level. However, that would require that codespell uses a different way to remember its interactively chosen corrections.
DetectedMisspelling (introduced in this PR). Subject to renaming of the class or its properties.
LineTokenizer + Token + their generic constraint (T) for typing purposes.

That is the basic API I envisioned and I would need. This API would still have a considerable amount of "bring your own" code, notably the tokenizer is something people will probably struggle with. My project has its own with its own rules for how things should be split into tokens (like ignoring quoted words), so this was not a goal for me. However, it could still be important for you, since it will affect more casual API consumers that did not solve this problem ahead of time. :)

An alternative to more API would be to have more examples of how to emulate the codespell command line tool.

Performance cost of the API

It surprised me a bit that at some point the total regression exceeded 10% on the corpus I used for testing (see #3419 for the details on the performance test setup). It was restored to an ~8% regression before the first Speed up commit.

The first Speed up commit reduces the regression to <= 1.5% (as in, the ~0.060s ballpark). The second Speed up commit was an "unnecessary show off" to get below the baseline. 😆

These speed up commits are technically "compensation" changes. As an example, I believe you would get a similar performance boost by retrofitting the second speed up commit onto the existing code base. The first speed up commit is partly a compensation, partly a counter to the regression, since the "overhead per line" got measurably higher (accordingly, reducing the number of redundant lines now makes the improvement considerably more worthwhile).

The end result is 5.6s -> 4.9s with an API and 1½-2 performance related compensations.

(Side note for GitHub: This also closes #3434.)

larsoner · 2024-05-24T16:59:34Z

@nthykier let me know when I should look. I at least just approved CI runs.

One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting for me to click "approve" on the runs).

No new code is introduced; only existing code is shuffled around and the functions moved are unchanged as well.

When the spelling dictionaries are loaded, the correction line was previously just stored in memory as plain text. Throughout the code, callers would then have to deal with the `data` attribute and correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem.

The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seems to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction.

In the code, the following name changes are performed:

* `Misspelling.data` -> `Misspelling.candidates`
* `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction)

On performance:

Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep in mind the corpus weakness regarding the ratio of cased vs. non-cased corrections with multiple candidates.

The all-lowercase typo is now slightly more expensive (it was passed through `fix_case` and fed directly into the `print` in the original code; in the new code, it will always need a `join`). There is still an overweight of lower-case-only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.
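The move from `data: str` to `candidates` can be pictured with a small sketch. It assumes the usual `typo->fix` dictionary line format, where a comma marks an entry as not auto-fixable and any text after the last comma is a free-text reason. `ToyMisspelling` and `parse_entry` are hypothetical names, and the parsing is a simplification rather than the code from this commit.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyMisspelling:
    """Hypothetical, read-only stand-in for the real Misspelling class."""

    candidates: Tuple[str, ...]
    fix: bool
    reason: str


def parse_entry(line: str) -> Tuple[str, ToyMisspelling]:
    """Split one dictionary line once, at load time (simplified assumption)."""
    key, data = line.strip().split("->", 1)
    data = data.strip()
    if "," in data:
        # Assumed convention: a comma disables auto-fixing, and whatever
        # follows the last comma is the (possibly empty) reason.
        fix = False
        data, reason = data.rsplit(",", 1)
        reason = reason.strip()
    else:
        fix = True
        reason = ""
    candidates = tuple(candidate.strip() for candidate in data.split(","))
    return key, ToyMisspelling(candidates, fix, reason)


key, entry = parse_entry("ot->to, of, or,")
suggestion_text = ", ".join(entry.candidates)  # what a report would print
```

Callers that auto-correct would read `candidates[0]`, while callers that only report need the `join`, which is the extra cost described in the commit message.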

This is as close to a 1:1 conversion as possible. It might change when we get to designing the API. The callers have been refactored to only perform the lookup once. This was mostly to keep the code more readable. The performance cost does seem noticeable, which is unsurprising. This method has a higher cost towards non-matches, which is the most common case. This commit causes the performance to drop roughly 10% on its own, and we are now slower than the goal.

The refactor is a stepping stone towards the next commit where the inner loop is moved to the `Spellchecker`.

With this rewrite, performance improved slightly and is now down to 7% slower than the baseline (6s vs. 5.6s). There is deliberately an over-indentation left in this commit, since that makes this commit easier to review (without ignoring space changes).

Deliberately in a separate commit. There are no functional changes, but there are some reformatting changes (line merges) as a consequence of the de-indent.

The `Spellchecker` only needs the `group` method from the `re.Match`. With a bit of generics and typing protocols, we can make the `Spellchecker` work with any token type that has a `group()` method. The `codespell` command line tool still assumes `re.Match` but it can get that via its own line tokenizer, so it all works out for everyone.
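A minimal sketch of the "generics and typing protocols" idea, using only the standard library. The names `Token`, `PlainToken`, `split_tokenizer` and `flag_typos` are chosen for illustration and are not taken from the PR; the only thing carried over from the text is that a token needs a `group()` method.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Protocol, Set, TypeVar


class Token(Protocol):
    """Anything with a group() method returning the matched word."""

    def group(self) -> str: ...


T = TypeVar("T", bound=Token)


@dataclass(frozen=True)
class PlainToken:
    """A non-regex token type that still satisfies the protocol."""

    word: str

    def group(self) -> str:
        return self.word


def split_tokenizer(line: str) -> Iterable[PlainToken]:
    return [PlainToken(word) for word in line.split()]


def flag_typos(
    line: str, tokenizer: Callable[[str], Iterable[T]], typos: Set[str]
) -> List[T]:
    """Return the caller's own token objects for words found in `typos`."""
    return [token for token in tokenizer(line) if token.group().lower() in typos]


plain_hits = flag_typos("some Tpyos here", split_tokenizer, {"tpyos"})
regex_hits = flag_typos("some tpyos here", re.compile(r"\S+").finditer, {"tpyos"})
```

Because the function is generic in `T`, a caller that passes an `re.Match` tokenizer gets `re.Match` objects back, and a caller with its own token class gets that class back.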

The new API has introduced extra overhead per line being spellchecked. One way of optimizing out this overhead is to spellcheck fewer lines. An obvious choice here is to optimize out empty and whitespace-only lines, since they will not have any typos at all (on account of not having any words). A side-effect of this change is that we now spellcheck lines with trailing whitespace stripped. Semantically, this gives the same result (per "whitespace never has typos"). Performance-wise, it is faster in theory because the strings are now shorter (since we were calling `.rstrip()` anyway). In practice, I am not sure we are going to find any real corpus where the trailing whitespace is noteworthy from a performance point of view. On the performance corpus from codespell-project#3419, this takes out ~0.4s of runtime, bringing us down to slightly above the 5.6s that made the baseline.
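The idea can be sketched in a few lines. `lines_worth_checking` is a hypothetical helper name, and the sketch only shows the filtering step, not how the PR wires it into the `Spellchecker`.

```python
from typing import Iterable, Iterator, Tuple


def lines_worth_checking(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, right-stripped line), skipping blank lines."""
    for lineno, line in enumerate(lines, start=1):
        stripped = line.rstrip()
        if not stripped:
            # Empty or whitespace-only: no words, hence no typos to find.
            continue
        yield lineno, stripped


kept = list(lines_worth_checking(["some", "", "   \t", "tpyos  \n"]))
```

Line numbers are taken before filtering, so any reported position still refers to the original input.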

This makes the API automatically avoid some declared false-positives that the command line tool would also filter.

The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compensate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spent in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.
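A minimal sketch of the substring guard described above. The regex here is an illustrative assumption and may differ from the pattern codespell actually uses; `inline_ignore_words` is a hypothetical function name.

```python
import re
from typing import List, Optional

# Illustrative pattern only; the real codespell regex may differ.
INLINE_IGNORE = re.compile(r"codespell:ignore\b(?:\s+(?P<words>[\w,]+))?")


def inline_ignore_words(line: str) -> Optional[List[str]]:
    """Return None without a marker, else the listed words ([] = whole line)."""
    # Cheap guard: a line without the literal substring can never match,
    # so the regex engine is skipped for almost every line.
    if "codespell:ignore" not in line:
        return None
    match = INLINE_IGNORE.search(line)
    if match is None:
        # The substring was present but the full pattern did not match.
        return None
    words = match.group("words")
    return words.split(",") if words else []
```

The guard is only a necessary condition: a line such as `codespell:ignored` passes the substring test but is still rejected by the regex, so the results stay the same as without the guard.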

Per review comment.

nthykier · 2024-06-01T07:49:39Z

Thanks.

This PR should be ready for review now. :)

larsoner

Otherwise looks pretty straightforward!

larsoner · 2024-06-02T21:31:34Z

codespell_lib/_spellchecker.py

+
+class UnknownBuiltinDictionaryError(ValueError):
+    def __init__(self, name: str) -> None:
+        super().__init__(f"Unknown built-in dictionary: {name}")


I think codecov is right, these classes aren't used anywhere. Omit for now?

nthykier · 2024-06-05

I originally had a ValueError with an error message, but that triggers a linter warning about using long error messages with exceptions and that it is better to set the message in the __init__ (which requires the subclass shown). Therefore, omitting it triggers a red CI and is not an option unless you want me to tweak the linter to skip that check.

Alternatively, I can add a test for it, since that would solve the codecov issue while avoiding the linter issue.

DimitriPapadopoulos · 2024-06-05

To clarify, we (or at least I) do not see UnknownBuiltinDictionaryError raised or caught in the current code, and we cannot find load_builtin_dictionaries either. Why do you need to define it in this PR?

In any case, the linter rule you refer to has to be TRY003. Without seeing the code that raises or catches the exception, I cannot easily suggest ways to pacify the linter. However, if you only raise the exception once, isn't ValueError sufficient?

message = f"Unknown built-in dictionary: {name}"
raise ValueError(message)

DimitriPapadopoulos · 2024-06-03T19:13:38Z

After this nice refactoring work, I suspect the maximum value of some complexity metrics might be lowered:

codespell/pyproject.toml

Lines 169 to 174 in 3760e61

    
           [tool.ruff.lint.pylint] 
        
           allow-magic-value-types = ["bytes", "int", "str",] 
        
           max-args = 13 
        
           max-branches = 46 
        
           max-returns = 11 
        
           max-statements = 119

nthykier · 2024-06-05T08:19:25Z

Quite possibly. What do you (as a project) expect here? That I lower them to the default values for pylint, see if that works, and then re-add the keys that are still needed, lowered to the exact values required for the CI to pass?

DimitriPapadopoulos · 2024-06-05T09:21:58Z

Exactly. The default values won't work, but pylint will show you new lower values.

That can of course be deferred to a later PR or left to other maintainers - at your convenience.

nthykier force-pushed the spellchecker-api branch 9 times, most recently from e0e9bca to f9eaba2 Compare May 17, 2024 20:17

nthykier marked this pull request as ready for review May 17, 2024 20:17

nthykier requested review from larsoner and peternewman as code owners May 17, 2024 20:17

DimitriPapadopoulos mentioned this pull request May 22, 2024

release a new version ? #3387

Closed

nthykier force-pushed the spellchecker-api branch from e606739 to b766592 Compare May 25, 2024 07:09

nthykier mentioned this pull request May 25, 2024

Refactor: Move some code to new files for reuse #3434

Merged

nthykier added 2 commits May 25, 2024 07:38

Refactor: Move some code to new files for reuse

b28a5a3

No new code is introduced; only existing code is shuffled around and the functions moved are unchanged as well.

nthykier force-pushed the spellchecker-api branch 2 times, most recently from aea338c to b3034fa Compare May 25, 2024 08:24

nthykier added 8 commits May 25, 2024 08:27

Refactor line tokenization to simplify an outer loop

ef5096c

The refactor is a stepping stone towards the next commit where the inner loop is moved to the `Spellchecker`.

De-indent loop body (whitespace-only / reformatting-only change)

8bd3517

Deliberately in a separate commit. There are no functional changes, but there are some reformatting changes (line merges) as a consequence of the de-indent.

Move codespell:ignore check into Spellchecker

3c08c9b

This makes the API automatically avoid some declared false-positives that the command line tool would also filter.

nthykier force-pushed the spellchecker-api branch from b3034fa to ce280c9 Compare May 25, 2024 08:29

Refactor: Rename spellchecker.py to _spellchecker.py

ae0e8d2

Per review comment.

Merge remote-tracking branch 'origin/master'

37d4b38

larsoner reviewed Jun 2, 2024