Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python spellchecking API #3425

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

nthykier
Copy link
Contributor

@nthykier nthykier commented May 17, 2024

Adding a Spellchecker API

The proposed PR is a follow up to #3419 that introduces a Spellchecker class. This PR deliberate does not close #3419, since the public package name should probably be changed first and an __all__ provided in its __init__,py. In its simplest form, a consumer would do something like this:

# Per #3419, we may want to shuffle around the top-level name so leaving this one "undefined" at the moment
from ... import Spellchecker

# Loads "clear" + " rare" by default, use `Spellchecker(builtin_dictionaries=[...])` for choose.
s = Spellchecker()
text = """\
some
long text
with
tpyos
"""
# Provide a tokenizer (simple one for the sake of the example)
# The codespell tokenizer is a bit more involved, see `line_tokenizer_factory`
tokenizer = re.compile(r"[^ ]+").finditer
lines = text.splitlines()
for line in lines:
    s.spellcheck_line(line, tokenizer)

There is also a check_lower_cased_word method for single word checks which as a part of refactoring. I left it in because it has a simpler API if you have words you want to have checked rather than lines.

Reviewing this PR

I have optimized the review the PR by reviewing each commit individually with the aim of understanding how we got to the end result. It comes with the following commits:

$ git log origin/master..HEAD | git shortlog
Niels Thykier (11):
      Refactor: Move some code to new files for reuse
      Replace `data: str` with `candidates: Sequence[str]`
      Refactor dictionary into a new `Spellchecker` class
      Refactor line tokenization to simplify an outer loop
      Rewrite line spellchecking and move most of it into the `Spellchecker`
      De-indent loop body (whitespace-only change)
      Make `Spellchecker()` load builtin dictionaries by default
      Support non-regex based tokens for `spellcheck_line`
      Speed up spellchecking by ignoring whitespace-only lines
      Move `codespell:ignore` check into `Spellchecker`
      Speed up `codespell:ignore` check by skipping the regex in most cases

Each commit is written with the intend to be a small self-contained change. This also means that some commits are just moving code around or even seem a bit sub-optimal with following commits just using them as stepping stone. If you see something in a commit that you want changed, please note that it might be improved in a later commit. I still kept this as one PR since some of the commits ends up with a considerable performance regression in that commit, which is then resolved in later commits. This is all a trade off between keeping commits as standalone and reviewable as possible vs. being able to cut the PR at any given point time. I preferred the former.

Note that GitHub's PR review function (like Approve or Request Changes) operates on the entire PR even though you review it commit by commit.

How the API interfaces with codespell related activities

Dictionary loads

The API by default just loads the default builtin dictionaries. The constructor accepts a sequence/list of names such as clear, rare. After that, the caller can manually load custom dictionaries via load_dictionary_from_file. This is the default mode and aimed at the API consumers to quickly get started with the common case.

The codespell command is not a common case. Here, dictionaries can be loaded in any other. To facilitate that, the following flow can be used:

s = Spellchecker(builtin_dictionaries=None)
s.load_dictionary_from_file("my-early-loaded-custom-dictionary.txt")
s.load_builtin_dictionaries(["clear", "rare", "informal"])
s.load_dictionary_from_file("my-late-loaded-custom-dictionary.txt")

Various exclusions or regex based matches/ignores

Exclusions features supported directly in the API:

  • Inline ignores (codespell:ignore). Fully handled by default, caller can opt-out.
  • Ignore words, ignore words cased. Partially supported. Caller must parse/provide these to the spellchecker either at initialization or the relevant API call.

Not supported directly by the API in its current form. Caller must facilitate these:

  • Exclude lines (-X). Caller should load exclude the line instead.
  • All regex based matches / ignores. The caller should set these up and provide them in the tokenizer.
    (The codespell tokenizer "factory" plus related default regexes could become part of the API if desired)
  • --uri-ignore-words-list: Hidden inside the tokenizer.

What is to be included in the API

These are the items that I would see go into the __all__ of the codespell API in a later commit / PR.

  1. Spellchecker (introduced in this PR). All its methods would become public API.
  2. Misspelling (existing class). All its data fields would become public API (or at least,candidates + fix would be neede for me). Also, we should probably rename it to Correction or DictionaryResult, etc., since it is not a misspelling but the codespell correctional data.
    • I would personally recommend that this class becomes read-only at an API level. However, that would require that codespell uses a different way to remember its interactively chosen corrections.
  3. DetectedMisspelling (introduced in this PR). Subject to renaming of the class or its properties.
  4. LineTokenizer + Token + their generic constraint (T) for typing purposes.

That is the basic API I envisioned and I would need. This API would still have a considerable amount of "bring your own" code, notably the tokenizer is something people will probably struggle with. My project has its own with its own rules for how things should be split into tokens (like ignoring quoted words), so this was not a goal for me. However, it could still be important for you, since it will affect more casual API consumers that did not solve this problem ahead of time. :)

An alternative to more API would to have more examples of how to emulate the codespell command line tool.

Performance cost of the API

It surprised me a bit that at some point the total regression exceeded 10% on the corpus I used for testing (see #3419 for the details on the performance test setup). It was restored to an ~8% regression before the first Speed up commit.

The first Speed up commit reduces the regression to <= 1.5% (as in, the ~0.060s ballpark) The second Speed up commit was "unnecessary show off" to get below the baseline. 😆

These commits speed up commits are technically a "compensation" changes. As an example, I believe you would get a similar boost of performance by retrofitting the second speed up commit on the existing code base for a similar win. The first speed up commit is partly a compensation, partly a counter of the regression since the "overhead per line" got measurably higher (accordingly, reducing the number of redundant lines now makes the improvement considerably more worth it).

The end result is 5.6s -> 4.9s with an API and 1½-2 performance related compensations.

(Side note for GitHub: This also closes #3434.)

@nthykier nthykier force-pushed the spellchecker-api branch 9 times, most recently from e0e9bca to f9eaba2 Compare May 17, 2024 20:17
@nthykier nthykier marked this pull request as ready for review May 17, 2024 20:17
@larsoner
Copy link
Member

@nthykier let me know when I should look. I at least just approved CI runs.

One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting fro me to click "approve" on the runs).

No new code is introduced; only existing code is shuffled around and
the functions moved are unchanged as well.
When the spelling dictionaries are loaded, previously the correction
line was just stored in memory as a simple text. Through out the code,
callers would then have to deal with the `data` attribute, correctly
`split()` + `strip()` it. With this change, the dictionary parsing
code now encapsulates this problem.

The auto-correction works from the assumption that there is only one
candidate. This assumption is invariant and seem to be properly
maintained in the code. Therefore, we can just pick the first
candidate word when doing a correction.

In the code, the following name changes are performed:

 * `Misspelling.data` -> `Misspelling.candidates`
 * `fixword` -> `candidates` when used for multiple candidates
   (`fixword` remains for when it is a correction)

On performance:

Performance-wise, this change moves computation from "checking" time
to "startup" time.  The performance cost does not appear to be
noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on
the ratio of cased vs. non-cased corrections with multiple candidates
in mind.

The all lowercase typo is now slightly more expensive (it was passed
throughout `fix_case` and fed directly into the `print` in the
original code. In the new code, it will always need a `join`).  There
are still an overweight of lower-case only corrections in general, so
the unconditional `.join` alone is not sufficient to affect the
performance noticeably.
@nthykier nthykier force-pushed the spellchecker-api branch 2 times, most recently from aea338c to b3034fa Compare May 25, 2024 08:24
This is as close to a 1:1 conversion as possible. It might change
whhen we get to designing the API. The callers have been refactored to
only perform the lookup once. This was mostly to keep the code more
readable.

The performance cost does seem noticable, which is unsurprising. This
method has a higher cost towards non-matches which is the most common
case.  This commit causes the performance to drop roughly 10% on its
and we are now slower than the goal.
The refactor is a stepping stone towards the next commit where the
inner loop is moved to the `Spellchecker`.
With this rewrite, performance improved slightly and is now down to 7%
slower than the baseline (6s vs. 5.6s).

There is deliberate an over-indentation left in this commit, since
that makes this commit easier to review (without ignoring space
changes).
Deliberately in a separate. There are no functional changes, but there
are some reformatting changes (line merges) as a consequence of the
de-indent.
The `Spellchecker` only needs the `group` method from the `re.Match`.
With a bit of generics and typing protocols, we can make the
`Spellchecker` work with any token type that has a `group()` method.

The `codespell` command line tool still assumes `re.Match` but it can
get that via its own line tokenizer, so it all works out for everyone.
The new API has introduced extra overhead per line being spellchecked.
One way of optimizing out this overhead, is to spellcheck fewer lines.
An obvious choice here, is to optimize out empty and whitespace-only
lines, since they will not have any typos at all (on account of not
having any words).

A side-effect of this change is that we now spellcheck lines with
trailing whitespace stripped. Semantically, this gives the same result
(per "whitespace never has typos"). Performance-wise, it is faster in
theory because the strings are now shorter (since we were calling
`.rstrip()` anyway). In pratice, I am not sure we are going to find
any real corpus where the trailing whitespace is noteworthy from a
performance point of view.

On the performance corpus from codespell-project#3491, this takes out ~0.4s of
runtime brining us down to slightly above the 5.6s that made the
baseline.
This makes the API automatically avoid some declared false-positives
that the command line tool would also filter.
The changes to provide a public API had some performance related costs
of about 1% runtime. There is no trivial way to offset this any
further without undermining the API we are building. However, we can
pull performance-related shenanigans to compenstate for the cost
introduced.

The codespell codebase unsurprisingly spends a vast majority of its
runtime in various regex related code such as `search` and `finditer`.

The best way to optimize runtime spend in regexes is to not do a regex
in the first place, since the regex engine has a rather steep overhead
over regular string primitives (that is the cost of flexibility). If
the regex rarely matches and there is a very easy static substring
that can be used to rule out the match, then you can speed up the code
by using `substring in string` as a conditional to skip the
regex. This is assuming the regex is used enough for the performance
to matter.

An obvious choice here falls on the `codespell:ignore` regex, because
it has a very distinctive substring in the form of `codespell:ignore`,
which will rule out almost all lines that will not match.

With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus
mentioned in codespell-project#3419.
@nthykier
Copy link
Contributor Author

nthykier commented Jun 1, 2024

@nthykier let me know when I should look. I at least just approved CI runs.

One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting fro me to click "approve" on the runs).

Thanks.

This PR should be ready for review now. :)

Copy link
Member

@larsoner larsoner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise looks pretty straightforward !


class UnknownBuiltinDictionaryError(ValueError):
def __init__(self, name: str) -> None:
super().__init__(f"Unknown built-in dictionary: {name}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think codecov is right, these classes aren't used anywhere. Omit for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had a ValueError with an error message, but that triggers a linter warning about using long error messages with exceptions and that it is better to set the message in the __init__ (which requires the subclass shown). Therefore, omitting it triggers a red CI and is not an option unless you want me to tweak the linter to skip that check.

Alternative, I can add a test for it, since that would solve the codecov issue while avoiding the linter issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, we (or at least I) do not see UnknownBuiltinDictionaryError raised or caught in the current code, and we cannot find load_builtin_dictionaries either. Why do you need to define it in this PR?

In any case, the linter rule you refer to has to be TRY003. Without seeing the code that raises or catches the exception, I cannot easily suggest ways to pacify the linter. However, if you only raise the exception once, isn't ValueError sufficient?

    message = f"Unknown built-in dictionary: {name}"
    raise ValueError(message)

@DimitriPapadopoulos
Copy link
Collaborator

DimitriPapadopoulos commented Jun 3, 2024

After this nice refactoring work, I suspect the maximum value of some complexity metrics might be lowered:

codespell/pyproject.toml

Lines 169 to 174 in 3760e61

[tool.ruff.lint.pylint]
allow-magic-value-types = ["bytes", "int", "str",]
max-args = 13
max-branches = 46
max-returns = 11
max-statements = 119

@nthykier
Copy link
Contributor Author

nthykier commented Jun 5, 2024

After this nice refactoring work, I suspect the maximum of some complexity metrics might be lowered:

codespell/pyproject.toml

Lines 169 to 174 in 3760e61

[tool.ruff.lint.pylint]
allow-magic-value-types = ["bytes", "int", "str",]
max-args = 13
max-branches = 46
max-returns = 11
max-statements = 119

Quite possibly. What do you (as a project) expect here? That I lower them to the default values for pylint, see if that works and re-add the keys needed for the CI to pass lowered the the exact amount required to have the CI pass?

@DimitriPapadopoulos
Copy link
Collaborator

DimitriPapadopoulos commented Jun 5, 2024

Quite possibly. What do you (as a project) expect here? That I lower them to the default values for pylint, see if that works and re-add the keys needed for the CI to pass lowered the the exact amount required to have the CI pass?

Exactly. The default values won't work, but pylint will show you new lower values.

That can of course be deferred to a later PR or left to other maintainers - at your convenience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Python API support
3 participants