Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single word / part of sentence correction #9

Open
lumpidu opened this issue Jan 13, 2021 · 2 comments
Open

Single word / part of sentence correction #9

lumpidu opened this issue Jan 13, 2021 · 2 comments

Comments

@lumpidu
Copy link

lumpidu commented Jan 13, 2021

I want to use Greynir-Correct for correction of non-whole sentences, i.e. in extreme cases single words. What method or options should I use to make that possible ?

Currently, when using the tokenize() method with option only_ci=True, it complains about the following:

Maðurin      Z002     Orð á að byrja á hástaf: 'maðurin'
Maðurinn     Z002     Orð á að byrja á hástaf: 'maðurinn'

Sample code:

from reynir_correct import tokenize

texts = ["maðurin", "maðurinn" ]

for t in texts:
    g = tokenize(t, only_ci=True)
    for t in g:
        if t.txt:
            print(f"{t.txt:12} {t.error_code:8} {t.error_description}")
@vthorsteinsson
Copy link
Member

Interesting question, and this may well be a use case that we should support better. As is, the code is mostly oriented towards review of continuous text, typically whole sentences.

The code that checks the spelling of a single token is basically around this line. The call to spelling.Corrector.correct() can optionally be provided with a context, i.e. preceding tokens that will then be used to adjust the correction probabilities based on a trigram language model.

See also the short test function at the bottom of spelling.py.

@lumpidu
Copy link
Author

lumpidu commented Jan 13, 2021

At least the documentation of tokenize() doesn't state assumptions about the text structure in contrast to the documentation of the methods check() or check_single(). Yes this use case exists e.g. for spell checking of web input forms, where often only single words or short text terms are entered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants