Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use guess_assess to determine confidence? #109

Closed
Mr0grog opened this issue Oct 5, 2023 · 7 comments
Closed

Use guess_assess to determine confidence? #109

Mr0grog opened this issue Oct 5, 2023 · 7 comments

Comments

@Mr0grog
Copy link
Contributor

Mr0grog commented Oct 5, 2023

In compat.detect, the confidence is always 0.99:

# chardetng does not return a confidence value
# This is the value which is unconditionally returned
DEFAULT_CONFIDENCE: Final[float] = 0.99

It’s possible I’m misunderstanding things, but it seems like you could use guess_assess() instead of guess() to at least determine high vs. low confidence. From the chardetng docs:

Same as guess(), but also returns a Boolean indicating whether the guessed encoding had a higher score than at least one other candidate. If this method returns false, the guessed encoding is likely to be wrong.

On a related note, it would be nice if shortcuts.detect() returned the boolean from guess_assess, too (or had another method that did so), since the aliases buried in the shortcuts module seem pretty important, and possibly prone to change. Otherwise a user has to know about them if they want to safely use guess_assess instead. (Or the aliases could be documented and moved to a more accessible place.)

@john-parton
Copy link
Owner

Thanks for taking the time to comment.

If you're integrating this into a new project, my recommendation is to directly use the class bindings: https://chardetng-py.readthedocs.io/en/latest/class_reference.html

Your code should make a determination of how important the higher_score return value is.

However, I do agree that we could return two different confidence values depending on the value of higher_score. Perhaps is higher_score is False, the confidence value can be 0.01 and if it's True it can be 0.99. Thoughts?

I would happily accept a pull request that updates the documentation to drive people to use the class bindings. They're actually really nice, and you can do some things with it that other libraries can't even do. For instance, you can read a file in fixed sized chunks and feed them into the detector chunk-by-chunk, get a detected encoding, and then put the cursor back at the start of the to start reading it as text. In that way, you can cap the memory use of the application.The other python libraries that do charset detection require you to read the entire file into memory.

@john-parton
Copy link
Owner

Basic outline of the proposed change. https://github.com/john-parton/chardetng-py/pull/110/files

Needs docs

@Mr0grog
Copy link
Contributor Author

Mr0grog commented Oct 5, 2023

However, I do agree that we could return two different confidence values depending on the value of higher_score. Perhaps is higher_score is False, the confidence value can be 0.01 and if it's True it can be 0.99. Thoughts?

That makes sense to me!

If you're integrating this into a new project, my recommendation is to directly use the class bindings: https://chardetng-py.readthedocs.io/en/latest/class_reference.html

I guess my concern here was that, if I do this, I lose the aliases support from the shortcut:

ALIASES: Final[Dict[str, str]] = {
"windows-874": "cp874",
}

The code and discussion in #11 read like those are important because some_bytes.decode(encoding_name) might not work like you’d expect without them. (Am I getting that wrong?) It also seems like that’s a list that might change, and I’d like to not have to remember to check the source on every release to see if they are different and update my code that doesn’t use the shortcut to match (or reach down into chardetng_py.shortcuts.ALIASES, which feels like an implementation detail I shouldn’t rely on).

Maybe the right answer there is to get the aliases out of the shortcut module and into src/lib.rs? (Or some new Python code that wraps the Rust EncodingDetector instead of directly exposing it?)


Also: I didn’t realize there were better docs on ReadTheDocs! 👍 (I had figured out how to use EncodingDetector by using the source.) It might help more people find them if you call that out directly in the text of the README (not just the little status shield at the top, which also currently lists ReadTheDocs as failing) and maybe set it as the website URL for the repo on GitHub and as a project link on PyPI (you can set up the links on PyPI using [project.urls] in pyproject.tomlurllib3 is a good example).

@john-parton
Copy link
Owner

Perhaps it would be best if the aliasing were done in rust, in the glue code I have. Good points in making it clearer that there's docs on readthedocs. I'll go create some separate issues for those.

@john-parton
Copy link
Owner

Here's a PR that moves the ALIASES logic into lib.rs #112

@john-parton
Copy link
Owner

I merged in some changes to get the docs building again and added a clear link to the documentation in the readme.

@john-parton
Copy link
Owner

Closed by #110

Documentation has been updated to reflect the change. I'm not sure 0.99 and 0.01 are the best values, but if your application really cares, it should probably use the EncodingDetector class directly and use the bool it returns in guess_assess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants