Use guess_assess to determine confidence? #109

Mr0grog · 2023-10-05T00:52:29Z

In compat.detect, the confidence is always 0.99:

chardetng-py/python/chardetng_py/compat.py

Lines 16 to 18 in f90b454

    
           # chardetng does not return a confidence value 
        
           # This is the value which is unconditionally returned 
        
           DEFAULT_CONFIDENCE: Final[float] = 0.99

It’s possible I’m misunderstanding things, but it seems like you could use guess_assess() instead of guess() to at least determine high vs. low confidence. From the chardetng docs:

Same as guess(), but also returns a Boolean indicating whether the guessed encoding had a higher score than at least one other candidate. If this method returns false, the guessed encoding is likely to be wrong.

On a related note, it would be nice if shortcuts.detect() returned the boolean from guess_assess, too (or had another method that did so), since the aliases buried in the shortcuts module seem pretty important, and possibly prone to change. Otherwise a user has to know about them if they want to safely use guess_assess instead. (Or the aliases could be documented and moved to a more accessible place.)

The text was updated successfully, but these errors were encountered:

john-parton · 2023-10-05T01:20:08Z

Thanks for taking the time to comment.

If you're integrating this into a new project, my recommendation is to directly use the class bindings: https://chardetng-py.readthedocs.io/en/latest/class_reference.html

Your code should make a determination of how important the higher_score return value is.

However, I do agree that we could return two different confidence values depending on the value of higher_score. Perhaps is higher_score is False, the confidence value can be 0.01 and if it's True it can be 0.99. Thoughts?

I would happily accept a pull request that updates the documentation to drive people to use the class bindings. They're actually really nice, and you can do some things with it that other libraries can't even do. For instance, you can read a file in fixed sized chunks and feed them into the detector chunk-by-chunk, get a detected encoding, and then put the cursor back at the start of the to start reading it as text. In that way, you can cap the memory use of the application.The other python libraries that do charset detection require you to read the entire file into memory.

john-parton · 2023-10-05T01:36:04Z

Basic outline of the proposed change. https://github.com/john-parton/chardetng-py/pull/110/files

Needs docs

Mr0grog · 2023-10-05T02:57:50Z

However, I do agree that we could return two different confidence values depending on the value of higher_score. Perhaps is higher_score is False, the confidence value can be 0.01 and if it's True it can be 0.99. Thoughts?

That makes sense to me!

If you're integrating this into a new project, my recommendation is to directly use the class bindings: https://chardetng-py.readthedocs.io/en/latest/class_reference.html

I guess my concern here was that, if I do this, I lose the aliases support from the shortcut:

chardetng-py/python/chardetng_py/shortcuts.py

Lines 7 to 9 in 65e0a5a

    
           ALIASES: Final[Dict[str, str]] = { 
        
               "windows-874": "cp874", 
        
           }

The code and discussion in #11 read like those are important because some_bytes.decode(encoding_name) might not work like you’d expect without them. (Am I getting that wrong?) It also seems like that’s a list that might change, and I’d like to not have to remember to check the source on every release to see if they are different and update my code that doesn’t use the shortcut to match (or reach down into chardetng_py.shortcuts.ALIASES, which feels like an implementation detail I shouldn’t rely on).

Maybe the right answer there is to get the aliases out of the shortcut module and into src/lib.rs? (Or some new Python code that wraps the Rust EncodingDetector instead of directly exposing it?)

Also: I didn’t realize there were better docs on ReadTheDocs! 👍 (I had figured out how to use EncodingDetector by using the source.) It might help more people find them if you call that out directly in the text of the README (not just the little status shield at the top, which also currently lists ReadTheDocs as failing) and maybe set it as the website URL for the repo on GitHub and as a project link on PyPI (you can set up the links on PyPI using [project.urls] in pyproject.toml — urllib3 is a good example).

john-parton · 2023-10-05T03:10:45Z

Perhaps it would be best if the aliasing were done in rust, in the glue code I have. Good points in making it clearer that there's docs on readthedocs. I'll go create some separate issues for those.

john-parton · 2023-10-05T03:24:57Z

Here's a PR that moves the ALIASES logic into lib.rs #112

john-parton · 2023-10-10T19:33:30Z

I merged in some changes to get the docs building again and added a clear link to the documentation in the readme.

john-parton · 2023-10-18T20:03:05Z

Closed by #110

Documentation has been updated to reflect the change. I'm not sure 0.99 and 0.01 are the best values, but if your application really cares, it should probably use the EncodingDetector class directly and use the bool it returns in guess_assess

john-parton mentioned this issue Oct 5, 2023

Hacked together high/low confidence. #110

Merged

john-parton mentioned this issue Oct 5, 2023

Add clearer links to documentation #111

Closed

2 tasks

john-parton closed this as completed Oct 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use guess_assess to determine confidence? #109

Use guess_assess to determine confidence? #109

Mr0grog commented Oct 5, 2023

john-parton commented Oct 5, 2023

john-parton commented Oct 5, 2023

Mr0grog commented Oct 5, 2023 •

edited

Loading

john-parton commented Oct 5, 2023

john-parton commented Oct 5, 2023

john-parton commented Oct 10, 2023

john-parton commented Oct 18, 2023

Use guess_assess to determine confidence? #109

Use guess_assess to determine confidence? #109

Comments

Mr0grog commented Oct 5, 2023

john-parton commented Oct 5, 2023

john-parton commented Oct 5, 2023

Mr0grog commented Oct 5, 2023 • edited Loading

john-parton commented Oct 5, 2023

john-parton commented Oct 5, 2023

john-parton commented Oct 10, 2023

john-parton commented Oct 18, 2023

Mr0grog commented Oct 5, 2023 •

edited

Loading