Use guess_assess to determine confidence? #109
Comments
Thanks for taking the time to comment. If you're integrating this into a new project, my recommendation is to directly use the class bindings: https://chardetng-py.readthedocs.io/en/latest/class_reference.html

Your code should make a determination of how important the However, I do agree that we could return two different confidence values depending on the value of

I would happily accept a pull request that updates the documentation to drive people to use the class bindings. They're actually really nice, and you can do some things with them that other libraries can't do. For instance, you can read a file in fixed-size chunks and feed them into the detector chunk by chunk, get a detected encoding, and then put the cursor back at the start of the file to start reading it as text. In that way, you can cap the memory use of the application. The other Python libraries that do charset detection require you to read the entire file into memory.
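The chunk-by-chunk approach described above could look roughly like the sketch below. It is written against a small Detector protocol rather than importing the library, on the assumption that the class bindings expose feed(buffer, last=...) and guess(tld=..., allow_utf8=...) the way chardetng itself does; check the class reference linked above for the exact signatures. The helper name detect_then_rewind and the 64 KiB chunk size are made up for illustration.

```python
import io
from typing import BinaryIO, Optional, Protocol


class Detector(Protocol):
    """Assumed shape of a streaming detector, modelled on chardetng's feed/guess API."""

    def feed(self, buffer: bytes, *, last: bool) -> bool: ...

    def guess(self, *, tld: Optional[bytes], allow_utf8: bool) -> str: ...


def detect_then_rewind(
    stream: BinaryIO,
    detector: Detector,
    chunk_size: int = 64 * 1024,  # illustrative value, not a library default
) -> io.TextIOWrapper:
    """Feed a seekable binary stream to the detector in bounded chunks,
    then rewind and reopen the same stream as text."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Signal end of input with an empty final buffer.
            detector.feed(b"", last=True)
            break
        detector.feed(chunk, last=False)

    encoding = detector.guess(tld=None, allow_utf8=True)

    # Only one chunk was ever held in memory; go back and decode from the start.
    stream.seek(0)
    return io.TextIOWrapper(stream, encoding=encoding)
```

With the real bindings, the detector argument would presumably be an instance of the detector class from the class reference. The name returned by guess may also need to be mapped through the aliases before Python's codec machinery accepts it.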
Basic outline of the proposed change: https://github.com/john-parton/chardetng-py/pull/110/files

Needs docs.
That makes sense to me!
I guess my concern here was that, if I do this, I lose the aliases support from the shortcut:

chardetng-py/python/chardetng_py/shortcuts.py
Lines 7 to 9 in 65e0a5a

The code and discussion in #11 read like those are important because Maybe the right answer there is to get the aliases out of the shortcut module and into

Also: I didn’t realize there were better docs on ReadTheDocs! 👍 (I had figured out how to use
Perhaps it would be best if the aliasing were done in Rust, in the glue code I have. Good point about making it clearer that there are docs on ReadTheDocs. I'll go create some separate issues for those.
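To make the aliasing concern concrete, here is a minimal sketch of what a caller of the class bindings would have to do on their own if the alias table stayed private to the shortcuts module. The function name to_python_codec is hypothetical, and the alias table is passed in by the caller because the project's actual entries are not reproduced here; the only library call used is the standard codecs.lookup.

```python
import codecs
from typing import Mapping


def to_python_codec(detected_name: str, aliases: Mapping[str, str]) -> str:
    """Translate a detector-reported encoding label into a Python codec name.

    `aliases` is a caller-supplied table (hypothetical here) for labels that
    Python's codec registry does not recognise under the detector's spelling.
    """
    candidate = aliases.get(detected_name, detected_name)
    # codecs.lookup raises LookupError for unknown names, so an unmapped
    # label fails loudly here instead of later, at decode time.
    return codecs.lookup(candidate).name
```

Doing the mapping in one shared place, as suggested above, would spare every caller from maintaining a table like this.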
Here's a PR that moves the ALIASES logic into lib.rs: #112
I merged in some changes to get the docs building again and added a clear link to the documentation in the README.
Closed by #110. Documentation has been updated to reflect the change. I'm not sure 0.99 and 0.01 are the best values, but if your application really cares, it should probably use the
In compat.detect, the confidence is always 0.99:

chardetng-py/python/chardetng_py/compat.py
Lines 16 to 18 in f90b454

It’s possible I’m misunderstanding things, but it seems like you could use guess_assess() instead of guess() to at least determine high vs. low confidence. From the chardetng docs:

On a related note, it would be nice if shortcuts.detect() returned the boolean from guess_assess, too (or had another method that did so), since the aliases buried in the shortcuts module seem pretty important, and possibly prone to change. Otherwise a user has to know about them if they want to safely use guess_assess instead. (Or the aliases could be documented and moved to a more accessible place.)
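As a rough illustration of the suggestion, the sketch below turns an (encoding, boolean) pair, which is what guess_assess is understood to return, into a result with a two-level confidence. The 0.99 and 0.01 values are the ones mentioned in the closing comment; the function name, the constant names, and the dictionary shape are assumptions for illustration, not the code that was merged.

```python
from typing import Tuple, TypedDict


class DetectionResult(TypedDict):
    encoding: str
    confidence: float


# Two-level confidence; the specific values come from the discussion in this thread.
HIGH_CONFIDENCE = 0.99
LOW_CONFIDENCE = 0.01


def result_from_assessment(assessment: Tuple[str, bool]) -> DetectionResult:
    """Map an (encoding, is_confident) pair to a result with a coarse confidence."""
    encoding, is_confident = assessment
    return {
        "encoding": encoding,
        "confidence": HIGH_CONFIDENCE if is_confident else LOW_CONFIDENCE,
    }
```

Callers that need more than a coarse high/low signal would still be better served by working with the boolean directly.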