UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128) #151
Comments
Thanks for reporting, @catqaq. The code works for me, so I suspect it's a platform-specific problem. Can you confirm your operating system? If you set the If the above works, I think I would be able to fix it by specifying utf8 encoding in places where it's not currently specified (e.g., the
Thanks! Here is my environment info (from "transformers-cli env"):
I tried setting the PYTHONUTF8=1 environment variable, following https://stackoverflow.com/questions/50933194/how-do-i-set-the-pythonutf8-environment-variable-to-enable-utf-8-encoding-by-def, but I still got the same error.
Fascinating that PYTHONUTF8 doesn't work. Thanks for testing the encoding fix. Out of curiosity, can you test:
`import locale
print(locale.getpreferredencoding())`
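For anyone hitting the same error, a slightly broader check may help narrow things down. This is only a diagnostic sketch using standard-library calls (it is not part of ir_datasets): it prints the locale's preferred encoding, whether Python's UTF-8 mode is active, and the filesystem encoding. A preferred encoding such as an ASCII variant would suggest that files opened without an explicit `encoding` argument are being decoded as ASCII.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding is passed
# (unless UTF-8 mode is enabled).
preferred = locale.getpreferredencoding(False)

# 1 if UTF-8 mode is on (e.g. via PYTHONUTF8=1 or -X utf8), otherwise 0.
utf8_mode = sys.flags.utf8_mode

# Encoding used for file names and paths.
fs_encoding = sys.getfilesystemencoding()

print("preferred encoding:", preferred)
print("UTF-8 mode:", utf8_mode)
print("filesystem encoding:", fs_encoding)
```

If UTF-8 mode shows 0 even though PYTHONUTF8=1 was set, the variable probably did not reach the Python process (for example, it was set in a different shell or after the interpreter started).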
So I tried to fix this by following https://stackoverflow.com/questions/44344458/why-does-locale-getpreferredencoding-return-ansi-x3-4-1968-instead-of-utf-8.
Gotcha, thanks! I've opened a PR (#152) that should properly set the encoding everywhere. I'd like to test it a bit more before merging, though, because it touches a lot of files. (In most situations it probably doesn't matter, given the type of data stored in the files.)
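As a rough illustration of the kind of change described here (not the actual contents of the PR), passing `encoding="utf8"` explicitly to `open()` makes reading and writing independent of the platform's locale. The file name and sample text below are made up for the example.

```python
import os
import tempfile

# Hypothetical sample text; the right single quote (U+2019) is non-ASCII.
text = "It\u2019s a passage with a curly apostrophe"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")  # hypothetical file name

    # Write with an explicit encoding instead of relying on the locale default.
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

    # Read back with the same explicit encoding; this works even when
    # locale.getpreferredencoding() reports an ASCII-only encoding.
    with open(path, "r", encoding="utf8") as f:
        round_tripped = f.read()

print(round_tripped == text)
```

Without the `encoding` argument, `open()` in text mode uses the locale's preferred encoding, which is why the same code can work on one machine and fail on another.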
Okay. We've been suffering without an easy-to-use IR dataset interface for a long time, so thanks for your excellent work!
Thanks!
Describe the bug
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128)
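For context, the same class of error can be reproduced in isolation. The sketch below assumes, purely as an example, that the offending byte belongs to a UTF-8 encoded right single quote (U+2019), whose first byte is 0xe2; the actual character in the dataset file is not known from this report.

```python
# U+2019 encodes to three bytes in UTF-8, the first of which is 0xe2.
raw = "\u2019".encode("utf8")

try:
    # Decoding non-ASCII bytes with the ascii codec raises UnicodeDecodeError.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as err:
    message = str(err)

print(message)
```

The position in the message differs from the one reported above because this example decodes only three bytes rather than a line from the dataset.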
Affected dataset(s)
'msmarco-passage/train'
To Reproduce
Steps to reproduce the behavior:
Just run the official demo code:
`import ir_datasets

if __name__ == "__main__":
    dataset = ir_datasets.load('msmarco-passage/train')
    # Documents
    for doc in dataset.docs_iter():
        print(doc)
`
Expected behavior
get normal output
Additional context
Add any other context about the problem here.