UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128) #151

Closed
catqaq opened this issue Jan 12, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@catqaq

catqaq commented Jan 12, 2022

Describe the bug
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128)

Affected dataset(s)
'msmarco-passage/train'
To Reproduce
Steps to reproduce the behavior:
Just run the official demo code:
`import ir_datasets

if __name__ == "__main__":
    dataset = ir_datasets.load('msmarco-passage/train')
    # Documents
    for doc in dataset.docs_iter():
        print(doc)
`

Expected behavior
get normal output

Additional context
Add any other context about the problem here.

@catqaq added the bug (Something isn't working) label Jan 12, 2022
@seanmacavaney
Collaborator

Thanks for reporting @catqaq. The code works for me, so I suspect it's a platform-specific problem. Can you confirm your operating system?

If you set the PYTHONUTF8=1 environment variable, do you no longer get the error?

If the above works, I think I can fix it by specifying UTF-8 encoding in places where it's not currently specified (e.g., the TextIOWrappers here).
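
For readers following along, here is a minimal sketch of why an explicit encoding on a `TextIOWrapper` avoids this error regardless of the platform locale. It is not the ir_datasets code: the sample bytes and the `read_text` helper are made up for illustration.

```python
import io

# Sample bytes containing a right single quotation mark (U+2019),
# whose UTF-8 encoding starts with the byte 0xe2 (as in the traceback).
raw = "It\u2019s a passage".encode("utf-8")


def read_text(data: bytes, encoding: str) -> str:
    """Decode an in-memory binary stream with an explicitly chosen encoding."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding) as stream:
        return stream.read()


# Decoding as ASCII fails on the 0xe2 byte, mirroring the reported error.
try:
    read_text(raw, "ascii")
    ascii_failed = False
    ascii_error = ""
except UnicodeDecodeError as err:
    ascii_failed = True
    ascii_error = str(err)

# Decoding as UTF-8 succeeds no matter what the locale says.
utf8_text = read_text(raw, "utf-8")
print(ascii_failed, ascii_error)
print(utf8_text)
```

When `encoding` is omitted, the wrapper falls back to a locale-dependent default, which is why the same script can work on one machine and fail on another.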

@catqaq
Author

catqaq commented Jan 13, 2022

Thanks! Here is my environment info (from "transformers-cli env"):

  • transformers version: 4.15.0
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-buster-sid
  • Python version: 3.6.13
  • PyTorch version (GPU?): 1.10.1+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

I tried setting the PYTHONUTF8=1 environment variable, following https://stackoverflow.com/questions/50933194/how-do-i-set-the-pythonutf8-environment-variable-to-enable-utf-8-encoding-by-def, but I still got the same error.
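
One possible explanation, offered as an assumption rather than something confirmed in this thread: the UTF-8 mode that `PYTHONUTF8` switches on was introduced in Python 3.7 (PEP 540), and the environment above reports Python 3.6.13, where the variable would simply be ignored. A small check along these lines shows whether the running interpreter supports and has enabled UTF-8 mode; the `utf8_mode_status` helper is hypothetical.

```python
import os
import sys


def utf8_mode_status() -> str:
    """Report whether this interpreter supports and has enabled UTF-8 mode."""
    # sys.flags.utf8_mode only exists on Python 3.7 and later.
    flag = getattr(sys.flags, "utf8_mode", None)
    if flag is None:
        return "unsupported"
    return "enabled" if flag else "disabled"


print("Python version:", sys.version_info[:2])
print("PYTHONUTF8 env var:", os.environ.get("PYTHONUTF8"))
print("UTF-8 mode:", utf8_mode_status())
```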

@catqaq
Author

catqaq commented Jan 13, 2022

I don't know why setting the PYTHONUTF8=1 environment variable did not work, but setting utf8 in TextIOWrapper works for me.
if self.stream is None:
    if isinstance(self.dlc, list):
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding='utf-8')
    else:
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding='utf-8')
oh, the code display is a bit messy.
image

@seanmacavaney
Collaborator

Fascinating that PYTHONUTF8 doesn't work-- thanks for testing the encoding fix.

Out of curiosity, can you test:

import locale
print(locale.getpreferredencoding())
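
A slightly extended sketch of the same check, assuming nothing beyond the standard library: it compares the locale's preferred encoding with the encoding a `TextIOWrapper` picks when none is given. The two normally agree, which is why a non-UTF-8 locale leads to the decode error above.

```python
import codecs
import io
import locale

# Encoding the locale reports (False avoids changing the process locale).
preferred = locale.getpreferredencoding(False)

# Encoding a TextIOWrapper falls back to when encoding is not specified.
default_encoding = io.TextIOWrapper(io.BytesIO(b"")).encoding

# Normalize the names, since the same codec can be spelled differently
# (for example "UTF-8" versus "utf-8").
preferred_name = codecs.lookup(preferred).name
default_name = codecs.lookup(default_encoding).name

print("locale preferred encoding:", preferred, "->", preferred_name)
print("TextIOWrapper default:", default_encoding, "->", default_name)
```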

@catqaq
Author

catqaq commented Jan 14, 2022

Oh, I got ANSI:
image

So I tried to fix this by following https://stackoverflow.com/questions/44344458/why-does-locale-getpreferredencoding-return-ansi-x3-4-1968-instead-of-utf-8:

  1. apt install locales-all
  2. export LANG="en_US.UTF-8"

Then I got the UTF-8 encoding:
image

@seanmacavaney
Collaborator

Gotcha- thanks! I've opened a PR (#152) that should properly set the encoding everywhere. I'd like to test it a bit more before merging though because it touches a lot of files. (It looks like in most situations it probably doesn't matter though, given the type of data stored in the files.)

@catqaq
Author

catqaq commented Jan 14, 2022

Okay, we've been suffering without some easy-to-use IR dataset interface for a long time, thanks for your excellent work!

@seanmacavaney
Collaborator

Thanks!
