UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128) #151

catqaq · 2022-01-12T02:33:02Z

Describe the bug
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128)

Affected dataset(s)
'msmarco-passage/train'
To Reproduce
Steps to reproduce the behavior:
Just run the official demo code:
`import ir_datasets

if __name__ == "__main__":
    dataset = ir_datasets.load('msmarco-passage/train')
    # Documents
    for doc in dataset.docs_iter():
        print(doc)
`

Expected behavior
The documents are printed without errors.

seanmacavaney · 2022-01-12T09:34:26Z

Thanks for reporting @catqaq. The code works for me, so I suspect it's a platform-specific problem. Can you confirm your operating system?

If you set the PYTHONUTF8=1 environment variable, do you no longer get the error?

If the above works, I think I would be able to fix it by specifying UTF-8 encoding in places where it's not currently specified (e.g., the TextIOWrappers here).
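For readers following along, here is a minimal sketch of why an explicit encoding matters. It does not use ir_datasets at all: the sample bytes and the `read_text` helper are made up for illustration, and passing `"ascii"` only simulates a platform whose default encoding is ASCII.

```python
import io

# Hypothetical sample data: U+2019 (right single quotation mark) is encoded
# in UTF-8 as three bytes starting with 0xe2, the byte named in the error.
raw = "the dog\u2019s bowl\n".encode("utf-8")


def read_text(data, encoding):
    """Decode a binary stream through TextIOWrapper with an explicit encoding."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding) as stream:
        return stream.read()


# Simulate a platform whose default text encoding is ASCII.
try:
    read_text(raw, "ascii")
    ascii_failed = False
except UnicodeDecodeError as err:
    ascii_failed = True
    print(err)

# With the encoding pinned to UTF-8, the same bytes decode cleanly.
text = read_text(raw, "utf-8")
print(text)
```

When `encoding` is omitted, `TextIOWrapper` falls back to the locale's preferred encoding, which is why the same code can work on one machine and fail on another.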

catqaq · 2022-01-13T02:18:12Z

Thanks! Here is my environment info (from "transformers-cli env"):

transformers version: 4.15.0
Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-buster-sid
Python version: 3.6.13
PyTorch version (GPU?): 1.10.1+cu102 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

I tried setting the PYTHONUTF8=1 environment variable, following https://stackoverflow.com/questions/50933194/how-do-i-set-the-pythonutf8-environment-variable-to-enable-utf-8-encoding-by-def, but I still got the same error.

catqaq · 2022-01-13T02:33:59Z

I don't know why setting the PYTHONUTF8=1 environment variable did not work, but setting the encoding to UTF-8 in TextIOWrapper works for me:

if self.stream is None:
    if isinstance(self.dlc, list):
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding='utf-8')
    else:
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding='utf-8')

seanmacavaney · 2022-01-13T09:28:05Z

Fascinating that PYTHONUTF8 doesn't work-- thanks for testing the encoding fix.

Out of curiosity, can you test:

import locale
print(locale.getpreferredencoding())
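
A slightly fuller diagnostic, as a sketch: the `encoding_report` helper below is hypothetical and simply gathers the interpreter settings that influence default text decoding. Note that UTF-8 mode (and with it the PYTHONUTF8 variable) was added in Python 3.7 (PEP 540), so on the Python 3.6.13 reported above the variable would not be expected to have any effect, which may explain the earlier result.

```python
import locale
import sys


def encoding_report():
    """Collect the settings that decide how text is decoded by default."""
    return {
        "preferred_encoding": locale.getpreferredencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # sys.flags.utf8_mode exists only on Python 3.7+;
        # None here means the interpreter predates UTF-8 mode.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        "python_version": "%d.%d" % sys.version_info[:2],
    }


report = encoding_report()
for key, value in report.items():
    print(key, "=", value)
```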

catqaq · 2022-01-14T02:48:54Z

Oh, I got ANSI.

Following https://stackoverflow.com/questions/44344458/why-does-locale-getpreferredencoding-return-ansi-x3-4-1968-instead-of-utf-8, I tried to fix this with:

apt install locales-all
export LANG="en_US.UTF-8"

Then I got the UTF-8 encoding.
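
To confirm a locale change like this from Python, one option is to normalize the reported name through the codec registry rather than compare strings by hand. This is only a sketch: `is_utf8` is a hypothetical helper, and the name `ANSI_X3.4-1968` in the comment is taken from the title of the linked Stack Overflow question, not from the output in this thread.

```python
import codecs
import locale


def is_utf8(encoding_name):
    """Return True if the name resolves to Python's UTF-8 codec."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


# Names such as ANSI_X3.4-1968 are aliases of the ASCII codec,
# so they resolve to something other than "utf-8".
current = locale.getpreferredencoding()
print(current, "-> UTF-8" if is_utf8(current) else "-> not UTF-8")
```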

seanmacavaney · 2022-01-14T09:34:32Z

Gotcha- thanks! I've opened a PR (#152) that should properly set the encoding everywhere. I'd like to test it a bit more before merging though because it touches a lot of files. (It looks like in most situations it probably doesn't matter though, given the type of data stored in the files.)

catqaq · 2022-01-14T12:27:42Z

Okay, we've been suffering without some easy-to-use IR dataset interface for a long time, thanks for your excellent work!

seanmacavaney · 2022-01-14T12:47:40Z

Thanks!

catqaq added the bug label Jan 12, 2022

seanmacavaney mentioned this issue Jan 13, 2022

encoding fixes #152

Open

catqaq closed this as completed Mar 1, 2022

seanmacavaney mentioned this issue Sep 3, 2022

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs #208

Open
