How to resolve "UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte" #14

Open
Yahsaswi opened this issue May 9, 2017 · 1 comment

Comments

Yahsaswi commented May 9, 2017

I have some large text files that contain such characters. I would like to ignore these characters and proceed with the Sent2Vec conversion. I see the error below; please help me fix it.
  File "kfold1.py", line 34, in <module>
    model = Sent2Vec(LineSentence(sent_file), model_file=input_file + '.model')
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 800, in __init__
    self.reset_sent_vec(sentences)
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 809, in reset_sent_vec
    for sent in sentences:
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 1113, in __iter__
    yield utils.to_unicode(line).split()
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte
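
One possible way to skip the offending bytes is to decode with errors='ignore' instead of the default strict mode. The traceback shows that any2unicode already forwards an errors argument to unicode(), so passing errors='ignore' at the utils.to_unicode(line) call may be enough, although that is an assumption about this copy of utils.py. The sketch below shows the same idea using only the standard library; iter_sentences is a hypothetical helper, not part of the repository, and it assumes the files are mostly UTF-8 with a few stray bytes such as 0xff.

```python
import io


def iter_sentences(path, encoding="utf-8", errors="ignore"):
    """Yield each non-empty line of `path` as a list of tokens.

    Hypothetical helper: bytes that cannot be decoded with `encoding`
    are silently dropped because of errors="ignore".
    """
    with io.open(path, "r", encoding=encoding, errors=errors) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens
```

Dropping bytes silently can hide a wrong encoding guess. If the files are actually Latin-1 or UTF-16 (a leading 0xff 0xfe is a UTF-16 byte order mark), passing the correct encoding is likely a better fix than ignoring errors.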

sumehta commented May 24, 2017

I get the same error when I try to load the model with:
model = Word2Vec.load_word2vec_format('test.txt.model', binary=True)
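
A possible cause, offered as a guess rather than a confirmed diagnosis: load_word2vec_format expects the original word2vec file layout, whose first line is the vocabulary size and vector size as ASCII integers. A file written some other way (for example a pickled model) will not start like that, and decoding it can fail with this kind of error. The hypothetical helper below, which is not part of any library, checks only that header line.

```python
import io


def looks_like_word2vec_header(path):
    """Return True if the first line looks like '<vocab_size> <vector_size>'.

    Hypothetical diagnostic: it inspects only the header line and does not
    validate the rest of the file.
    """
    with io.open(path, "rb") as handle:
        first_line = handle.readline(100)  # read at most 100 bytes
    parts = first_line.split()
    return len(parts) == 2 and all(part.isdigit() for part in parts)
```

If this returns False for 'test.txt.model', the file is probably not in word2vec format, and it may need to be loaded with whatever method matches how it was saved.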
