
not able to get data from csv file to train network in "train-theano.py" #14

Open
totial opened this issue Feb 3, 2017 · 3 comments

totial commented Feb 3, 2017

Hey, I'm having trouble getting the data to train the RNN, specifically on this line:
sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
If I open the file with 'rb' I get the error:

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

and if I open it with 'r' I get:

sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])

AttributeError: 'str' object has no attribute 'decode'

I'm not sure whether the basic idea is to train the NN on strings or on binary data (I'm guessing binary data).
Thanks for your time!
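
For reference, both errors point at the same Python 3 behaviour: csv.reader expects a text-mode file that yields str rows, so there is nothing left to .decode(). A minimal sketch that reproduces both messages, using the CSV path from the snippets below:

import csv

# Binary mode: csv.reader refuses bytes in Python 3
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # _csv.Error: iterator should return strings, not bytes

# Text mode: rows are already str, so .decode() no longer exists
with open('data/reddit-comments-2015-08.csv', 'r') as f:
    reader = csv.reader(f)
    row = next(reader)
    row[0].decode('utf-8')  # AttributeError: 'str' object has no attribute 'decode'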

@GoingMyWay

Maybe your Python version is 3.x; the code below runs without error under Python 2.7:

with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]

@chrischang80

You can remove ".decode('utf-8')" and try again.

Pavonlo commented Feb 7, 2019

> You can remove ".decode('utf-8')" and try again.

Yes, you must remove this, but a couple of other changes are also required, so the entire line becomes:
with open('data/reddit-comments-2015-08.csv', 'rt', encoding="utf8") as f:
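
Putting the two suggestions together, a sketch of the whole block adapted to Python 3. Note that reader.next() also has to become next(reader) in Python 3; the token values are assumed to be the literal SENTENCE_START / SENTENCE_END strings the original comments refer to.

import csv
import itertools
import nltk

# nltk.download('punkt') may be needed once for sent_tokenize
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

with open('data/reddit-comments-2015-08.csv', 'rt', encoding='utf8') as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header row; reader.next() is Python 2 only
    # Split full comments into sentences; rows are already str, so no .decode('utf-8')
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]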
