UnicodeDecodeError while parsing the PDF files. #17
Comments
Yes, that's true @adityardesai.
Thanks for letting us know @manalishah. I tried the given patch, but I am still seeing the same error. Am I missing any steps, apart from adding
Can you upload one of the PDF files that gives you this error? I can then reproduce the issue and try to resolve it. @adityardesai
Sure @manalishah. Attached is the sample file. I just added tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py and re-ran the REST server, and I get the same error again.
Hi
I am using the NLTKRest server to parse a few of the PDF files from Polar Trec Data and extract the required NER quantities. For most of the PDF files, however, the REST server returns the following error.
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128) // Werkzeug Debugger "
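For anyone hitting the same message: byte 0xc3 is the lead byte of a two-byte UTF-8 sequence (for example "é"), so the error usually means UTF-8 encoded bytes were decoded with the ASCII codec somewhere along the way. The sketch below reproduces that failure in isolation and shows one defensive way to turn a request body into text. It is only an illustration: `to_text` is a hypothetical helper, not something from NLTKRest, and where exactly the server performs the ASCII decode is not known from this thread.

```python
# Illustrative sketch only: `to_text` is a hypothetical helper and is not
# part of NLTKRest or NLTK.

def to_text(raw, encoding="utf-8"):
    """Return text for a request body that may be bytes or already text."""
    if isinstance(raw, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return raw.decode(encoding, errors="replace")
    return raw


# "é" encodes to the two bytes 0xc3 0xa9 in UTF-8.
body = "café".encode("utf-8")

# Decoding those bytes as ASCII fails the same way as in the report.
try:
    body.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc)

# Decoding as UTF-8 recovers the original text.
text = to_text(body)
print(text)
```

The decoded text could then be passed on to the tokenizer. Whether this is enough depends on where the implicit ASCII conversion happens in the server, which the attached traceback should show.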
Command used is
curl -X POST -d "PDF TEXT in STRING" http://localhost:8888/nltk.
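As a possible way to rule out the client side, the same POST can be built in Python with the body explicitly encoded as UTF-8. This is a sketch under assumptions: `build_request` is a hypothetical helper, the URL is copied from the curl command above, and whether the server takes the `charset` in the Content-Type header into account is not known from this thread.

```python
from urllib.request import Request

# Hypothetical helper for illustration; the URL comes from the curl command above.
def build_request(text, url="http://localhost:8888/nltk"):
    """Build a POST request whose body is the given text encoded as UTF-8."""
    data = text.encode("utf-8")
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    return Request(url, data=data, headers=headers, method="POST")


req = build_request("café in PDF text")

# To actually send it (requires the REST server to be running):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

This makes the encoding of the request body explicit, so any remaining decode error can be attributed to the server.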
Error file is attached as well.
nltkrest.txt