Crash with URLs containing non-ASCII characters #6
Reproduced by crawling URL https://en.wikipedia.org/wiki/Svenska_Antipiratbyrån
I'm on the latest master commit, 0a0d50f
I added an `except UnicodeDecodeError` in `_browse_from_links` with some messages to the debug and info logs, and it's now running so I can check whether I get more information.
I dug further into this. The problem is not with the encoding of the file noisy.py (set with `# -*- coding: utf-8 -*-` at the beginning of the file, for instance, as a good practice) but with the default or detected encoding. The page at http://www.kopimi.com/ is in fact encoded in cp1252, as stated in its header.
It seems that the requests module is aware of the encoding, but for some reason it fails to identify the encoding of the page properly.
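For context, requests historically falls back to ISO-8859-1 for any `text/*` response that does not declare a charset in its Content-Type header (per RFC 2616), which is why a cp1252 or UTF-8 page can be mis-decoded. A minimal sketch of that fallback logic, using only the standard library (`guess_encoding` is a hypothetical helper, not part of requests):

```python
def guess_encoding(content_type):
    # Mimic the historical requests fallback: use the declared charset if
    # present, otherwise default text/* responses to ISO-8859-1 (RFC 2616).
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip("\"' ")
    if content_type.split(";")[0].strip().lower().startswith("text/"):
        return "ISO-8859-1"
    return None

print(guess_encoding("text/html; charset=windows-1252"))  # declared charset wins
print(guess_encoding("text/html"))                        # fallback to ISO-8859-1
```

When the page body is actually cp1252 or UTF-8, decoding it as ISO-8859-1 produces mojibake or, further downstream, Unicode errors.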
I guess that a failure is expected when the content of the page is decoded from its falsely identified ISO-8859-1 encoding to Unicode prior to further transformation/analysis. My suggestion is to detect a decoding error of links in `_is_blacklisted`; a diff file is attached. The drawback is that some links will be dropped, but it might be an acceptable trade-off against a much more complex, encoding-resilient parser.

For the record, a clean protection of the decoding would be something like `url = unicode(url.decode(errors="replace"))`, but at the end of the day the link will likely not work anyway, and the call would need to be guarded since `unicode()` was dropped in Python 3. For Python 3, I do not know whether the parsing is more robust or the conversion simply fails silently.
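A minimal sketch of such a guard, assuming a hypothetical `safe_decode` helper (the actual patch is in the attached diff): links whose bytes cannot be decoded are reported as unusable so the caller can drop them.

```python
def safe_decode(url_bytes):
    # Hypothetical helper: return the decoded URL, or None if the bytes
    # are not valid UTF-8, so the caller can simply skip the link.
    try:
        return url_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None

# A cp1252/latin-1 byte string ("å" as 0xE5) is not valid UTF-8:
print(safe_decode(b"https://example.com/ok"))
print(safe_decode(b"https://example.com/Piratbyr\xe5n"))  # undecodable -> dropped
```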
Why not just skip URLs that cause errors? Since this program is designed to make noise in the HTTP world, it seems pointless to fix issues like this instead of just skipping to the next URL and continuing with the sequence.
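A minimal sketch of that "skip and continue" approach: any link whose processing raises a Unicode error is dropped silently and the crawl moves on. Here `ascii_only` is a hypothetical stand-in for the real per-link work, chosen only because it raises on non-ASCII input.

```python
def noisy_crawl(links, handle):
    # Drop any link whose processing raises a Unicode error; keep crawling.
    survivors = []
    for link in links:
        try:
            survivors.append(handle(link))
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # skip the troublesome URL, move on to the next one
    return survivors

def ascii_only(link):
    # Hypothetical per-link step that chokes on non-ASCII characters.
    link.encode("ascii")
    return link

print(noisy_crawl(["https://example.com/a", "Piratbyrån"], ascii_only))
```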
Dropping the troublesome links is indeed the approach taken in #10.
It seems to have worked for me since the last PR, #10.
My pleasure. It seems that I messed up the associated pull request(s). I'll resubmit one with the proper fix.
INFO:root:Visiting http://lady.163.com/special/photo-search/#q=陈学冬 |
@wxdczd was your code manually patched with https://github.com/1tayH/noisy/pull/16/files, or is it the vanilla head from https://github.com/1tayH/noisy?
I am using the original: https://github.com/1tayH/noisy
@wxdczd OK, so it is very likely that the encountered issue is addressed by the proposed pull request. If it's urgent, you might want to give the master branch of my fork a try.
@Arduous Again, it crashed.
@wxdczd
INFO:root:Visiting http://ent.163.com/photoview/00AJ0003/659065.html |
Greetings. I'm using patched code from https://github.com/1tayH/noisy/pull/16/files (using Python 2.7).
Hello, I see that the target webpage is not declaring its encoding properly. Unfortunately I am not able to reproduce. Could you indicate the version of the requests library by running `python -m requests.help`? The error masking was recently improved, and any exception from requests should get caught at line 177.
Greetings.
Thank you. It seems I should have asked you to run `python3 -m requests.help` so that the correct interpreter is used, but the error might not come from there, as your Python 2 install seems up to date. I am unfortunately not able to reproduce. Are you able to reproduce the issue easily, or did it happen randomly? In the former case, could you share your config.json? What is the output of
It seems I can't reproduce it easily, but here is another one:
This is my config (it has 1 million links, so the packed archive is about 6 MB).
Thanks. I'll be back in a few days, but in the meanwhile…
It seems to have been working non-stop for 4 days with no errors.
My interpreter, as seen below, is Python 2.7.
Here are the last URLs visited, along with the traceback of the exception.
The correct link is https://en.wikipedia.org/wiki/Piratbyr%C3%A5n
I was not able to confirm, but I think that the problem is coming from:
noisy/noisy.py, line 19, at commit ae70264,
where latin-1 is mandated. Wouldn't a standard UTF-8 approach work better?
Adding
# -*- coding: utf-8 -*-
at the top of noisy.py should do the trick. It would tell Python 2 to work with UTF-8, and be transparent to Python 3.
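To illustrate what that declaration does (a minimal sketch, not the project's actual file): per PEP 263 the header only governs how the interpreter decodes the bytes of the source file itself, so non-ASCII string literals parse correctly under Python 2; Python 3 already defaults to UTF-8, where it is a no-op. It does not change how downloaded pages are decoded at runtime.

```python
# -*- coding: utf-8 -*-
# The line above tells Python 2 that this source file's bytes are UTF-8;
# under Python 3 (UTF-8 by default) it has no effect. Runtime decoding of
# HTTP responses is unaffected either way.
TITLE = "Piratbyrån"  # a literal containing a non-ASCII character

print(TITLE)
```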