Title Strange Characters issue when reading RSS XML files not encoded in utf-8 #478
Comments
Then providing your changes would probably save others time, so they would be more willing to dive into the issue.
I think the feed has nothing to do with windows-1252.
$ chardet index.xml
index.xml: utf-8 with confidence 0.99
Unreproducible.
In [1]: import feedparser
In [2]: for entry in feedparser.parse('https://www.lfhacks.com/index.xml').entries:
   ...:     if entry.link == 'https://www.lfhacks.com/tech/python-find-positive/':
   ...:         print(repr(entry.title))
   ...:
'Python 找出序列里的符合要求的元素'
Indeed, in some cases, a UTF-8 feed may be decoded using windows-1252 or iso-8859-2 (probably because it is corrupt, and will be fixed by #421). But that's simply not your case.
The feed itself is valid UTF-8 XML, so no conversion is needed.
Thank you @Rongronggg9. Indeed, as you said, I found that the problem was caused by me using I will close this issue.
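Please note that it would be better to pass
from io import BytesIO
import feedparser
import requests
# URL is a placeholder for the feed address.
r = requests.get(URL)
# Hand feedparser the raw bytes plus the HTTP response headers.
with BytesIO(r.content) as bio:
    d = feedparser.parse(bio, response_headers=r.headers)
...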
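A minimal sketch of why handing raw bytes to an XML parser tends to be safer than handing it text that was already decoded. It uses only the standard library's xml.etree.ElementTree as a stand-in for feedparser. The sample feed, the name first_title, and the choice of ISO-8859-1 as the wrong charset are assumptions for illustration and do not come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: valid UTF-8 XML that declares its own encoding.
SAMPLE_FEED = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<rss><channel><item><title>Python 素</title></item></channel></rss>"
).encode("utf-8")


def first_title(document):
    """Return the first item title; accepts either bytes or str."""
    root = ET.fromstring(document)
    return root.findtext("channel/item/title")


# Bytes: the parser reads the XML declaration and decodes as UTF-8.
from_bytes = first_title(SAMPLE_FEED)

# Text decoded with the wrong charset (assumed here to be ISO-8859-1):
# the damage is done before the parser ever sees the document.
from_text = first_title(SAMPLE_FEED.decode("iso-8859-1"))

print(repr(from_bytes))
print(repr(from_text))
```

With bytes, the encoding decision stays with the parser, which can use the XML declaration. Once the caller has decoded the document to text, a wrong guess can no longer be corrected downstream.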
For example, when the RSS XML file is encoded in windows-1252 and the last byte of the title field's text is a whitespace character such as 0xA0 (NBSP), that byte is removed by the strip() function, which produces strange characters.
Here is a feed URL: https://www.lfhacks.com/index.xml Some of the article titles in it become garbled. For the link https://www.lfhacks.com/tech/python-find-positive/ the corresponding article title turns into strange characters, because the last byte of its windows-1252 encoding is 0xA0 and gets deleted.
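A minimal sketch of the effect described above, assuming UTF-8 bytes have been decoded as windows-1252 somewhere before strip() runs. The function name misdecode_and_strip and the one-character sample are hypothetical, and this is not feedparser's code.

```python
def misdecode_and_strip(text, wrong_encoding="cp1252"):
    """Decode the UTF-8 bytes of `text` with the wrong codec, then strip."""
    garbled = text.encode("utf-8").decode(wrong_encoding)
    return garbled, garbled.strip()


# The UTF-8 bytes of "素" are E7 B4 A0. Read as cp1252, the last byte
# becomes U+00A0 (NBSP), which str.strip() treats as whitespace.
garbled, stripped = misdecode_and_strip("素")
print(repr(garbled), repr(stripped))

# Without strip() the wrong decoding can still be undone.
recovered = garbled.encode("cp1252").decode("utf-8")

# After strip() one byte is missing, so the UTF-8 sequence is truncated
# and can only be decoded with a replacement character.
damaged = stripped.encode("cp1252").decode("utf-8", errors="replace")
print(repr(recovered), repr(damaged))
```

The sketch only shows that stripping happens after a wrong decoding step. It does not establish where that step occurs for a given feed.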
I have modified the code so that it no longer calls strip() to remove whitespace at both ends. As a temporary workaround, this meets the needs of my project. However, I think the fundamental fix is to convert the entire XML file to UTF-8 first and then parse it. I hope the developers can fix this issue as soon as possible.
Thank you.