
Strange Characters issue when reading RSS XML files not encoded in utf-8 #478

Closed
laiyonghao opened this issue Sep 24, 2024 · 4 comments


@laiyonghao

laiyonghao commented Sep 24, 2024

For example, when the RSS XML file encoding is windows-1252, if the last byte of the title field's text value is a whitespace character such as 0xA0, which is NBSP, it will be deleted by the strip() function, resulting in strange characters.

Here is a feed URL: https://www.lfhacks.com/index.xml. Some of the article titles in it become garbled. For example, the title of the article at https://www.lfhacks.com/tech/python-find-positive/ turns into strange characters because the last byte of its windows-1252 encoding is 0xA0 and gets deleted.

I have modified the code so that it no longer calls strip() to trim whitespace at both ends. As a temporary workaround this is enough for my project, but I think the fundamental fix is to first convert the entire XML file to UTF-8 encoding and then parse it. I hope the developers can fix this issue as soon as possible.
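A minimal sketch of the mechanism I mean (assuming the UTF-8 bytes are mis-decoded with a single-byte codec; latin-1 is used here only because, like windows-1252, it maps byte 0xA0 to NBSP and can decode every byte):

title = "Python 找出序列里的符合要求的元素"  # the title of the linked article
raw = title.encode("utf-8")
assert raw[-1] == 0xA0                       # last byte is the 3rd byte of "素" (E7 B4 A0)

mojibake = raw.decode("latin-1")             # wrong single-byte decoding; final char is '\xa0' (NBSP)
stripped = mojibake.strip()                  # strip() drops the NBSP, i.e. one byte of "素"

# Round-tripping back to UTF-8 can no longer recover the last character.
broken = stripped.encode("latin-1").decode("utf-8", errors="replace")
print(broken)                                # ends with the U+FFFD replacement character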

Thank you.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 24, 2024

I have modified the code

fix this issue as soon as possible.

Then providing your changes would probably save others' time, making them more willing to dive into the issue.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 24, 2024

when the RSS XML file encoding is windows-1252

lfhacks.com/index.xml

I think the feed has nothing to do with windows-1252.

$ chardet index.xml
index.xml: utf-8 with confidence 0.99

lfhacks.com/tech/python-find-positive The corresponding article title will become 'strange characters'

Unreproducible.

In [1]: import feedparser

In [2]: for entry in feedparser.parse('https://www.lfhacks.com/index.xml').entries:
   ...:     if entry.link == 'https://www.lfhacks.com/tech/python-find-positive/':
   ...:         print(repr(entry.title))
   ...: 
'Python 找出序列里的符合要求的元素'

Indeed, in some cases, a UTF-8 feed may be decoded using windows-1252 or iso-8859-2 (probably because it is corrupt, and will be fixed by #421). But that's simply not your case.

I think the fundamental solution to the problem is to first convert the entire XML file into UTF-8 encoding

The feed itself is valid UTF-8 XML; no conversion is needed.

@laiyonghao
Author

laiyonghao commented Sep 25, 2024

Thank you @Rongronggg9 .

Indeed, as you said, feedparser is not the cause of the problem.

I found that the problem was caused by using requests.get(URL).text to retrieve the file content. There is no problem when calling parse(URL) directly, and no problem when using requests.get(URL).content either; the garbling is caused by requests guessing the encoding incorrectly when there is insufficient information.
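A short sketch of the difference, assuming the feed is fetched with requests as in my project (r.text re-decodes the body with the encoding requests guessed, while r.content keeps the raw bytes for feedparser to handle):

import feedparser
import requests

url = "https://www.lfhacks.com/index.xml"
r = requests.get(url)

# r.text decodes the body with requests' guessed encoding; when the response headers
# give insufficient information the guess can be wrong, so feedparser receives
# already-mangled text.
maybe_garbled = feedparser.parse(r.text)

# r.content is the untouched byte string, so feedparser can detect the encoding itself.
ok = feedparser.parse(r.content)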

I will close this issue.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 25, 2024

it is caused by requests guessing encoding

feedparser can guess the encoding more properly because the functionality is implemented according to the various RFCs related to web feeds. Thus, it is completely unnecessary to let requests do this instead.

There won't be any problem using requests.get (URL).content

Please note that it would be better to pass response_headers along to make some internal functionalities, including encoding detection, work more smoothly. E.g.,

from io import BytesIO

import feedparser
import requests

r = requests.get(URL)
# Wrap the raw bytes in a file-like object and forward the HTTP response headers
# so feedparser's encoding detection (and other header-based handling) can work.
with BytesIO(r.content) as bio:
    d = feedparser.parse(bio, response_headers=r.headers)
...
