
Strange Characters issue when reading RSS XML files not encoded in utf-8 #478

Closed
laiyonghao opened this issue Sep 24, 2024 · 4 comments


@laiyonghao

laiyonghao commented Sep 24, 2024

For example, when the RSS XML file encoding is windows-1252, if the last byte of the title field's text value is a whitespace character such as 0xA0, which is NBSP, it will be deleted by the strip() function, resulting in strange characters.

Here is a feed URL: https://www.lfhacks.com/index.xml. Some of the article titles in it become garbled. For example, the title of the article at https://www.lfhacks.com/tech/python-find-positive/ turns into strange characters because the last byte of its windows-1252 encoding is 0xA0 and gets deleted.

I have modified the code so that it no longer calls strip() to trim whitespace at both ends. As a temporary workaround this is enough for my project, but I think the fundamental fix is to first convert the entire XML file to UTF-8 encoding and then parse it. I hope the developers can fix this issue as soon as possible.
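A minimal sketch of the mechanism I mean (assuming the UTF-8 bytes are mis-decoded with a single-byte codec; latin-1 is used here only because, like windows-1252, it maps byte 0xA0 to NBSP and can decode every byte):

title = "Python 找出序列里的符合要求的元素"  # the title of the linked article
raw = title.encode("utf-8")
assert raw[-1] == 0xA0                       # last byte is the 3rd byte of "素" (E7 B4 A0)

mojibake = raw.decode("latin-1")             # wrong single-byte decoding; final char is '\xa0' (NBSP)
stripped = mojibake.strip()                  # strip() drops the NBSP, i.e. one byte of "素"

# Round-tripping back to UTF-8 can no longer recover the last character.
broken = stripped.encode("latin-1").decode("utf-8", errors="replace")
print(broken)                                # ends with the U+FFFD replacement character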

Thank you.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 24, 2024

I have modified the code

fix this issue as soon as possible.

Then providing your changes would probably save others' time, making them more willing to dive into the issue.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 24, 2024

when the RSS XML file encoding is windows-1252

lfhacks.com/index.xml

I think the feed has nothing to do with windows-1252.

$ chardet index.xml
index.xml: utf-8 with confidence 0.99

lfhacks.com/tech/python-find-positive The corresponding article title will become 'strange characters'

Unreproducible.

In [1]: import feedparser

In [2]: for entry in feedparser.parse('https://www.lfhacks.com/index.xml').entries:
   ...:     if entry.link == 'https://www.lfhacks.com/tech/python-find-positive/':
   ...:         print(repr(entry.title))
   ...: 
'Python 找出序列里的符合要求的元素'

Indeed, in some cases, a UTF-8 feed may be decoded using windows-1252 or iso-8859-2 (probably because it is corrupt, and will be fixed by #421). But that's simply not your case.

I think the fundamental solution to the problem is to first convert the entire XML file into UTF-8 encoding

The feed itself is valid UTF-8 XML; no conversion is needed.

@laiyonghao
Author

laiyonghao commented Sep 25, 2024

Thank you @Rongronggg9 .

Indeed, as you said, feedparser is not the cause of the problem.

I found that the problem was caused by using requests.get(URL).text to retrieve the file content. There is no problem when calling parse(URL) directly, and no problem when using requests.get(URL).content either; the garbling is caused by requests guessing the encoding incorrectly when there is insufficient information.
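A short sketch of the difference, assuming the feed is fetched with requests as in my project (r.text re-decodes the body with the encoding requests guessed, while r.content keeps the raw bytes for feedparser to handle):

import feedparser
import requests

url = "https://www.lfhacks.com/index.xml"
r = requests.get(url)

# r.text decodes the body with requests' guessed encoding; when the response headers
# give insufficient information the guess can be wrong, so feedparser receives
# already-mangled text.
maybe_garbled = feedparser.parse(r.text)

# r.content is the untouched byte string, so feedparser can detect the encoding itself.
ok = feedparser.parse(r.content)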

I will close this issue.

@Rongronggg9
Contributor

Rongronggg9 commented Sep 25, 2024

it is caused by requests guessing encoding

feedparser can guess the encoding more properly because the functionality is implemented according to the various RFCs related to web feeds. Thus, it is completely unnecessary to let requests do this instead.

There won't be any problem using requests.get (URL).content

Please note that it would be better to pass response_headers along to make some internal functionalities, including encoding detection, work more smoothly. E.g.,

from io import BytesIO

import feedparser
import requests

r = requests.get(URL)
# Wrap the raw bytes in a file-like object and forward the HTTP response headers
# so feedparser's encoding detection (and other header-based handling) can work.
with BytesIO(r.content) as bio:
    d = feedparser.parse(bio, response_headers=r.headers)
...
