encoding iso-8859-2 #409

Open
maciexZ opened this issue Aug 2, 2020 · 1 comment

@maciexZ
maciexZ commented Aug 2, 2020

Hi, I have a problem with the encoding of a webpage that is encoded in ISO-8859-2.

If I run:

from requests_html import HTMLSession

session = HTMLSession()
page = session.get('https://bonito.pl/bestsellery')

page.html is missing the Polish letters.
I tried to work around it:

from urllib.request import urlopen
from requests_html import HTML

textPage = urlopen("https://bonito.pl/bestsellery")
textPage = textPage.read().decode("ISO-8859-2")  # Polish letters are decoded properly
page = HTML(html=textPage, default_encoding="ISO-8859-2")  # page.html still lacks Polish letters

and also

textPage = urlopen("https://bonito.pl/bestsellery")  # fresh response, so read() works again
textPage = textPage.read().decode("ISO-8859-2").encode("UTF-8")
page = HTML(html=textPage, default_encoding="UTF-8")

but neither attempt solved the problem.

If I use BeautifulSoup:

page = BeautifulSoup(textPage,"lxml",from_encoding="utf-8")
page = HTML(html=page.html)

Polish letters are properly decoded and visible.

I definitely prefer requests_html over bs4, so I would be very grateful for help. How can I solve this issue?
Thanks!

@snowmanjx
Have you tried changing page.html.encoding to utf-8?
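
To see why the charset used for decoding matters here, a minimal standard-library sketch may help. It does not touch requests_html or the real page: the `sample` string and the `raw` bytes are hypothetical stand-ins for a response body served as ISO-8859-2. Whether assigning an encoding on `page.html` makes requests_html decode the same way is an assumption that would need checking against the library.

```python
# Hypothetical stand-in for a response body served as ISO-8859-2;
# this is NOT the content of the real page from the issue.
sample = "Zażółć gęślą jaźń"
raw = sample.encode("iso-8859-2")

# Decoding with the charset the bytes were written in restores the text.
right = raw.decode("iso-8859-2")

# ISO-8859-1 accepts every byte, so the mistake is silent:
# the Polish letters turn into other Latin characters.
wrong = raw.decode("iso-8859-1")

# Treating the bytes as UTF-8 fails on the non-ASCII letters;
# with errors="replace" they become U+FFFD placeholders.
replaced = raw.decode("utf-8", errors="replace")

print(right == sample)
print(wrong == sample)
print("\ufffd" in replaced)
```

The same bytes give three different strings, so the encoding value has to match what the server actually sent (ISO-8859-2 for this page, according to the report above) rather than what the output should be converted to.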
