Parsing non-UTF-8 pages #6

edevil · 2017-05-31T15:11:41Z

Parsing pages not written in UTF-8 currently produces errors:

> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}

In this case, the XML feed declares its encoding in the XML preamble:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Can I get around this problem or can the library be fixed to handle this situation?


mischov · 2017-05-31T17:27:55Z

I'll leave the broader question of "can the library be fixed to handle this situation?" to Hans, but-

> Can I get around this problem

Yeah, for some definition of "get around": convert the body to UTF-8 yourself before parsing.

body
|> Codepagex.to_string!(:iso_8859_1)
|> Html5ever.parse()
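
If the encoding is not known ahead of time, the same idea can be sketched with only the Elixir/Erlang standard library: read the encoding from the XML declaration and transcode before parsing. This is a minimal sketch, not part of Html5ever or Codepagex. The module `EncodingSketch` and its functions are hypothetical names, and it assumes only ISO-8859-1 needs handling; other encodings would still need something like Codepagex.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper for illustration; not an Html5ever or Codepagex API.

  # Returns the encoding named in a leading XML declaration (lowercased),
  # or nil when there is none. The regex runs in byte mode (no `u` flag),
  # so it is safe to use on a body that is not valid UTF-8.
  def declared_encoding(body) when is_binary(body) do
    regex = ~r/\A\s*<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/

    case Regex.run(regex, body) do
      [_, encoding] -> String.downcase(encoding)
      nil -> nil
    end
  end

  # Returns {:ok, utf8_binary} or {:error, :unsupported_encoding}.
  def to_utf8(body) when is_binary(body) do
    case declared_encoding(body) do
      "iso-8859-1" ->
        # Every byte is a valid Latin-1 code point, so this always succeeds.
        {:ok, :unicode.characters_to_binary(body, :latin1, :utf8)}

      _other ->
        # No declaration, or one this sketch does not handle: accept the
        # body only if it is already valid UTF-8.
        if String.valid?(body), do: {:ok, body}, else: {:error, :unsupported_encoding}
    end
  end
end

# Usage, assuming `body` is the response body from the original report:
#
#   {:ok, utf8} = EncodingSketch.to_utf8(body)
#   Html5ever.parse(utf8)
```

Checking the declaration before `String.valid?/1` is deliberate: a Latin-1 body can happen to be valid UTF-8, so the declared encoding is the more reliable signal when it is present.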

edevil · 2017-06-01T11:22:43Z

Thanks, @mischov!

hansihe · 2017-06-01T11:45:11Z

Going to keep this open; I would still like to find a proper solution for this.

As far as I can tell, html5ever does not support detecting encoding yet. See this issue.

philss mentioned this issue Jun 1, 2017

Encoding is not taken into account when parsing file philss/floki#116

Closed

edevil closed this as completed Jun 1, 2017

hansihe reopened this Jun 1, 2017
