Text containing &lt; or &gt; is decoded to < or > symbols when parsed #71

GovardhanNag · 2021-12-14T14:14:19Z

We are using the CSSBox DOM parser to parse HTML source. Here is the implementation:

try (DocumentSource docSource = new StreamDocumentSource(
        JAFIOUtils.toInputStream(htmlSource), null, "text/html;charset=UTF-8")) {
    LOGGER.error("Before parse " + htmlSource);
    // Parse the input document
    DOMSource parser = new DefaultDOMSource(docSource);
    Document doc = parser.parse();
    LOGGER.error("After parse " + doc.getFirstChild().getTextContent());
}

For example, suppose the input source (htmlSource) is <style></style>Test User &lt;[email protected]&gt;
After parsing, the output is Test User <[email protected]>.

Here the text content, which contains an email address enclosed in &lt; and &gt;, is decoded to < and >. Our requirement is that the parser should not decode &lt; and &gt; to < and >.

How can we retain the text exactly as it is, without any decoding or encoding? @radkovo, could you please suggest a solution for this issue?
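
To illustrate the behavior described above, here is a minimal sketch that uses only the JDK's built-in XML parser, not CSSBox. It assumes that CSSBox text nodes behave like standard DOM text nodes, which store decoded characters, so getTextContent() returns < and > rather than the entities. The class name EntityRoundTrip, the helpers parseText and escapeText, and the address user@example.com are all hypothetical placeholders. Re-escaping the text after parsing is one possible workaround, not a confirmed CSSBox feature.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntityRoundTrip {

    // Hypothetical helper: re-escape markup characters in decoded DOM text.
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Parse a small XML fragment with the JDK parser (a stand-in for CSSBox)
    // and return the text content of the root element.
    static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder address, assumed for illustration only.
        String decoded = parseText("<p>Test User &lt;user@example.com&gt;</p>");
        System.out.println(decoded);             // entities arrive decoded
        System.out.println(escapeText(decoded)); // entities restored on output
    }
}
```

Under these assumptions, the same escapeText call could be applied to the string returned by doc.getFirstChild().getTextContent() in the snippet above before logging or serializing it.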

GovardhanNag · 2021-12-21T11:05:23Z

Hi @radkovo,
Could you please suggest a solution to this issue: #71 (comment)?
Thanks in advance.

rasmusfaber mentioned this issue Apr 13, 2023

Properly escape contents of text nodes and use self-closing tags for empty nodes #82

Open
