Text containing &lt; or &gt; is decoded to < or > symbols when parsed #71

Open
GovardhanNag opened this issue Dec 14, 2021 · 1 comment

Hi @radkovo ,

We are using CSSBox DOM parser for parsing the HTML source, here is the implementation:

try (DocumentSource docSource = new StreamDocumentSource(JAFIOUtils.toInputStream(htmlSource),
        null, "text/html;charset=UTF-8")) {
    LOGGER.error("Before parse " + htmlSource);
    // Parse the input document
    DOMSource parser = new DefaultDOMSource(docSource);
    Document doc = parser.parse();
    LOGGER.error("After parse " + doc.getFirstChild().getTextContent());
}

For example, suppose the input source (htmlSource) is <style></style>Test User &lt;[email protected]&gt;
After parsing, the output is Test User <[email protected]>.

Here the text content contains an email address enclosed in &lt; and &gt;, and the parser decodes these to < and >. Our requirement is that the parser must not decode &lt; and &gt; to < and >.

How can we retain the text exactly as it is, without any decoding or encoding? @radkovo, could you please suggest a solution?
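
To make the behavior concrete, here is a minimal sketch that does not use CSSBox at all. It assumes the plain JDK XML parser as a stand-in, and the class name EntityRoundTrip, the helper names, and the address user@example.com are hypothetical. It shows that a DOM text node holds the decoded characters, and that one possible workaround is to re-escape the text after reading it from the DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityRoundTrip {

    // Re-escape the characters that a parser produces from &amp;, &lt; and &gt;.
    // Hypothetical helper; not part of CSSBox.
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Parse a small well-formed fragment with the JDK parser (an assumption,
    // used here instead of CSSBox) and return the root element's text content.
    static String decodedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder address, chosen only for illustration.
        String source = "<p>Test User &lt;user@example.com&gt;</p>";
        String decoded = decodedText(source);
        System.out.println(decoded);             // entity references are decoded here
        System.out.println(escapeText(decoded)); // escaped form restored by hand
    }
}
```

Decoding entity references is standard parser behavior, so the escaped form is normally recovered at output time rather than preserved inside the DOM. Whether CSSBox offers a switch to disable decoding is not something this sketch addresses.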

GovardhanNag commented Dec 21, 2021

Hi @radkovo ,
Could you please suggest a solution to this issue - #71 (comment)?
Thanks in advance.
