Upgrade CSSbox to use active NekoHTML? #72

Open
soundasleep opened this issue Mar 17, 2022 · 3 comments

Comments

@soundasleep
Contributor

It seems that CSSBox is set up to use nekohtml 1.9.22, whose development appears to have ceased in 2015.

Nekohtml has since been forked into a new project https://github.com/HtmlUnit/htmlunit-neko which has active development (2.59.0 was released 14 days ago).

Is there any appetite for upgrading Cssbox to use the newer nekohtml?

Alternatively is there a way to configure a local installation to use this nekohtml instead?

(I ask because I'm hitting some weird bugs that I think are due to nekohtml's parsing.)

@soundasleep
Contributor Author

With a bit of hacking I've found a way to configure a local installation (through Gradle) to use nekohtml 2.59.

Configure your build.gradle to exclude the transitive dependency:

implementation("net.sf.cssbox:cssbox:$cssboxVersion") {
  exclude group: "net.sourceforge.nekohtml", module: "nekohtml" 
}
implementation "net.sourceforge.htmlunit:neko-htmlunit:$nekoHtmlUnitVersion"
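To confirm that the exclusion took effect, one option is to probe the runtime classpath for each parser's entry class. This is only a minimal sketch: `NekoProbe` is a hypothetical helper, and the three class names are assumptions about where each artifact places its `DOMParser` (original nekohtml, the 2.x neko-htmlunit fork, and the 3.x htmlunit-neko fork). Adjust them to match the jars you actually have.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NekoProbe {
    // Assumed entry classes for each NekoHTML variant; verify against your jars.
    static final String[] CANDIDATES = {
        "org.cyberneko.html.parsers.DOMParser",                 // original nekohtml 1.9.x
        "net.sourceforge.htmlunit.cyberneko.parsers.DOMParser", // neko-htmlunit 2.x (assumption)
        "org.htmlunit.cyberneko.parsers.DOMParser"              // htmlunit-neko 3.x (assumption)
    };

    /** Returns true if the named class can be located without initializing it. */
    public static boolean isPresent(String className) {
        try {
            Class.forName(className, false, NekoProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Maps each candidate class name to whether it is on the classpath. */
    public static Map<String, Boolean> probe() {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            result.put(name, isPresent(name));
        }
        return result;
    }

    public static void main(String[] args) {
        probe().forEach((name, present) ->
            System.out.println((present ? "found   " : "missing ") + name));
    }
}
```

Run it with the same runtime classpath as your application. If the first entry is still reported as found, the old nekohtml jar is probably being pulled in through another dependency.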

After that, it looks like the only change needed is to stop using DefaultDOMSource and supply your own DOMSource subclass:

public class BetterDOMSource extends DOMSource {
	public BetterDOMSource(DocumentSource src) {
		super(src);
	}

	@Override
	public Document parse() throws SAXException, IOException {
		DOMParser parser = new DOMParser(new HTMLConfiguration());
		parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
		if (charset != null)
			parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset);
		parser.parse(new org.xml.sax.InputSource(getDocumentSource().getInputStream()));
		return parser.getDocument();
	}
}

And use this source to load your Documents instead:

ByteArrayInputStream is = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
StreamDocumentSource source = new StreamDocumentSource(is, url, "text/html");

DOMSource parser = new BetterDOMSource(source);
Document document = parser.parse();
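Since the parser is configured with the names/elems property set to "lower", a quick sanity check after swapping parsers is to walk the returned Document and list any element whose name is not lower-case. The sketch below uses only the JDK's DOM API. `ElementCaseCheck` and its methods are hypothetical names, and the demo builds its Document with the JDK's XML parser rather than NekoHTML, purely so that it runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class ElementCaseCheck {
    /** Returns the names of all elements under node that are not lower-case, in document order. */
    public static List<String> nonLowerCaseElements(Node node) {
        List<String> found = new ArrayList<>();
        collect(node, found);
        return found;
    }

    // Depth-first walk over the DOM tree.
    private static void collect(Node node, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (!name.equals(name.toLowerCase(Locale.ROOT))) {
                found.add(name);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, found);
        }
    }

    /** Demo-only helper: parses well-formed XML with the JDK parser (not NekoHTML). */
    public static Document parseXml(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseXml("<html><BODY><p>hi</p><DIV/></BODY></html>");
        System.out.println(nonLowerCaseElements(doc)); // prints [BODY, DIV]
    }
}
```

In a real project you would pass the Document returned by parser.parse() to nonLowerCaseElements and expect an empty list.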

@miurahr
Contributor

miurahr commented Oct 29, 2023

I think we can use org.htmlunit:htmlunit-neko:3.6.0 for the CSSBox project.
It fixes CVE-2022-29546 and CVE-2022-28366.

I've changed DefaultDOMSource like this:

    public Document parse() throws SAXException, IOException
    {
        DOMParser parser = new DOMParser(HTMLDocumentImpl.class);
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
        if (charset != null)
            parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset);
        parser.parse(new org.xml.sax.InputSource(getDocumentSource().getInputStream()));
        return parser.getDocument();
    }

@miurahr
Contributor

miurahr commented Oct 29, 2023

I proposed the change to [email protected] in Apr. 2023 and updated it further today. See #81.
