Skip to content

Make attribute value and text content escaping more conforming #304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

PolariTOON
Copy link

Fixes #216
Supersedes and closes #217

The HTML spec defines which characters to escape when serializing HTML. This got changed recently to also serialize < and > in attribute values.

This aligns with the DOM Parsing and Serialization spec which defines its own rule for escaping attribute values and for escaping text nodes, both requiring these two characters to always be escaped.

HTML also requires non-breaking spaces to be escaped (in HTML).
Also browsers de facto serialize tabs, line feeds and carriage returns in attribute values in XML (there is no spec for this, but there is a web platform test). JSDOM does this too.

This PR implements those requirements.

This PR also brings another fix about <title> elements which used to be considered a special case when serializing in this codebase, while i couldn't find anything in the spec about this (the <title> element is not in this list). Instead, we customize its innerHTML setter to escape everything, which prevents child elements to be added to them through this property (for HTML documents only).

To match `XMLSerializer` for XML documents and `element.outerHTML` for HTML documents.
@PolariTOON PolariTOON force-pushed the attribute-serialization branch from d2ca98c to 38595ed Compare May 29, 2025 17:07
@PolariTOON PolariTOON force-pushed the attribute-serialization branch from 38595ed to f67a19c Compare May 29, 2025 17:10
@PolariTOON PolariTOON force-pushed the attribute-serialization branch from f67a19c to 6aa67ee Compare May 29, 2025 17:17
@WebReflection
Copy link
Owner

WebReflection commented May 29, 2025

I think this is a breaking change … and title doesn’t need escaping or it results as title escaped … if you’re after JSDOM compliance I think you should stop trying to make LinkeDOM JSDOM compliant but if you need for real world, not testing related reasons, these changes, please be careful with suggested changes because this library is not just a mock target, it can render in workers and edge cloud scenarios

@WebReflection
Copy link
Owner

to clarify, this is a perfectly valid document.title every browser will understand:

<title>1 <and>&</and> 2</title>

it's a special node that acts like style or script (a data node) and I am not sure it needs fixing as it is right now ... true that if html-escape should consider non breaking spaces too I need to fix this library too and these changes will be welcome but it's weird nobody ever complained to date about that library not catching the &nbsp; explicitly which is why I think this might be a breaking thing in here ... do you have concrete examples where the current state breaks the Web? Thank you!

@PolariTOON
Copy link
Author

PolariTOON commented May 30, 2025

My general use case for the three PRs i have opened is actually the same one. I'm doing SSR of XHTML and SVG documents and until now i was indeed using JSDOM for this. However i'm starting to face performance issues, as JSDOM implements many things i don't need. Basically all i need is parsing, basic DOM manipulation (setting attributes and adding nodes) and serializing the whole thing, and LinkeDOM seems to fit my needs. Thanks for making it!

The data i generate is versioned though, so while this not about testing, i indeed want to minimize the diffs, and together, the PRs i have opened achieve avoiding having any diff on my data.

That said, the particular issues i'm trying to fix with this PR are:

  • generating malformed XHTML documents that browsers can't open, due to < and > not being escaped in attribute values.

  • attribute values containing tabs and newlines not eventually roundtripping, resulting in a data loss. XML parsers in browsers normalize unescaped tabs and line feeds as spaces. I'm displaying attributes with CSS, so i want to keep these characters and that requires escaping them, as specced.

  • some XSS issues with the current <title> implementation, that i realized while looking at this test:

    const {document} = parseHTML("<html><head><title>Title</title></head></html>");
    document.title = "</title><script>alert(1)</script>";
    // or document.querySelector("title").textContent = "</title><script>alert(1)</script>";
    // or document.querySelector("title").innerHTML = "</title><script>alert(1)</script>";
    document.toString(); // <html><head><title></title><script>alert(1)</script></title></head></html>

    This PR fixes the three XSS issues above by:

    • not overriding titleElement.toString()
    • not delegating titleElement.innerHTML to titleElement.textContent
    • escaping values provided to titleElement.innerHTML's setter

    That last item is precisely the special-case you're talking about in the HTML parser: most of the time you don't have to (double) escape the title content yourself as the engine (LinkeDOM here) is supposed to do it for you.
    Should i add this example as testcases?

While looking at the spec, i also fixed other escaping cases i encountered. &nbsp; is just one of these so I don't have any use case for it to be honest (as i'm focusing on XHTML rather than on HTML).

@WebReflection
Copy link
Owner

JSDOM implements many things i don't need

this is the key and reason this project is fast ... so let's talk about this: why would you need attributes insertion order, as example or why should I add complexity for things nobody practically cares about?

I am afraid I'll be a push back for anything not really interesting, relevant, solving anything useful and so on ... there is a bug? PR welcome ... there is a not 100% strict behavior nobody cares about or use in the real world? PR not welcome as I don't care maintaining those things and neither should you.

@WebReflection
Copy link
Owner

I think there are different things to consider here:

  • XSS on the server feels like FUD to me ... you produce the output and if you do that after your client side users pass arbitrary strings to produce such output you have way more issues than yourself creating XSS to your own site ... it's like stating that if you have secrets in your server these might leak ... well, of course if you don't know what library you are using, otherwise that's just not how XSS works as XSS requires foreign unknown embedding of code, here, on the backend, there's not such thing unless you implement that
  • attributes should be escaped or not breaking ... this is the only fix worth it to me
  • title might sanitizaiton but I need tests that confirm current scenario won't break

So if we could simplify and split the MR in different precise fixes with tests, that'd be awesome ... otherwise it'll stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Characters (<) and (>) escaped incorrectly in xml attributes
2 participants