<th> is stripped even though it's an allowed tag if not part of full table markup #110

robertc · 2017-03-02T16:30:52Z

sanitizer.Sanitize("<th>Header</th>")

Result is Header. However:

sanitizer.Sanitize("<table><tr><th>Header</th></tr></table>")

Result is <table><tbody><tr><th>Header</th></tr></tbody></table>

My expectation was that when parsing a fragment rather than a document the full structure would not be required.

The text was updated successfully, but these errors were encountered:

mganss · 2017-03-02T17:38:12Z

In the first example, the th is not stripped by the sanitizer but by the HTML parser because it's invalid HTML. This has nothing to do with document vs. fragment, i.e. it would be the same situation if you had a stray th in a full document. The parser behaves identically as in a browser:

var e = document.createElement("div");
e.innerHTML = "<th>Header</th>";
// e.innerHTML === "Header"

panetta-net-au · 2017-04-21T06:23:31Z

Are you saying that HtmlSanitizer isn't intended to parse fragments?

The context matters for a fragment because your example works fine with a tr instead of div

var e = document.createElement("tr");
e.innerHTML = "<th>Header</th>";
// e.innerHTML === "<th>Header</th>"

If fragment parsing is intended, then you would probably need to add a context parameter to the Santize method.

sanitizer.Sanitize("<th>Header</th>", context: "tr")

or

sanitizer.Sanitize("<th>Header</th>", context: new [] { "table", "tr" })

Or, a user would have to workaround by manually adding the full context to the string before calling the Sanitize method and removing it after.

I have no idea if this is feasible - just thought I'd make a comment from an API perspective.
Obviously if the parser isn't intended to handle fragments, that's a design decision and the workaround would be the way to go.

tiesont · 2017-04-21T07:46:16Z

@panetta-net-au Well, the HTML parser is AngleSharp, which in theory is supposed to give you the same "corrected" markup as most browsers (meaning it will try to correct fragments if possible). A <th> by itself would not be valid markup, so it gets stripped. I assume you'd see the same behavior with orphaned <li>, <dt>, or <dd> elements, for example.

panetta-net-au · 2017-04-21T08:01:24Z

I suppose it depends how you define a fragment. If a fragment is a node or nodes that can exist as a direct descendent of body - then yeah, I agree - but if a fragment is a node that can exist anywhere in the tree, then I'd say they're not currently supported.

Regardless, the workaround is probably to wrap your orphaned fragments in the context they'll be used in before sending it through AngleSharp.

Whether that's the responsibility of the user or this library - I think that is the debate.

IMHO, if the user has access to the anglesharp document before and after sanitisation, that's probably enough to do what I suggested earlier.

mganss · 2017-04-21T09:59:27Z

@panetta-net-au You're right, currently the context of fragments is always <body>...</body>. If you want another context, the workaround is to wrap the input with your context before sanitization and then strip it afterwards:

var sanitizer = new HtmlSanitizer();
var html = @"<th>Header</th>";
var actual = sanitizer.Sanitize("<table>" + html + "</table>");
actual = actual.Remove(0, "<table><tbody><tr>".Length);
actual = actual.Remove(actual.Length - "</tr></tbody></table>".Length);

I have no plans to add a context feature to the library, but PRs are always welcome 😉 It might just be a couple of lines of code. I know AngleSharp accepts a context parameter when parsing fragments.

Xyncgas · 2021-04-15T07:29:37Z

I mean you can just use regex, regardless, I fee like this behavior was to prevent tag poisoning.

mganss added the enhancement label Aug 29, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

<th> is stripped even though it's an allowed tag if not part of full table markup #110

<th> is stripped even though it's an allowed tag if not part of full table markup #110

robertc commented Mar 2, 2017

mganss commented Mar 2, 2017

panetta-net-au commented Apr 21, 2017

tiesont commented Apr 21, 2017

panetta-net-au commented Apr 21, 2017

mganss commented Apr 21, 2017

Xyncgas commented Apr 15, 2021

<th> is stripped even though it's an allowed tag if not part of full table markup #110

<th> is stripped even though it's an allowed tag if not part of full table markup #110

Comments

robertc commented Mar 2, 2017

mganss commented Mar 2, 2017

panetta-net-au commented Apr 21, 2017

tiesont commented Apr 21, 2017

panetta-net-au commented Apr 21, 2017

mganss commented Apr 21, 2017

Xyncgas commented Apr 15, 2021