Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Introduce class for sourcing block attributes from HTML #46345

Draft
wants to merge 19 commits into
base: trunk
Choose a base branch
from

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Dec 7, 2022

What?

Part of #44410.

Makes it possible to source some block attributes from a chunk of HTML given the attribute definitions (as passed in registerBlock())

Why?

Who doesn't want sourced block attributes from the PHP side?

How?

Uses the WP_HTML_Tag_Processor and some basic assumptions about HTML structure to source attributes iff:

  • it can confidently parse the CSS selector in the attribute definition
  • it can construct a query to get that information out of the html

}

public static function parse_selector( $s, $at = 0 ) {
$selectors = explode( ',', $s );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd treat the separator more carefully in case this gets applied to a selector like a[href*=","]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the explode will be executed every time this function runs, including the call in line 254 below where we already expect to deal with a single selector. only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thanks. still very rough, and in fact I want to move a lot of code around here, but I wanted to get some basic structure together to see how to build this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved support for this into the main parser function and am aborting on unsupported attribute queries, so we shouldn't have this bug anymore

}

// @TODO: Hashes don't have to start with `nmstart` so this might reject valid hash names.
$identifier = self::parse_css_identifier( $s, $at );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it only consumes the first css identifier at the moment, like figure in figure video.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks right 😉

though to be fair if you look up at line 275 above it, it's acknowledged that we don't support the descendant selector yet

Comment on lines 123 to 148
while ( $tags->next_tag() ) {
foreach ( $selector as $s ) {
if ( 'element' === $s['type'] && $tags->get_tag() === strtoupper( $s['identifier'] ) ) {
return $tags;
}

// @TODO: $tags->has_class() would be _really_ handy here.
if ( 'class' === $s['type'] && preg_match( "~\b{$s['identifier']}\b~", $tags->get_attribute( 'class' ) ) ) {
return $tags;
}

if ( 'hash' === $s['type'] && $s['identifier'] === $tags->get_attribute( 'id' ) ) {
return $tags;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively reimplements $tags->next_tag. I'm tempted to say it could be included in the next_tag API as follows:

$criteria = [];
foreach ( $selectors as $s ) {
    $criterion = [];
    if ( 'element' === $s['type'] ) {
      $criteria[] = [ 'tag_name' => strtoupper( $s['identifier'] ) ];
    } else if ( 'class' === $s['type'] ) {
      $criteria[] = [ 'class_name' => $s['identifier'] ];
    } else if ( 'hash' === $s['type'] ) {
      $criteria[] = [ 'id' => $s['identifier'] ];
    }
}
$tags->next_tag( [
    'or' => $criteria
] );

However, I can see how this may need more CSS-specific matchers that don't make sense in the core API.

At the same time, this method has duplicate logic for a special case when only one selector is received. I'd do this instead:

public static function select( $selector, $html ) {
	$tags = new WP_HTML_Naive_Processor( $html );
	if ( ! array_is_list( $selector ) ) {
		$selector = [ $selector ];
	}

	while ( $tags->next_tag() ) {
		foreach ( $selector as $s ) {
			// ...
		}
	}

	return null;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on one note, it's not duplicate logic, as in the case of the individual selector we can tell the tag processor to jump to the next tag you specify.

on another note, if you squint…

there's something very potentially powerful here that could be moved into the tag processor. spent some time today reading the CSS selector spec…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for another reason I've actually collapsed it to your suggestion; getting has_attribute support (such as with a[download]) was going to start getting cumbersome, so now it's just one big loop.

}
list( $inner_selector, $next_at ) = $inner;
$inner_selector['combinator'] = $combinator;
$selector['then'] = $inner_selector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the output like a linked list of selectors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right. we need to represent selectors in some way that captures the recursive or monadic behavior of the combinators. we have to use the first segment, or simple selector sequence, to find a given location in the document, and then from that starting point and in whatever relation to it is appropriate given the combinator, find the next one.

we could probably do this as a list, but I already had set the top level as a list to represent the comma-separated selectors_group, which is special because the OR operation here only exists at the top level.

recursing seemed easiest since I could presumably pass a single object to the query generator and it would contain the next simple selector sequence in the chain, or nothing at all.

haven't built that part of the query generator yet though.

Copy link
Contributor

@adamziel adamziel Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about maybe a list of lists, as in

[
   [$segment1, $segment2, $segment3],
   [$segment1, $segment2, $segment3]
]

We'll still need to use nesting, though, for things like :not(span, a), so maybe better to embrace it right away.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet sold on supporting :not() at this stage.
The list works, but the nesting is convenient because we don't have to track where in the list we are.

What do you see as the benefit of the flat list?

* nonascii [\240-\377]
* escape {unicode}|\\[^\r\n\f0-9a-f]
* unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])?
* h [0-9a-f]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h -> hex? For readability – I got confused for a moment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a direct copy from the spec. yes, h is there for a hex digit

* not, that `<p>` itself is a special element.
*/

class WP_HTML_Processor extends WP_HTML_Tag_Processor {
Copy link
Contributor

@adamziel adamziel Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also works with tags, it's even in the method names – I'd be slightly confused to see two classes with such similar names doing similar things. If there isn't anything Unsafe here, should we just bring these methods to the parent class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not unsafe right now because the functions have been renamed to imply they are making assumptions (and those names might change)

but there's still a difference between this and the tag processor that I wrestle with. this is "safe" in that it will preserve the HTML syntax by relying on the tag processor. this is "unsafe" in that it allows one to break the semantic of the HTML by performing operations that would require multiple steps in a DOM, such as removing a range of HTML tags or tag openers or tag closers. it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.

<p>Before<p>Inside</p>

For this input there's nothing that will trap the first paragraph and any balanced tag operation will fail after reaching the end of input. Hopefully that will change in time, and we'd properly find <p>Before as one outer content and <p>Inside</p> as another, but even then our performance domain has shifted greatly from the tag processor.

I'm thinking that as this class develops people may not generally call the tag processor and it will remain as a low-level library. That's why I renamed it to WP_HTML_Processor, because I think this could end up as the "entry point" interface for working with HTML. Sistering this is the as-of-yet-still-un-renamed-to WP_PHPQ (for lack of a better name) which might go inside here or still might live separately as a class for turning a CSS selector into an HTML query. The HTML class could expose that query instead of our next_tag() interface.

$p = new WP_HTML_Processor( $html );
while ( $p->find( 'section + h2 > a[href^="#"]' ) ) {
	$title = $p->get_inner_text();
}

It's still not a DOM, we can't call ->parent_node() or walk backwards (even though we can manually do this with bookmarking), but we can jump to specific places in the document, mangle the attributes, and mangle the content inside it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.

And then it allows you to find an element by a class name, even if it's table, and then insert a div inside it – which the browsers can handle, but is likely not what you want.

That's why I renamed it to WP_HTML_Processor, because I think this could end up as the "entry point" interface for working with HTML.

Oh I see, with this insight I'd probably rename the Tag_Processor to something else just eo avoid the confusion – but at this point I don't think it's worth it. 🤷

The HTML class could expose that query instead of our next_tag() interface.

I like that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably rename the Tag_Processor

I don't know about this. I think based on how careful we were with naming that it will retain the meaning we gave it.

What do you find contradictory about that class, if we introduce an HTML processor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not as much that I see a contradiction, as that I worry about confusing everyone with two very similarly named classes (in a similar way as folks get confused about slice vs splice in JS)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep thinking about it then. I want to keep the classes separate because of the different performance implications, but I don't know how to address what you're suggesting (apart from Unsafe which I'm currently skeptical of using)

@ockham
Copy link
Contributor

ockham commented Dec 14, 2022

👋 I'm now using get_content_inside_balanced_tags over at WordPress/block-interactivity-experiments#118 (comment), and it's working great!

One thing that caught me a bit by surprise is that get_content_inside_balanced_tags moved the ... pointer thingy? past the closing tag. Could we maybe consider resetting it to where it started? I'm currently doing that manually, but it's a bit verbose -- especially given that get_content_inside_balanced_tags keeps track of the start and end internally (so it should be fairly easy to reset after getting the content).

@ockham
Copy link
Contributor

ockham commented Dec 14, 2022

Given that WP_HTML_Processor's get_content_inside_balanced_tags is pretty handy in its own right, whereas the CSS selector logic might need a bit more work -- would it make sense to file a separate PR just for WP_HTML_Processor (and its tests)?

I thought that this might also unblock work on a set_content_inside_balanced_tags method, which we'll need for the Block Interactivity SSR stuff (see WordPress/block-interactivity-experiments#118) 😊

(I can file that PR if you're okay with this plan @dmsnell @adamziel)

@dmsnell
Copy link
Contributor Author

dmsnell commented Dec 14, 2022

(I can file that PR if you're okay with this plan @dmsnell @adamziel)

@ockham consider everything in this PR tentative at best and more-than-likely to change. I don't even know if we'll end up with `get_content_inside_balanced_tags() or have something else instead.

Feel free to write your own PRs and explore different approaches and share here. This PR is more like my own notes as I work through the problem and try to understand where things belong and what the different parts of the system are.

@@ -87,6 +87,7 @@ public function get_content_inside_balanced_tags() {
}

$content = $this->content_inside_bookmarks( $start_name, $end_name );
$this->seek( $start_name );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

$tags->seek( 'start' );
$max = 100;
while ( --$max > 0 ) {
$next = $s;
Copy link
Contributor

@adamziel adamziel Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an idea – take a selector like #main > .wrapper a span.

This class is now going top to bottom and matching #main, then the immediate descendant .wrapper, then a descendant a and then a descendant span.

What if it went bottom to top? So we start by looking for the first span in the document and then investigate the parents stack to see if any matches a. Now we don't need to jump around the document to decide if it's a match.

There's still the issue of ~ and + combinators, but we could keep track of a tree of relevant siblings. For example in a HTML like

<div id="main">
   <div class="wrapper">
     <a>
       <div></div>
     </a>
     <a>
         <span></span>
         <span></span> <--- the cursor is here
     </a>
  </div>
</div>

We'd only store the following data structure (without the children of the initial a)

div#main
   div.wrapper
      a
      a
         span
         span

If we want to support matching arbitrary attributes beyond ids and classes, we could keep a bookmark for each parent and backtrack if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems worth a try. If we aren't going to fully parse the HTML then I don't see the same value add because quickly eliminating potential matches would happen before diving deep into the structure (and maybe that can still happen). If on the other hand, we're going to do a full document parse (and carry the weight of all that implies) then this method seems approachable.

The one major difference I see at this point is that if we're looking for the inner-most element first then we have to run $tags->matches() on every node and run a full check on every node. In the approach I've started taking we prune out most nodes out the door and are able to iterate with less careful checking, because at that point we're mostly just checking tag name and whether it's an opener or closer. In practice it may not matter, but I did want to point out something different from our HTML-based approach vs. what might be happening in the DOM land in the browser.

@github-actions
Copy link

github-actions bot commented Jan 6, 2023

Flaky tests detected in 6103139.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3858581592
📝 Reported issues:

@dmsnell
Copy link
Contributor Author

dmsnell commented Jan 9, 2023

All the test cases that I thought up are passing as of 79c24bd, but if you pressed me to explain how or why I wouldn't be able to say 🙃

Current thought is to continue exploring cases and see how this now interacts with other tag processor operations and modifications, see how deeply we can nest. See about how this does when given example Core block output.

Please for all who review this consider it a draft that's likely to be unlike a final version. Would like to convert the iteration into recursion and remove the goto statements, would like to extract the WP_HTML_Processor() outside the select() function and allow for passing it around through recursion but also for prepping it on an input document.

ockham added a commit that referenced this pull request Jan 30, 2023
@ockham ockham mentioned this pull request Jan 30, 2023
4 tasks
@gziolo gziolo added the [Feature] Block API API that allows to express the block paradigm. label Jan 31, 2023
This was referenced Feb 10, 2023
@gziolo gziolo added [Feature] HTML API An API for updating HTML attributes in markup [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible labels Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] HTML API An API for updating HTML attributes in markup [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants