Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New HTML Tag Processor - An in-depth tutorial #83

Closed
bph opened this issue Mar 2, 2023 Discussed in #75 · 41 comments
Closed

New HTML Tag Processor - An in-depth tutorial #83

bph opened this issue Mar 2, 2023 Discussed in #75 · 41 comments
Assignees
Labels

Comments

@bph
Copy link
Collaborator

bph commented Mar 2, 2023

Discussed in #75

Originally posted by bph February 8, 2023
WordPress 6.2 introduces the first component in a new HTML processing API. Before WordPress 6.2, the block markup was typically updated using regular expressions or the DOMDocument class. Both have downsides. The former is tedious and prone to security issues, while the latter uses libxml2 which does not support HTML5. (Adam Zielinski)

Beyond the Dev Note, it would be beneficial to PHP developers if there is a longer in-depth post on the Developer Blog.

cc: @adamziel , @dmsnell, @justintadlock

@bph bph added the flow: approved can move forward label Mar 2, 2023
@dmsnell
Copy link

dmsnell commented Mar 2, 2023

Thanks @bph !

@dmsnell
Copy link

dmsnell commented Mar 2, 2023

Status

@bph
Copy link
Collaborator Author

bph commented Mar 7, 2023

The Dev Note has been published: Introducing the HTML API in WordPress 6.2

@dmsnell
Copy link

dmsnell commented Apr 13, 2023

Will work on this in the coming week: goal is to have something together, a reasonable draft, by next Friday, April 21

@ndiego
Copy link
Member

ndiego commented Jun 20, 2023

After discussing this with @justintadlock, I am going to pick up this article.

@ndiego ndiego self-assigned this Jun 20, 2023
@justintadlock
Copy link

Just noting that @dmsnell seems to be doing this for now.

@gaambo
Copy link

gaambo commented Jun 20, 2023

I'd like to note one thing (if this isn't the right place for that, please excuse me): I always find the examples for the tag processor "confusing", because they are using $p as a variable for the processor/html. I always expect, that it's modifying a paragrah <p> tag, because of the name of the variable. It's definately a minor thing, but I think using clear variable names would make it easier to follow the examples.
So in may usage, I often use $html for the tag processor instance and then when modifying specific tags i use the tag as a variable name - eg: $a = $html->next_tag('a'); or $div = $html->next_tag('div');

@ndiego ndiego removed their assignment Jun 20, 2023
@dmsnell
Copy link

dmsnell commented Jun 21, 2023

thanks @gaambo - we deliberated a lot when designing the examples and started with the more verbose $tag_processor name, but it got a bit heavy-feeling. we shorted it to $p for $processor

I encourage folks to avoid storing the tag processor itself or the result of next_tag into names relating to HTML tag names, because that's not what they give you. next_tag() only tells you if it found a given tag, so that $div doesn't refer to any specific DIV element. maybe $found_a_div = $html->next_tag('div') is more descriptive there.

Do you find $processor = new WP_HTML_Tag_Processor( $html ) clearer than $p? $p has a nice terse cadence to it, though I can see where it might be confusing if the assumption is that it's storing some kind of tag.

Same caution exists for calling it $html, because $html is both a string and the name of the input HTML argument to the processor. It's given HTML and creates a Processor.

Naming is hard 😄

@gaambo
Copy link

gaambo commented Jun 21, 2023

Thanks for the explanation - tht makes sense, and in this case my examples were actually bad 😅
I, personally, think $processor is more more descriptive and better. But, as I said, it's a minor thing, but just wanted to mention it because it really confuses me in examples 😅

@dmsnell
Copy link

dmsnell commented Jun 29, 2023

Hello all, I've reached a first draft of a brand new version of the post.

Happy to have any review you might want to share. As a first draft, I could end up throwing everything out. That being said, the most valuable kinds of feedback will revolve around how the information is organized, things you think might be missing, things that are unclear, and things that you think should be communicated differently.

Thanks for all your collaboration on this!

@bph
Copy link
Collaborator Author

bph commented Jul 6, 2023

@zzap @marybaum Would you have time to review @dmsnell draft on Google Doc

@marybaum
Copy link
Member

marybaum commented Jul 6, 2023

Happy to! Can’t for about four hours though — D has a procedure this morning

@zzap
Copy link
Member

zzap commented Jul 7, 2023

Hi @dmsnell, thank you so much for writing about this. I'm both personally and professionally very interested in this topic and can't wait for the article to be published.

Overall, I love how it educates us on both Tag Processor and regular expression. I like the flow and structure.

I have a few suggestions:

  1. Is HTML without Regular Expressions the new title? If so, I would add Tag Processor: HTML without Regular Expressions or HTML API:... or something like that. My fear is that people will misjudge and skip it based on that title. For example, the first thought of HTML is not trying to match it with regular expressions. It's doing frontend. Missing the API name will leave people unaware of the article's topic.
  2. I'd add some intro before the first goal. e.g. Let's consider a few goals that can easily be achieved with Tag Processor. The reason is: in the previous paragraph, you talked about the reason why API was introduced in WordPress, and now when I see Goal, my first reaction is that this was the goal for getting API in the core.
  3. This sentence is a bit long. I had to read it twice to understand it fully. Can it be a bit simplified? Taking a step back there’s another problem in view: even if in this one spot the regular expression pattern were complicated enough to understand all of the HTML syntax and all of the semantic processing rules, it would still be one of thousands of places where the code is unique and behaves differently than other HTML processing code and needs its own separate maintenance and bug tasks.
  4. I agree with @gaambo about var naming and $p being confusing.
  5. Filter render_block example is done with closure, which is officially not recommended by wp.org because they are impossible to be removed. Is there any specific reason you used it that way, and if not, can you write it with a callback?
  6. This sentence is also a bit unclear to me (perhaps it's a second language thing): This is durable code because it’s expressing what it seeks to accomplish.

Tiny thingies:

  • The first sentence has use twice. Is there a better way to say it?
  • Missing space in greedyand
  • Plural should be singular in "even if in this one spot the regular expression pattern were complicated enough to understand"

As for use cases, I had the pleasure of seeing Tag Processor being used to remove nofollow attributes from anchors with specified domain names. It's pure joy, and we updated WordPress to 6.2 only to have that available.

Thank you again for writing this article ❤️

@dmsnell
Copy link

dmsnell commented Jul 7, 2023

I had the pleasure of seeing Tag Processor being used to remove nofollow attributes from anchors with specified domain names. It's pure joy, and we updated WordPress to 6.2 only to have that available.

aw. thanks for melting my heart @zzap

Is HTML without Regular Expressions the new title?

Yes, and in fact I wanted to make this more about parsing HTML and Regular Expressions than about the Tag Processor. That is, I've been walking the line carefully about trying to avoid communicating that people ought to find a way to use the Tag Processor - I don't want to convey that through this post. I'd rather address the problem where it's at, "trying to do things in HTML and reaching for regular expressions or DOMDocument, and then show how this API can do it better. In other words, let the problem lead the solution instead of broadcasting this as "you should be using the HTML API."

Any thoughts on how to mix your feedback with this?

Let's consider a few goals that can easily be achieved with Tag Processor
when I see Goal, my first reaction is that this was the goal for getting API in the core.

I'd be interested in hearing a re-wording of this. On one hand, it sounds like you appropriately understood the reason this API was introduced - mission accomplished 😆

This sentence is a bit long

Thanks, I'll work on that one.

Is there any specific reason you used it that way

Nope, just trying to avoid distracting from the topic at hand. Happy to change that.

Plural should be singular in "even if in this one spot the regular expression pattern were complicated enough to understand"

I need your help identifying the plural here. If it's the highlighted were that's not plural, it's the appropriate English subjunctive case, in contrast to the often misused even if it was the case.


I'll take a deeper look at all this feedback after I get back from some vacation. Thanks for reviewing the doc!

@zzap
Copy link
Member

zzap commented Jul 12, 2023

I need your help identifying the plural here. If it's the highlighted were that's not plural, it's the appropriate English subjunctive case, in contrast to the often misused even if it was the case.

Ahh, it was my bad English. Thank you for explaining it ❤️

Any thoughts on how to mix your feedback with this?

OK, I hear your point. In that case, I'd add the word parsing somewhere. E.g. Parsing HTML? Forget the Regular Expression

I'd be interested in hearing a re-wording of this. On one hand, it sounds like you appropriately understood the reason this API was introduced - mission accomplished laughing

Well, I would just add some small intro for examples there, as a bridge between the article intro and actual examples. Because the first goal wasn't the only goal for getting API into the core. It's nothing major, just feels a bit disconnected, like two independent pieces of content were put one after the other. I don't know if I'm explaining this properly.

@dmsnell
Copy link

dmsnell commented Jul 12, 2023

Thanks again @zzap - made another round of edits and updated the document. In places where you mentioned it was hard to read I changed the sentences around to avoid the problems.

the first goal wasn't the only goal for getting API into the core

Maybe it's good to talk about this more here together. The two examples in the document are basically the leading motivations for the introduction of the HTML API. Even though many more applications are possible now, its inception is quite modest.

It's completely possible I'm being quite ignorant here though and am overlooking something obvious.

In any case, at this point in time, my primary hope is that people at large only start adopting the HTML API for very limited cases like the examples. I want to see continued development and use of the HTML API, but in a way that fosters communication back to its development and informs its needs.

@justintadlock
Copy link

@dmsnell - Can change the share settings for the doc to "Anyone with the link" can comment?

@dmsnell
Copy link

dmsnell commented Jul 12, 2023

@justintadlock done.

@justintadlock
Copy link

Thanks. I did a full read through it and really enjoyed what you came up with. I left a few minor in-doc comments. Other than those, it looks good to me.

@dmsnell
Copy link

dmsnell commented Jul 12, 2023

Thanks @justintadlock - done.

@bph
Copy link
Collaborator Author

bph commented Jul 12, 2023

@dmsnell it reads really well.
I did another sweep of commas in and out, and other minor suggestions.

In your email space for your .org account you should find an invitation to the developer news blog with a request to accept.
Once you feel you are ready, move your text there, and share the Public preview link in a comment below.

Here are two checklists:

Pre-publishing checklist:

  • Post Title in Sentence case
  • Use "Center Column" template
  • Are Category or Categories selected?
  • Are Tags identifies?
  • Is there an explicit Excerpt?
  • Do all images have an alt-text?
  • Props added?
    🙌 Publish! 📗

Post-publishing checklist

  • add Props for reviews to #props channel in WP Slack
  • close the issue with a comment to link to the published post
  • close the accompanying discussion with the link to the published post.

Some of it is self-explanatory, some is not. I would be happy to walk you through it. It's your first post and I can't be more excited for you to be a part of the Developer blog crew :-)

@marybaum
Copy link
Member

marybaum commented Jul 12, 2023 via email

@dmsnell
Copy link

dmsnell commented Jul 14, 2023

Thanks @marybaum for all your comments. I'd love for you to get credit for all that writing. Would you like to create a draft post so you can be the author? I'll copy the contents from the Google Doc into that if you do.

@dmsnell
Copy link

dmsnell commented Jul 14, 2023

@marybaum I've adopted all your suggestions in the document but left a few questions where it seems like we've intentionally changed what's being said in a way that's factually inaccurate.

would love to see you get the credit for this writing. @bph apart from copying the contents from the Google Doc into a draft post do you need me for anything else on this? I'm extremely excited to put this behind me.

@bph
Copy link
Collaborator Author

bph commented Jul 14, 2023

It would be great if you could include the pre-publish checklist when you copy/paste as much as you can. But that: would be it. I will push it over the finish line 🥰

@marybaum
Copy link
Member

I’m so flattered by all this! I only got about halfway through, and I’m afk today and tomorrow. We’re in Colorado for some mountain tennis and pickleball, but will have some downtime while Dick drives us to some mountain destinations the next few days after that.

@bph
Copy link
Collaborator Author

bph commented Jul 18, 2023

@dmsnell Let me know how I can assist getting this over the finish line, maybe even this week?

@dmsnell
Copy link

dmsnell commented Jul 18, 2023

@bph let's get that post created with @marybaum as the author. I'm happy to do all the work copying the content into it and reformatting. I'd do this but I don't think I can change the author if I create the post.

@marybaum
Copy link
Member

marybaum commented Jul 18, 2023 via email

@marybaum
Copy link
Member

Finished editing the second half. I'll pull it into the p2 tomorrow.

@marybaum
Copy link
Member

I lied. I'll do that Thursday or Friday.

@marybaum
Copy link
Member

Or two months later. Oops.

Welp—it is in the p2. https://developer.wordpress.org/news/?p=2023&preview=1&_ppp=756268466a

@bph
Copy link
Collaborator Author

bph commented Sep 18, 2023

Thank you so much @marybaum 🙌

@dmsnell could go over this public preview one last time? It's going to be published on Wednesday morning.

I went over the pre-publish list:

  • Post Title in Sentence case
  • Use "Center Column" template
  • Are Category or Categories selected?
  • Are Tags identifies?
  • Is there an explicit Excerpt?
  • Do all images have an alt-text?
  • Props added?
    🙌 Publish! 📗

-[ ] added a "Resources to learn more" section.

Both of you @marybaum & @dmsnell

What do you think about the excerpt: "All by itself, the HTML Tag processor is better than regular expressions. It's convenient, reliable, fast—and You. Can. Read. It. This article shows you in two examples how to get started using the HTML Tag processor."

Do you think we need a Table of Contents?

@marybaum
Copy link
Member

I think that excerpt is great! I have also added a featured image.

@dmsnell
Copy link

dmsnell commented Sep 18, 2023

@bph looks like the post has a number of formatting typos in it still; if it's needed I can go fix those formatting issues. I think there might be some conflicts that arose when copying from the Google document; some things are a bit confusing and don't make sense to me when reading them.

Apart from that I think the author of the post is me, but that's not right. I'd prefer we give credit where it's due 😉.

The way I see it, I'm happy to continue to provide some technical review/feedback, but this has transformed into a post that I'm not really an author or editor of, and I'm happy with that. I think I've provided all the technical review already in the Google document that is informative, so beyond what I've already shared there may not be much new I have to offer - I may just be repeating myself if I try to offer another round of review.

Noted that this is lacking from the pre-publish list, but is valuable anyway:

  • fix the broken formatting throughout the document. mostly this is the inclusion of literal backticks but there are some HTML tags rendered as text and the code blocks are inconsistent in their syntax highlighting.
  • remove me as the post author.
  • there are some things that read confusing or at least technically misleading. I think these are intentional choices at this point, but it could help to ask someone else who works with WordPress code to review the document and share feedback. I only bring this up because I think the post still has some potential to speak to developers.

@marybaum
Copy link
Member

I'll take a run through it for the backticks.

@marybaum
Copy link
Member

I can understand your discomfort with something this florid. You should hear my internal dialogue when I'm writing my own CSS and JSON files, since I am absolutely irrational about things nobody else notices, like typography.

@bph
Copy link
Collaborator Author

bph commented Sep 20, 2023

We went through the pre-publish checklist before...
When you are ready, @marybaum
🚢 it :-)

And here is your post-publish checklist:

  • add Props for reviews to #props channel in WP Slack (Example) (use Slack handles)
  • add copy for a social post as comment to this issue (Beyond Block Styles: Part 2 #152 (comment))
  • close the issue with a comment to link to the published post
  • close the accompanying discussion with the link to the published post.

@justintadlock
Copy link

justintadlock commented Sep 21, 2023

Adding a new item to the pre-publish checklist:

@bph
Copy link
Collaborator Author

bph commented Sep 28, 2023

Published: The HTML API: process your tags, not your pain

@justintadlock
Copy link

Note that I switched the social image to the default template. The chosen template + featured image made the post title really hard to read.

@bph bph closed this as completed Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Published (Done)
Development

No branches or pull requests

8 participants