Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unwrap all embeds with paragraph tags #4650

Closed
wants to merge 51 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
7401840
Sanitize DailyMotion embeds
pierlon Apr 1, 2020
3c57f35
Sanitize gfycat embeds
pierlon Apr 27, 2020
7e1ccf0
Unwrap Instagram embeds
pierlon Apr 27, 2020
3bb9d3e
Unwrap Scribd embeds
pierlon May 4, 2020
be260f8
Unwrap SoundCloud embeds
pierlon May 5, 2020
d7f3421
Sanitize Twitter timelines and moments
pierlon May 8, 2020
9c17878
Unwrap Vimeo embeds
pierlon May 8, 2020
91cdd3d
Sanitize YouTube embeds
pierlon May 8, 2020
ce5ef79
Sanitize WordPress TV embeds
pierlon May 8, 2020
575f173
Sanitize Hulu embeds
pierlon May 8, 2020
d789507
Fix lint errors
pierlon May 8, 2020
2ee90b4
Merge branch 'develop' into fix/4450-wrapped-embeds
pierlon May 16, 2020
bd606bf
Make child element count more robust
pierlon May 18, 2020
8e492b2
Extract base embed URLs as class constants
pierlon May 18, 2020
3f4f9ae
Get dimensions from iframe if available
pierlon May 18, 2020
085d1cc
Refactor logic for removing script sibling
pierlon May 18, 2020
050b9e7
Simplify logic for retrieving embed IDs
pierlon May 18, 2020
6893a47
Fix tests
pierlon May 19, 2020
34ca29c
Implement the Template Method design pattern
pierlon May 21, 2020
c955f45
Sanitize Imgur embeds
pierlon May 21, 2020
21ae657
Sanitize Tumblr embeds
pierlon May 21, 2020
8850a0f
Mock tests for DailyMotion
pierlon May 21, 2020
decbb49
Sanitize Reddit embeds
pierlon May 22, 2020
361b712
Implement `get_raw_embed_nodes` in `AMP_Base_Embed_Handler`
pierlon May 22, 2020
28b4394
Remove `get_parent_element` method
pierlon May 22, 2020
c663535
Make YouTube xpath query more specific
pierlon Jun 5, 2020
1f4df49
Merge branch 'develop' into fix/4450-wrapped-embeds
pierlon Jun 17, 2020
573334a
Fix lint issues
pierlon Jun 18, 2020
c695846
Mock Instagram embed test
pierlon Jun 18, 2020
67bf708
Update docblocks specifying alternative return values
pierlon Jun 18, 2020
e1c768d
Fix test cases for Dailymotion, Twitter, Gfycat and WordPress TV embeds
pierlon Jun 19, 2020
5b2ece9
Update mocked responses
pierlon Jun 19, 2020
5fa1cdb
Sanitize Meetup embeds
pierlon Jun 19, 2020
85b55cb
Sanitize Issuu embeds
pierlon Jun 19, 2020
996862a
Sanitize CrowdSignal embeds
pierlon Jun 19, 2020
1cd716f
Merge branch 'develop' into fix/4450-wrapped-embeds
pierlon Jun 19, 2020
4885788
Sanitize Pinterest embeds
pierlon Jun 19, 2020
59066e8
Sanitize Vine embeds
pierlon Jun 21, 2020
7c03ced
Add `Registerable` interface to indicate embed handlers that hook int…
pierlon Jun 21, 2020
1d91d0b
Fix lint and static analysis errors
pierlon Jun 21, 2020
cb3ef8b
Merge branch 'develop' into fix/4450-wrapped-embeds
pierlon Jun 21, 2020
beadf51
Rename `whitelist_sanitizer` to `validating_sanitizer`
pierlon Jun 21, 2020
b0157a1
Mark Pinterest and Vine embed handlers as registerable
pierlon Jun 25, 2020
b3a2c1c
Add oEmbed provider URLs for embeds not supported in WP 5.1
pierlon Jun 25, 2020
dfaf3e5
Fix YouTube embed test
pierlon Jun 25, 2020
244b53d
Merge branch 'develop' into fix/4450-wrapped-embeds
pierlon Jun 25, 2020
a857f25
Remove unused vars
pierlon Jun 25, 2020
434660a
Fix tests failures related to Gutenberg
pierlon Jun 25, 2020
986c036
Add oEmbed provider URL for Crowdsignal surveys on WP 5.2.
pierlon Jun 25, 2020
62fa5ae
Merge branch 'develop' of github.com:ampproject/amp-wp into fix/4450-…
westonruter Jun 25, 2020
0860f48
Update phpdoc to reflect where methods do not return null
westonruter Jun 25, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions includes/class-amp-theme-support.php
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
*/

use AmpProject\Amp;
use AmpProject\AmpWP\Embed\Registerable;
use AmpProject\AmpWP\Option;
use AmpProject\AmpWP\RemoteRequest\CachedRemoteGetRequest;
use AmpProject\AmpWP\ConfigurationArgument;
Expand Down Expand Up @@ -1179,8 +1180,9 @@ public static function register_content_embed_handlers() {
);
continue;
}

$embed_handler->register_embed();
if ( $embed_handler instanceof Registerable ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this have a backwards-incompatibility problem? If someone has existing subclasses of AMP_Base_Embed_Handler with register_embed defined, should it not do something like this:

Suggested change
if ( $embed_handler instanceof Registerable ) {
if ( $embed_handler instanceof Registerable || method_exists( $embed_handler, 'register_embed' ) ) {

$embed_handler->register_embed();
}
$embed_handlers[] = $embed_handler;
}

Expand Down
150 changes: 146 additions & 4 deletions includes/embeds/class-amp-base-embed-handler.php
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@
* @package AMP
*/

use AmpProject\Dom\Document;

/**
* Class AMP_Base_Embed_Handler
*/
Expand Down Expand Up @@ -40,14 +42,18 @@ abstract class AMP_Base_Embed_Handler {
protected $did_convert_elements = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This property can be marked as @deprecated and any usages in the subclasses can be eliminated (all of which are write-only currently), as it is not serving any purpose anymore.

In the past it was used to determine whether or not the AMP component script needed to be added to the page. But this is obsolete.


/**
* Registers embed.
* Default AMP tag to be used when sanitizing embeds.
*
* @var string
*/
abstract public function register_embed();
protected $amp_tag = 'amp-iframe';

/**
* Unregisters embed.
* Base URL used for identifying embeds.
*
* @var string
*/
abstract public function unregister_embed();
protected $base_embed_url = '';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this property as being used anywhere. Can it be removed?


/**
* Constructor.
Expand All @@ -64,6 +70,56 @@ public function __construct( $args = [] ) {
);
}

/**
* Sanitize all embeds on the page to be AMP compatible.
*
* @param Document $dom DOM.
*/
public function sanitize_raw_embeds( Document $dom ) {
$nodes = $this->get_raw_embed_nodes( $dom );

if ( null === $nodes ) {
// Bail if the embed handler returns null.
return;
}

if ( 0 === $nodes->length ) {
return;
}

foreach ( $nodes as $node ) {
if ( ! $this->is_raw_embed( $node ) ) {
continue;
}
$this->sanitize_raw_embed( $node );
}
}

/**
* Determine if the node is indeed a raw embed.
*
* @param DOMElement $node DOM element.
* @return bool True if it is a raw embed, false otherwise.
*/
protected function is_raw_embed( DOMElement $node ) {
return $node->parentNode && $this->amp_tag !== $node->parentNode->nodeName;
Comment on lines +101 to +105
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To facilitate using this to loop over a list of arbitrary nodes (e.g. childNodes) probably safer to accept a DOMNode instead:

Suggested change
* @param DOMElement $node DOM element.
* @return bool True if it is a raw embed, false otherwise.
*/
protected function is_raw_embed( DOMElement $node ) {
return $node->parentNode && $this->amp_tag !== $node->parentNode->nodeName;
* @param DOMNode $node DOM node.
* @return bool True if it is a raw embed, false otherwise.
*/
protected function is_raw_embed( DOMNode $node ) {
return $node instanceof DOMElement && $node->parentNode && $this->amp_tag !== $node->parentNode->nodeName;

}

/**
* Get all raw embeds from the DOM.
*
* @param Document $dom Document.
* @return DOMNodeList|null A list of DOMElement nodes, or null if not implemented.
*/
abstract protected function get_raw_embed_nodes( Document $dom );
Comment on lines +112 to +114
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not implemented? Doesn't an abstract method always have to be implemented by a subclass? Shouldn't this method just be changed to be non-abstract and then return null here? Then only the subclassed methods would return DOMNodeList. This would avoid needing to define the method to return null as it stands today for Twitter, Playlist, Gallery, and Blocks.


/**
* Make embed AMP compatible.
*
* @param DOMElement $node DOM element.
*/
abstract protected function sanitize_raw_embed( DOMElement $node );

/**
* Get mapping of AMP component names to AMP script URLs.
*
Expand Down Expand Up @@ -107,4 +163,90 @@ function ( $attr_name ) {
}
return wp_array_slice_assoc( $matches, $attribute_names );
}

/**
* Get all child elements of the specified element.
*
* @param DOMElement $node Element.
* @return DOMElement[] Array of child elements for specified element.
*/
protected function get_child_elements( DOMElement $node ) {
return array_filter(
iterator_to_array( $node->childNodes ),
static function ( DOMNode $child ) {
return $child instanceof DOMElement;
}
);
}

/**
* Replace the node's parent with itself if the parent is a <p> tag, has no attributes and has no other children.
* This usually happens while `wpautop()` processes the element.
*
* @since 1.6
*
* @param DOMElement $node Node.
*/
protected function unwrap_p_element( DOMElement $node ) {
$parent_node = $node->parentNode;
while ( $parent_node && ! ( $parent_node instanceof DOMElement ) ) {
$parent_node = $parent_node->parentNode;
}

if ( $parent_node instanceof DOMElement && 'p' === $parent_node->nodeName && false === $parent_node->hasAttributes() ) {
$child_element_count = count( $this->get_child_elements( $parent_node ) );
if ( 1 === $child_element_count ) {
$parent_node->parentNode->replaceChild( $node, $parent_node );
}
}
}

/**
* Removes the node's nearest <script> sibling with a `src` attribute containing the base `src` URL provided.
*
* @since 1.6
*
* @param DOMElement $node The DOMNode to whose sibling is the script to be removed.
* @param string $base_src_url Script URL to match against.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The $base_src_url is not strictly checked as a base. It is getting passed without any protocol. While maybe this is overkill, perhaps a full URL should be passed as $base_src_url and then the initial /https?:(?=\/\/)/ can be stripped from it and the src being checked. Then it could look for 0 === strpos(...) as opposed to false !== strpos(...). That would be more robust.

* @param string $content Text content of node to match against.
* @param bool $is_next_to Whether the script sibling is next or preceding the specified element.
*/
protected function remove_script_sibling( DOMElement $node, $base_src_url, $content = '', $is_next_to = true ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In seeing the usages of remove_script_sibling in the subclasses, it's somewhat hard to read in regards to the $content and $is_next_to parameters. Can helper methods be used to make it more intuitive? Like remove_inline_script_sibling() when seeking to pass $content for example.

$sibling_location = $is_next_to ? 'nextSibling' : 'previousSibling';
$element_sibling = $node->{$sibling_location};

while ( $element_sibling && ! ( $element_sibling instanceof DOMElement ) ) {
$element_sibling = $element_sibling->{$sibling_location};
}

// Handle case where script is wrapped in paragraph by wpautop.
if ( $element_sibling instanceof DOMElement && 'p' === $element_sibling->nodeName ) {
$children_elements = array_values( $this->get_child_elements( $element_sibling ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using array_values() isn't needed because get_child_elements already uses iterator_to_array():

Suggested change
$children_elements = array_values( $this->get_child_elements( $element_sibling ) );
$children_elements = $this->get_child_elements( $element_sibling );


if (
1 === count( $children_elements ) &&
'script' === $children_elements[0]->nodeName &&
(
( $base_src_url && false !== strpos( $children_elements[0]->getAttribute( 'src' ), $base_src_url ) ) ||
( $content && false !== strpos( $children_elements[0]->textContent, $content ) )
)
) {
$element_sibling->parentNode->removeChild( $element_sibling );
return;
}
}

// Handle case where script is immediately following.
$is_embed_script = (
$element_sibling instanceof DOMElement &&
'script' === strtolower( $element_sibling->nodeName ) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is strtolower() needed here? It's not used above:

Suggested change
'script' === strtolower( $element_sibling->nodeName ) &&
'script' === $element_sibling->nodeName &&

I found that in all PHP versions the value is normalized to lower case already: https://3v4l.org/oDCO6

(
( $base_src_url && false !== strpos( $element_sibling->getAttribute( 'src' ), $base_src_url ) ) ||
( $content && false !== strpos( $element_sibling->textContent, $content ) )
)
);
if ( $is_embed_script ) {
$element_sibling->parentNode->removeChild( $element_sibling );
}
}
}
21 changes: 20 additions & 1 deletion includes/embeds/class-amp-core-block-handler.php
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
* @package AMP
*/

use AmpProject\AmpWP\Embed\Registerable;
use AmpProject\Attribute;
use AmpProject\Dom\Document;

Expand All @@ -13,7 +14,7 @@
*
* @since 1.0
*/
class AMP_Core_Block_Handler extends AMP_Base_Embed_Handler {
class AMP_Core_Block_Handler extends AMP_Base_Embed_Handler implements Registerable {

/**
* Attribute to store the original width on a video or iframe just before WordPress removes it.
Expand Down Expand Up @@ -355,4 +356,22 @@ private function process_text_widgets( Document $dom ) {
}
}
}

/**
* Get all raw embeds from the DOM.
*
* @param Document $dom Document.
* @return DOMNodeList|null A list of DOMElement nodes, or null if not implemented.
*/
protected function get_raw_embed_nodes( Document $dom ) { // phpcs:ignore VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable
return null;
}
Comment on lines +360 to +368
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just making this the method in the base class.


/**
* Make embed AMP compatible.
*
* @param DOMElement $node DOM element.
*/
protected function sanitize_raw_embed( DOMElement $node ) { // phpcs:ignore VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable
}
Comment on lines +375 to +376
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be in the base class as well.

}
129 changes: 99 additions & 30 deletions includes/embeds/class-amp-crowdsignal-embed-handler.php
Original file line number Diff line number Diff line change
Expand Up @@ -8,59 +8,128 @@
* @since 1.2
*/

use AmpProject\AmpWP\Embed\Registerable;
use AmpProject\Dom\Document;

/**
* Class AMP_Crowdsignal_Embed_Handler
*/
class AMP_Crowdsignal_Embed_Handler extends AMP_Base_Embed_Handler {
class AMP_Crowdsignal_Embed_Handler extends AMP_Base_Embed_Handler implements Registerable {

/**
* Register embed.
* Register the embed.
*
* @return void
*/
public function register_embed() {
add_filter( 'embed_oembed_html', [ $this, 'filter_embed_oembed_html' ], 10, 3 );
if ( version_compare( get_bloginfo( 'version' ), '5.2', '>=' ) ) {
return;
}

// The oEmbed providers for CrowdSignal embeds are outdated on WP < 5.1. Updating the providers here will
// allow for the oEmbed HTML to be fetched, and can then sanitized later below.
$formats = [
'#https?://(.+\.)?polldaddy\.com/.*#i',
'#https?://poll\.fm/.*#i',
'#https?://(.+\.)?survey\.fm/.*#i', // Not available on WP 5.2.
];

foreach ( $formats as $format ) {
wp_oembed_add_provider( $format, 'https://api.crowdsignal.com/oembed', true );
}
}

/**
* Unregister embed.
* Unregister the embed.
*
* @return void
*/
public function unregister_embed() {
remove_filter( 'embed_oembed_html', [ $this, 'filter_embed_oembed_html' ], 10 );
}

/**
* Filter oEmbed HTML for Crowdsignal for AMP output.
* Get all raw embeds from the DOM.
*
* @param string $cache Cache for oEmbed.
* @param string $url Embed URL.
* @param array $attr Shortcode attributes.
* @return string Embed.
* @param Document $dom Document.
* @return DOMNodeList A list of DOMElement nodes.
*/
public function filter_embed_oembed_html( $cache, $url, $attr ) {
$parsed_url = wp_parse_url( $url );
if ( empty( $parsed_url['host'] ) || empty( $parsed_url['path'] ) || ! preg_match( '#(^|\.)(?P<host>polldaddy\.com|crowdsignal\.com|survey\.fm|poll\.fm)#', $parsed_url['host'], $matches ) ) {
return $cache;
}
protected function get_raw_embed_nodes( Document $dom ) {
$queries = [
// For poll embeds.
'//iframe[ @class="cs-iframe-embed" and starts-with( @src, "https://poll.fm/" ) ]',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'//iframe[ @class="cs-iframe-embed" and starts-with( @src, "https://poll.fm/" ) ]',
'//noscript/iframe[ @class="cs-iframe-embed" and starts-with( @src, "https://poll.fm/" ) ]',

// For survey embeds.
'//div[ @class="pd-embed" and @data-settings ]',
];

$parsed_url['host'] = $matches['host'];
return $dom->xpath->query( implode( ' | ', $queries ) );
}

$output = '';
/**
* Make embed AMP compatible.
*
* @param DOMElement $node DOM element.
*/
protected function sanitize_raw_embed( DOMElement $node ) {
$is_poll = 'cs-iframe-embed' === $node->getAttribute( 'class' );

// Poll oEmbed responses include noscript which can be used as the AMP response.
if ( preg_match( '#<noscript>(.+?)</noscript>#s', $cache, $matches ) ) {
$output = $matches[1];
if ( $is_poll ) {
$this->sanitize_poll_embed( $node );
} else {
$this->sanitize_survey_embed( $node );
}
}

if ( empty( $output ) ) {
if ( ! empty( $attr['title'] ) ) {
$name = $attr['title'];
} elseif ( 'survey.fm' === $parsed_url['host'] || preg_match( '#^/s/#', $parsed_url['path'] ) ) {
$name = __( 'View Survey', 'amp' );
} else {
$name = __( 'View Poll', 'amp' );
}
$output = sprintf( '<a href="%s" target="_blank">%s</a>', esc_url( $url ), esc_html( $name ) );
/**
* Sanitize poll embed.
*
* @param DOMElement $node Poll embed.
*/
private function sanitize_poll_embed( DOMElement $node ) {
// Replace the `noscript` parent element with the iframe.
$node->parentNode->parentNode->replaceChild( $node, $node->parentNode );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this confirm that the $node->parentNode->parentNode is actually a noscript? This could also be done done at the XPath level.


$this->unwrap_p_element( $node );
$this->remove_script_sibling( $node, 'https://secure.polldaddy.com', '', false );
}

/**
* Sanitize survey embed.
*
* @param DOMElement $node Survey embed.
*/
private function sanitize_survey_embed( DOMElement $node ) {
$settings = json_decode( $node->getAttribute( 'data-settings' ), false );

// We can't form the iframe URL without a domain and survey ID.
if ( ! ( property_exists( $settings, 'domain' ) || property_exists( $settings, 'id' ) ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ( ! ( property_exists( $settings, 'domain' ) || property_exists( $settings, 'id' ) ) ) {
if ( ! ( is_object( $settings ) && ( property_exists( $settings, 'domain' ) && property_exists( $settings, 'id' ) ) ) {

return;
}

return $output;
// Logic for building the iframe `src` can be found in https://polldaddy.com/survey.js.
$iframe_src = sprintf(
'https://%s/%s?%s',
$settings->domain,
$settings->id,
property_exists( $settings, 'auto' ) && $settings->auto ? 'ft=1&iframe=' . amp_get_current_url() : 'iframe=1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
property_exists( $settings, 'auto' ) && $settings->auto ? 'ft=1&iframe=' . amp_get_current_url() : 'iframe=1'
( property_exists( $settings, 'auto' ) && $settings->auto ) ? 'ft=1&iframe=' . amp_get_current_url() : 'iframe=1'

);

$iframe_node = AMP_DOM_Utils::create_node(
Document::fromNode( $node ),
'iframe',
[
'src' => $iframe_src,
'layout' => 'responsive',
'width' => 600,
'height' => 600,
'frameborder' => 0,
'scrolling' => 'no',
'allowtransparency' => 'true',
'sandbox' => 'allow-scripts allow-same-origin',
]
);

$this->remove_script_sibling( $node, null, 'https://polldaddy.com/survey.js' );

$node->parentNode->replaceChild( $iframe_node, $node );
}
}
Loading