Non-space whitespace characters are removed from anchor URL #266

Open
ranvis opened this issue Dec 4, 2018 · 2 comments

Comments

ranvis commented Dec 4, 2018

Leading and trailing whitespace characters other than the HTML space characters (for example U+000B or U+3000) are stripped from the link value along with the space characters, so extracting or following such a link fails.

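use feature 'say';
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
# href is a single U+000B (vertical tab)
$mech->update_html(qq'<a href="\x0b">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
# href is a single U+3000 (ideographic space)
$mech->update_html(qq'<a href="\x{3000}">link</a>');
say length $mech->links->[0]->URI->as_string; # 0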

According to the HTML5 spec, the space characters are /[\x09\x0a\x0c\x0d\x20]/:

https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.

Re: stripping leading and trailing white space
https://www.w3.org/TR/html52/infrastructure.html#strip-leading-and-trailing-white-space
When a user agent is to strip leading and trailing white space from a string, the user agent must remove all space characters that are at the start or end of the string.

Re: space characters
https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

URI->new() is causing this. As its documentation says, it removes leading and trailing white space, and it does so with \s, whose meaning depends on the version of the Unicode spec that each version of Perl conforms to.
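
To make the difference concrete, here is a minimal sketch that contrasts a \s-based trim with a trim limited to the five HTML space characters. Both helper names (strip_perl_whitespace, strip_html_space) are hypothetical and exist only for illustration; neither is part of URI or WWW::Mechanize, and the \s version only approximates what URI->new() is described as doing.

```perl
use strict;
use warnings;

# Hypothetical helper: trims with \s, i.e. whatever this Perl treats as whitespace.
sub strip_perl_whitespace {
    my ($str) = @_;
    $str =~ s/\A\s+//;
    $str =~ s/\s+\z//;
    return $str;
}

# Hypothetical helper: trims only the HTML "space characters"
# (tab, LF, FF, CR, space), as quoted from the spec above.
sub strip_html_space {
    my ($str) = @_;
    $str =~ s/\A[\x09\x0a\x0c\x0d\x20]+//;
    $str =~ s/[\x09\x0a\x0c\x0d\x20]+\z//;
    return $str;
}

# First href is a lone U+3000; second is a path padded with HTML space characters.
for my $href ("\x{3000}", " \t/path\r\n") {
    printf "\\s trim leaves %d chars, HTML-space trim leaves %d chars\n",
        length(strip_perl_whitespace($href)),
        length(strip_html_space($href));
}
```

With the lone U+3000, the \s trim leaves nothing while the HTML-space trim keeps the character. For the padded path, both trims leave the same five characters. Whether U+000B is matched by \s depends on the Perl version, so it is left out of the sketch.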

Member

oalders commented Dec 4, 2018

So, is the behaviour of URI incorrect here or do we need an option to define what URI considers to be whitespace at https://metacpan.org/source/ETHER/URI-1.74/lib/URI.pm#L43-44?

Author

ranvis commented Dec 5, 2018

The stripping code was committed in 1996 (leaving aside libwww-perl 0.20~0.30):
https://metacpan.org/source/GAAS/libwww-perl-5.00/lib/URI/URL.pm#L90-93
It was added because the appendix of the old RFC 1738 says that URLs appearing in email and similar contexts may be surrounded by extra characters that are not themselves part of the URL.
Now in 2018, I think the behavior can still be called consistent if URI is meant to trim whitespace the way a web browser's location bar does (since URI no longer mentions the RFC). But for a module, isn't it being overly accommodating in the era of Unicode-aware regexes?

The following crafted example does not work either. I think URI is now used more widely than it was originally designed for, and that the current stripping is somewhat obsolete.

$mech->update_html(qq'<a href="&lt;URL:&gt;">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
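
As a rough illustration of why this href ends up empty, the sketch below unwraps the RFC 1738 style "<URL:...>" wrapper with a single regex. The helper name unwrap_rfc1738 and the exact pattern are assumptions made for illustration, not a copy of URI's source.

```perl
use strict;
use warnings;

# Hypothetical helper (not URI's actual code): removes a surrounding
# "<URL:...>" or "<...>" wrapper and keeps only what is inside.
sub unwrap_rfc1738 {
    my ($str) = @_;
    $str =~ s/\A<(?:URL:)?(.*)>\z/$1/s;
    return $str;
}

# href="&lt;URL:&gt;" decodes to the literal string "<URL:>".
my $href = '<URL:>';
printf "%d\n", length(unwrap_rfc1738($href));
```

Under that assumption the wrapper is treated as decoration and nothing is left of the link value, even though the HTML author wrote those characters as the href itself.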
