Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update String#trim #354

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

lewisje
Copy link

@lewisje lewisje commented Jun 26, 2015

Check for two commonly erroneously trimmed characters instead of one, and check for erroneously failing to trim the trimmable whitespace characters; then use aa* instead of a+, and two regexes instead of one, both because they are more performant (it is also more performant to first trim from the beginning and then trim from the end).

I would also replace if (typeof this === 'undefined' || this === null) with if (this == null) but I don't know whether that's allowed by this library's style rules.

I would also update the the minified shim file, but I suspect that's something the maintainers instead do before every proper release.

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

Thanks for the contribution!

You're correct about the style and the proper time to update minified files.

Can you also provide a test case that would fail without this change?

Also, I'd prefer to separate the bug fix PR from the performance PR, if possible.

Please also rebase out typo fix commits and force push :-)

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

Can you also link to the ES6 spec and show where these characters should be included?

@mathiasbynens
Copy link
Contributor

Can you also link to the ES6 spec and show where these characters should be included?

The spec is here: http://ecma-international.org/ecma-262/6.0/#sec-white-space

Throughout the spec, WhiteSpace consists of U+0009, U+000B, U+000C, U+0020, U+00A0, and U+FEFF + all code points in the Zs category as per Unicode v5.1.0 + any additional code points in the Zs category in the Unicode version used by the engine. In regexpu, I assume the latest available Unicode version (i.e. Unicode v8.0.0 at the time of writing) for the latter.

https://github.com/mathiasbynens/regexpu/blob/ff17a00b63a017a69fd93da455a8944eb18918ce/scripts/character-class-escape-sets.js#L72-L80

The algorithm for String.prototype.trim specifies that “the definition of white space is the union of WhiteSpace and LineTerminator”. So, the regular expression is equivalent to /\s/u in a conforming ES6 engine: https://mothereff.in/regexpu#%2F%5Cs%2Fu This, in turn, is equivalent to /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/. This regex is much simpler than the one that is currently used.

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

So are the code points we're missing ones that were added in later Unicode versions? I'm still not clear on the problem here, but I do want to conform :-)

@mathiasbynens
Copy link
Contributor

Looks like this patch doesn’t add any code points to the whitespace set. It only checks
'\u0085\u002B' instead of just '\u0085' for hasStringBug.

The patch description mentions these are commonly erroneously trimmed. I doubt the spec has info on that, @ljharb. Perhaps @lewisje can tell us which engines incorrectly trim \u002B?

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

I don't have information on any engines that commonly erroneously trim either of those characters: I just know that this shim checks \x85 and es5-shim checks \u200B, so I thought both of them were erroneously trimmed widely enough to be checked.

Whitespace is defined in section 7.2 of the ES5 spec: http://es5.github.io/#x7.2

It includes \x09, \x0B, \x0C, \x20, \xA0, and \uFEFF and makes reference to the Unicode category "Zs": http://www.unicode.org/Public/8.0.0/ucd/PropList.txt

Both U+180E MONGOLIAN VOWEL SEPARATOR and U+200B ZERO WIDTH SPACE were re-classified from Zs (Space_Separator) to Cf (Format), while U+0085 NEXT LINE (NEL) is in Cc (Control); U+180E was moved in Unicode 6.3, U+200B was moved in Unicode 4.0.1, and U+0085 has been in Cc at least since Unicode 1.1.5.

Line terminators are defined in 7.3 as \x0A, \x0D, \u2028, and \u2029: http://es5.github.io/#x7.3

The spec says, in 15.5.4.20, that String#trim must remove "both leading and trailing white space" where "white space" means both whitespace and line terminators: http://es5.github.io/#x15.5.4.20

The ES6 spec has the same definitions for "whitespace" and "line terminator" as the ES5 spec, in Tables 32 and 33, respectively.

The ES5 spec also says that Unicode 3.0 or later is followed, which means that U+180E and U+200B may be considered non-trimmable, and also that U+0085, despite appearing in Unicode's "White_Space" list, must be considered non-trimmable, because it is not in the explicit list of whitespace or line terminators and is not in Zs: http://es5.github.io/#x2

However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.

This means that the test for improperly trimmed characters in this shim must include both \x85 and \u200B, and in es-shims/es5-shim#316, it must include \x85 but not necessarily \u200B (it currently includes \u200B but not \x85); similarly, this shim may need to remove \u180E from the trim regex, depending on what version of Unicode it's targeting (and then add it to the improper-trim test, if a consistent API is desired).

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

Sounds good - we still need an automated test to prevent your fix from being accidentally removed later.

@mathiasbynens
Copy link
Contributor

However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.

U+180E is Zs as per Unicode 5.1.0 so it must be considered trimmable in order to be ES6-compliant.

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.

I'll get a test in later today, and then feel the frustration that comes with not being able to rebase from GitHub's Web-based interface.

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

Thanks! Totally agree that it's lame you can't rebase on the web

@mathiasbynens
Copy link
Contributor

Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.

The ES6 spec is, and this shim aims to follow that.

Thanks for bringing this up! We might be able to simplify the regexp (see my earlier comment: #354 (comment)).

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

The downside to using ranges in the regexp string is that it couldn't be re-purposed to check for failure to trim all trimmable whitespace (although it would save characters to use \t and \r and the like instead of the \x or \u escapes); also, I'm not sure we should rely on the u flag being available for regexes.

@ljharb
Copy link
Collaborator

ljharb commented Jun 26, 2015

It's ok to make multiple regexes rather than repurposing - the cost of that is negligible.

@mathiasbynens
Copy link
Contributor

I'm not sure we should rely on the u flag being available for regexes.

I never said we should! That wouldn’t work in any existing browser. I was suggesting /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/ (i.e. what /\s/u translates to).

@ljharb
Copy link
Collaborator

ljharb commented Jul 20, 2015

@lewisje Are you planning on completing this PR?

@lewisje
Copy link
Author

lewisje commented Jul 20, 2015

I am, but I got distracted 😞

@lewisje lewisje force-pushed the string-trim-patch branch from dcbdd5d to 53beb7c Compare August 23, 2015 06:08
@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

I made U+200B a separate test because if I combined them, and an implementation fails the "should trim on both sides" test, it could end up passing a test involving \u0085\u200B (say, if it normally trims \u200B but doesn't trim from the right to begin with).

Maybe this means, though, that in es6-shim.js I should have also tested those two characters separately instead of in the same string.

Check for two commonly erroneously trimmed characters instead of one, and check for erroneously failing to trim the trimmable whitespace characters.

I would also replace `if (typeof this === 'undefined' || this === null)` with `if (this == null)` but I don't know whether that's allowed by this library's style rules.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants