fix(module-source): Sub non-conforming ZWJ prefixes with CGJ #2436

Merged

Conversation

SMotaal
Contributor

@SMotaal SMotaal commented Aug 28, 2024

Description

This PR changes the HIDDEN_PREFIX of ModuleSource from the non-conforming $h\u200D_ zero-width joiner (ZWJ) notation to the conforming $h\u034F_ combining grapheme joiner (CGJ) notation.

A future PR may address further changes to a $\u034F-prefixed and \u034F$-suffixed format as was suggested by @michaelfig in discussions.

Motivation

This change was prompted by a parsing error encountered when using rollup, which was traced back to the $h\u200D_-prefixed identifier in an endoScript bundle. More importantly, it is motivated by the subsequent discovery that rollup's implementation was in fact conforming to the ECMAScript Specification when it threw this error.

To elaborate, while runtimes today accept the special identifier notation introduced by the ModuleSource rewrites, the current $h\u200D_ zero-width joiner (ZWJ) notation does not conform to the ECMAScript Lexical Grammar. In essence, the specification requires that the character sequence of an Identifier Name, once unescaped, match the /^[$_\p{ID_Start}][$_\p{ID_Continue}]*$/u pattern, aside from the additional # prefix required in the case of private fields.

As such, one can test this in the console by evaluating the following:

Object.fromEntries([String.raw`$h\u200D_`, String.raw`$h\u034F_`].map(id => [id, /^[$_\p{ID_Start}][$_\p{ID_Continue}]*$/u.test(JSON.parse(`"${id}"`))]))

The above would yield the following object in a runtime where the unicode escape sequences are retained:

{$h\u200D_: false, $h\u034F_: true}
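The one-liner above can be unpacked into a small helper for readability. This is only a sketch: `IDENTIFIER_NAME_PATTERN` and `matchesIdentifierPattern` are names made up for illustration, and the pattern is the one quoted in this description rather than a full implementation of the ECMAScript grammar. Property escapes such as `\p{ID_Continue}` are resolved against the Unicode data bundled with the runtime, so results may differ between engine versions and no output is shown here.

```javascript
// Pattern quoted in the description above; not the complete ECMAScript grammar.
const IDENTIFIER_NAME_PATTERN = /^[$_\p{ID_Start}][$_\p{ID_Continue}]*$/u;

// Hypothetical helper: decode \uXXXX escapes in a candidate name, then test it.
function matchesIdentifierPattern(escapedName) {
  const unescaped = JSON.parse(`"${escapedName}"`);
  return IDENTIFIER_NAME_PATTERN.test(unescaped);
}

// The two notations discussed in this PR, kept in their escaped spelling.
const candidates = [String.raw`$h\u200D_`, String.raw`$h\u034F_`];
for (const candidate of candidates) {
  console.log(candidate, matchesIdentifierPattern(candidate));
}
```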

Looking closer at the Unicode Standard, it seems that the zero-width joiner (ZWJ) may indeed be used in a conforming notation per the Emoji Profile in Unicode Standard Annex #31; however, that profile does not apply here, as it would require the use of emoji.

At this point, my suggestion to instead use the combining grapheme joiner (CGJ) is best articulated with this excerpt that I am borrowing from its canonical Wikipedia entry:

However, in contrast to the zero-width joiner and similar characters, the CGJ does not affect whether the two letters are rendered separately or as a ligature or cursively joined—the default behavior for this is determined by the font.1

The Wikipedia article offers additional nuances about the differences, while the Proposal for addition of COMBINING GRAPHEME JOINER offers the necessary context about its intent.

It is fair to note that there are many uses of the zero-width joiner (ZWJ) already in the wild, and in fact there are currently test262 tests for its occurrence. That said, unless those uses are conforming to the ECMAScript Specification and the Unicode Standard, they will limit code portability and adoption by users who may end up confused by failures similar to the one encountered with rollup.

Ultimately, even with the reasonable recommendation to exercise caution when bundling ses and related sources, which are best bundled with bundleSource instead, those sources may still need to be parsed by tools like rollup for other purposes, with the expectation that they are handled safely.

Approach

Substituting the invisible joiner character

A search across the monorepo for (?:\u200d|\\u200d)_ yields only 3 files of interest:

  • packages/module-source/TESTS.md
  • packages/module-source/src/hidden.js
  • packages/module-source/test/module-source.test.js

While making changes to the 3 files of interest, a distinction is made between matching \$h\\u200d_ and \$h\u200d_ where the replacements are respectively $h\\u034f_ and $h\u034f_, along with their $c equivalents.

The same search for (?:\u200d|\\u200d)_ yields another 978 files that are not of interest, all found in:

  • packages/test262-runner/test262/test/language/expressions/class/elements
  • packages/test262-runner/test262/test/language/statements/class/elements

All those files remain unchanged.
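The distinction between the escaped and literal forms can be sketched as follows. This is an illustration only, not the change made in this PR: `substituteJoiner` is a hypothetical helper, and it assumes the only prefixes of interest are `$h` and `$c`.

```javascript
const CGJ = '\u034f'; // COMBINING GRAPHEME JOINER

// Hypothetical helper: swap the ZWJ for the CGJ in $h/$c prefixes only.
function substituteJoiner(text) {
  return text
    // Escaped form: a backslash followed by "u200d" spelled out in the text.
    .replace(/\$([hc])\\u200[dD]_/g, (_match, kind) => `$${kind}\\u034f_`)
    // Literal form: the actual U+200D character.
    .replace(/\$([hc])\u200d_/g, (_match, kind) => `$${kind}${CGJ}_`);
}

const escaped = substituteJoiner(String.raw`$h\u200d_import`);
const literal = substituteJoiner('$c\u200d_x');
// A ZWJ that is not part of a $h/$c prefix is left alone.
const untouched = substituteJoiner('a\u200d_b');
```

Function replacers are used so that the `$` in the output is never interpreted as a replacement pattern.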

Ensuring generic wording is used

For testing and other purposes where descriptive phrases are used to refer to the use of ZWJ, CGJ or other characters for this same intent, the phrase "invisible joiner character" is suggested.

Security Considerations

Does not apply to my knowledge

Scaling Considerations

Does not apply to my knowledge

Documentation Considerations

Does not apply to my knowledge

Testing Considerations

See: #2436 (comment)

Compatibility Considerations

While the changes do not affect compatibility when the generated code is evaluated at runtime, there can potentially be compatibility concerns with tools that have been specifically designed to work with the current notation.

Upgrade Considerations

Does not apply to my knowledge

Footnotes

  1. https://en.wikipedia.org/wiki/Combining_grapheme_joiner

@SMotaal SMotaal marked this pull request as ready for review August 28, 2024 15:46
@SMotaal SMotaal force-pushed the smotaal/module-source-with-cgj-identifiers branch from 3b32ab3 to eff44d1 Compare August 28, 2024 20:57
Member

@kriskowal kriskowal left a comment

I’m in favor of this change. Since this is a design passed down from @michaelfig and @erights, I would ask for them to sign-off as well.

There are some tests for "zero width joiner" invariants in @endo/module-source. We need to make sure that CGJ is disallowed for all source in ModuleSource(source).

@SMotaal
Contributor Author

SMotaal commented Aug 29, 2024

@kriskowal To your point:

There are some tests for "zero width joiner" invariants in @endo/module-source. We need to make sure that CGJ is disallowed for all source in ModuleSource(source).

I only came across the following throw statements in packages/module-source/**/*.js:

    Identifier(path) {
      if (options.allowHidden || allowedHiddens.has(path.node)) {
        return;
      }
      // Ensure the parse doesn't already include our required hidden identifiers.
      // console.log(`have identifier`, path.node);
      const i = h.HIDDEN_IDENTIFIERS.indexOf(path.node.name);
      if (i >= 0) {
        throw path.buildCodeFrameError(
          `The ${h.HIDDEN_IDENTIFIERS[i]} identifier is reserved`,
        );
      }
      if (path.node.name.startsWith(h.HIDDEN_CONST_VAR_PREFIX)) {
        throw path.buildCodeFrameError(
          `The ${path.node.name} constant variable is reserved`,
        );
      }
    },

The code is structured to ensure there is a single source of truth, so this holds with CGJ all the same.
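To illustrate the single-source-of-truth point outside of the Babel visitor, here is a standalone sketch. The `$h\u034f_` prefix comes from this PR's description. The `$c\u034f_` value, the contents of `HIDDEN_IDENTIFIERS`, and the `assertNotReserved` helper are assumptions made for the example and may not match hidden.js.

```javascript
// Prefix from the PR description; everything derived from it follows the joiner.
const HIDDEN_PREFIX = '$h\u034f_';
// Assumed "$c equivalent" of the prefix above.
const HIDDEN_CONST_VAR_PREFIX = '$c\u034f_';
// Made-up example entries; the real list lives in hidden.js.
const HIDDEN_IDENTIFIERS = [`${HIDDEN_PREFIX}import`, `${HIDDEN_PREFIX}once`];

// Hypothetical stand-in for the checks in the visitor quoted above.
function assertNotReserved(name) {
  if (HIDDEN_IDENTIFIERS.includes(name)) {
    throw new SyntaxError(`The ${name} identifier is reserved`);
  }
  if (name.startsWith(HIDDEN_CONST_VAR_PREFIX)) {
    throw new SyntaxError(`The ${name} constant variable is reserved`);
  }
}

assertNotReserved('ordinaryName'); // does not throw
```

Because both checks read from the same constants, changing the joiner in one place changes what is rejected everywhere.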


That said, I am also not sure if the following could prove problematic down the road:

new ModuleSource<CGJ>(bundleSource<ZWJ>(sourcePath, { format: 'endoScript' }).source);

I don't foresee this being a problem, but I wonder if adding a test case could be reasonable while we're here.


Concluding Note:

I raised the above based on the realization that until this change, there was never a case where a rewritten module could have undergone a second pass.

In other words, while the intent is that we are changing the invisible joiner character, the outcome is that we are also altering the behaviour of the constructor. This altered behaviour may lead to edge cases where modules previously rewritten with ZWJ now end up passing through a CGJ rewrite where the constructor will not throw.
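The edge case can be shown with plain string checks. This is a sketch under the assumption that the reserved check is a prefix comparison, as in the visitor quoted earlier. `ZWJ_CONST_PREFIX`, `CGJ_CONST_PREFIX`, and `isReservedUnderCgj` are hypothetical names.

```javascript
const ZWJ_CONST_PREFIX = '$c\u200d_'; // prefix used before this change
const CGJ_CONST_PREFIX = '$c\u034f_'; // prefix used after this change

// Hypothetical check mirroring the startsWith test, using the CGJ prefix.
const isReservedUnderCgj = name => name.startsWith(CGJ_CONST_PREFIX);

console.log(isReservedUnderCgj(`${ZWJ_CONST_PREFIX}x`)); // false: ZWJ names pass
console.log(isReservedUnderCgj(`${CGJ_CONST_PREFIX}x`)); // true: CGJ names are rejected
```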

We are in agreement that we do not foresee problems, at least in theory, so we are omitting a test for this case from this PR.

@SMotaal SMotaal force-pushed the smotaal/module-source-with-cgj-identifiers branch from eff44d1 to ef92306 Compare August 30, 2024 14:13
@kriskowal
Member

I’m not worried about mixing CGJ and ZWJ. Any given version of ModuleSource censors and produces the same joiner, and it doesn’t depend on version coördination with other packages.

I just want to make the names of the tests have "combining grapheme joiner" instead of "zero-width joiner" now, or some abstract term that would apply regardless of which is used. Otherwise, this is good to go.

Member

@michaelfig michaelfig left a comment

LGTM!

I marked a few cases where the test description could be loosened to just "joiner". Up to you as to what you do with these descriptions.

@SMotaal SMotaal force-pushed the smotaal/module-source-with-cgj-identifiers branch from ef92306 to 862dfd8 Compare September 1, 2024 21:10
@michaelfig michaelfig dismissed kriskowal’s stale review September 1, 2024 21:13

As requested, @SMotaal updated the occurrences of "zero width joiner" (ZWJ)

@SMotaal SMotaal changed the base branch from master to mfig-o September 1, 2024 21:44
@SMotaal SMotaal changed the base branch from mfig-o to master September 1, 2024 21:45
@SMotaal SMotaal merged commit cc82132 into endojs:master Sep 1, 2024
15 checks passed
@SMotaal SMotaal deleted the smotaal/module-source-with-cgj-identifiers branch September 2, 2024 09:38