emulation -> accuracy
slevithan committed Nov 6, 2024
1 parent 42624bd commit 3a56346
Showing 14 changed files with 101 additions and 75 deletions.
28 changes: 16 additions & 12 deletions README.md
@@ -13,7 +13,7 @@ Compared to running the actual [Oniguruma](https://github.com/kkos/oniguruma) C
### [Try the demo REPL](https://slevithan.github.io/oniguruma-to-es/demo/)

-Oniguruma-To-ES deeply understands all of the hundreds of large and small differences in Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's *obsessive* about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have **exactly the same behavior**, even in extreme edge cases. And it's battle-tested on thousands of real-world Oniguruma regexes used in TextMate grammars (via the Shiki library). A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can set the `emulation` option to `strict` and throw for such patterns (see details below).
+Oniguruma-To-ES deeply understands all of the hundreds of large and small differences in Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's *obsessive* about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have **exactly the same behavior**, even in extreme edge cases. And it's battle-tested on thousands of real-world Oniguruma regexes used in TextMate grammars (via the Shiki library). A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can set the `accuracy` option to throw for such patterns (see details below).

## 📜 Contents

@@ -83,7 +83,7 @@ A string with `i`, `m`, and `x` in any order (all optional).

```ts
type CompileOptions = {
-emulation?: 'strict' | 'default' | 'loose';
+accuracy?: 'strict' | 'default' | 'loose';
global?: boolean;
hasIndices?: boolean;
maxRecursionDepth?: number | null;
@@ -139,24 +139,28 @@ function toRegexAst(

These options are shared by functions [`compile`](#compile) and [`toRegExp`](#toregexp).

-### `emulation`
+### `accuracy`

One of `'strict'`, `'default'` *(default)*, or `'loose'`.

-Sets the level of emulation strictness.
+Sets the level of emulation rigor/strictness.

- **Strict:** Throw if the pattern can't be emulated with identical behavior (even in rare edge cases) for the given target.
- **Default:** The best choice in most cases. Permits a few close approximations of Oniguruma in order to support additional features.
- **Loose:** Useful for non-critical matching like syntax highlighting where having some mismatches is better than not working.

-Each level of increased emulation strictness supports a subset of patterns supported by less strict modes. If a given pattern doesn't produce an error for a particular emulation mode, its generated result will be identical with all lower levels of strictness (given the same `target`).
+Each level of increased accuracy supports a subset of patterns supported by lower accuracies. If a given pattern doesn't produce an error for a particular accuracy, its generated result will be identical with all lower levels of accuracy (given the same `target`).
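
Because a pattern that compiles without error at one accuracy produces the same result at every lower accuracy, a caller could try the levels from most to least accurate and record which one succeeded. The following is a minimal sketch of that idea and is not part of the library: `compileAtHighestAccuracy` is a hypothetical helper, and `compileFn` stands in for the library's `compile`, assumed here to take `(pattern, flags, options)` and to throw when the requested accuracy can't be honored.

```js
// Hypothetical helper (not part of Oniguruma-To-ES): tries each accuracy level from most to
// least accurate and returns the first result that doesn't throw
const ACCURACY_LEVELS = ['strict', 'default', 'loose'];

function compileAtHighestAccuracy(compileFn, pattern, flags = '', options = {}) {
  let lastError = null;
  for (const accuracy of ACCURACY_LEVELS) {
    try {
      // `compileFn` is assumed to have the same call shape as the library's `compile`
      return {accuracy, result: compileFn(pattern, flags, {...options, accuracy})};
    } catch (err) {
      lastError = err;
    }
  }
  // Every level failed (e.g. the pattern itself is invalid), so surface the last error
  throw lastError;
}
```

The returned `accuracy` tells the caller whether the result is an exact emulation (`'strict'`) or relies on one of the approximations described for the lower levels.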

<details>
<summary>More details</summary>

-#### `default` mode
+#### `strict`

-Supports all features of `strict` mode, plus the following additional features, depending on `target`:
+Supports slightly fewer features, but the missing features are all relatively uncommon (see below).
+
+#### `default`
+
+Supports all features of `strict`, plus the following additional features, depending on `target`:

- All targets (`ESNext` and earlier):
- Enables use of `\X` using a close approximation of a Unicode extended grapheme cluster.
@@ -166,12 +170,12 @@ Supports all features of `strict` mode, plus the following additional features,
- `ES2018`:
- Enables use of POSIX classes `[:graph:]` and `[:print:]` using ASCII-based versions rather than the Unicode versions available for `ES2024` and later. Other POSIX classes are always based on Unicode.

-#### `loose` mode
+#### `loose`

Supports all features of `default`, plus the following:

- Silences errors for unsupported uses of the search-start anchor `\G` (a flexible assertion that doesn’t have a direct equivalent in JavaScript).
-- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using `loose` mode, if a `\G` assertion is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives.
+- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using `loose` accuracy, if a `\G` assertion is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives.
</details>
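
To make the `loose` fallback for `\G` more concrete, the sketch below shows how JavaScript's `y` flag behaves on its own. It uses a plain `RegExp` as a stand-in for the kind of result described above for a pattern like `a\G` (the `\G` dropped, flag `y` added); it is illustrative only and isn't actual library output. `matchAt` is a hypothetical helper.

```js
// Stand-in for the kind of result described for `a\G` under loose accuracy: the `\G` is
// dropped and flag `y` is added (illustrative; not generated by the library)
const re = new RegExp('a', 'y');

// Hypothetical helper: attempts a match starting exactly at `pos`
function matchAt(regex, str, pos) {
  regex.lastIndex = pos;
  const match = regex.exec(str);
  return match ? match.index : null;
}

// A sticky regex only tries the position in `lastIndex` and never scans ahead
const atStart = matchAt(re, 'ba', 0); // 'b' is at index 0, so there's no match
const atOne = matchAt(re, 'ba', 1); // 'a' is at index 1, so the match starts there
```

Flag `y` pins only the start of the overall match to `lastIndex`, whereas the removed `\G` could have appeared anywhere in the pattern. That mismatch is the source of the possible false positives and negatives.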

### `global`
@@ -190,9 +194,9 @@ Include JavaScript flag `d` (`hasIndices`) in the result.

*Default: `6`.*

-If an integer between `2` and `100`, common recursion forms are supported and recurse up to the specified depth limit. If set to `null`, any use of recursion results in an error.
+Specifies the recursion depth limit. Supported values are integers `2` to `100` and `null`. If `null`, any use of recursion results in an error.

-Since recursion isn't infinite-depth like in Oniguruma, use of recursion also results in an error if the `emulation` option is set to `'strict'`.
+Since recursion isn't infinite-depth like in Oniguruma, use of recursion also results in an error if using strict `accuracy`.
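
The effect of a depth limit can be seen with a simplified model. The sketch below expands the self-recursive pattern `a\g<0>?` into a plain JavaScript pattern of fixed depth. `expandToDepth` is a hypothetical function written for illustration; it isn't how the library or `regex-recursion` implements recursion, and the way depth is counted here is an assumption.

```js
// Simplified model only: expands the self-recursive pattern `a\g<0>?` to a fixed depth
function expandToDepth(depth) {
  if (!Number.isInteger(depth) || depth < 1) {
    throw new RangeError('depth must be a positive integer');
  }
  // The innermost level has no further recursion
  let pattern = 'a';
  for (let level = 1; level < depth; level++) {
    // Each outer level wraps the previous expansion as an optional group
    pattern = `a(?:${pattern})?`;
  }
  return pattern;
}

// Anchored so that only whole-string matches count
const bounded = new RegExp(`^(?:${expandToDepth(3)})$`);
```

With unbounded recursion, this pattern would match a run of `a`s of any length. The bounded version above stops at three, which is the kind of difference that strict `accuracy` refuses to allow.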

<details>
<summary>More details</summary>
@@ -906,7 +910,7 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu

1. Target `ES2018` doesn't allow Unicode property names added in JavaScript specifications after ES2018 to be used.
2. Unicode blocks are easily emulatable but their character data would significantly increase library weight. They're also a deeply flawed and arguably-unuseful feature, given the ability to use Unicode scripts and other properties.
-3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` are an error if option `emulation` is `'strict'`, and they use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later.
+3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later, and they result in an error if using strict `accuracy`.
4. Target `ES2018` doesn't support nested *negated* character classes.
5. It's not an error for *numbered* backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
6. The recursion depth limit is specified by option `maxRecursionDepth`. Some forms of recursion (multiple recursions in the same pattern, and recursion with backreferences) aren't yet supported. Patterns that would error in Oniguruma due to triggering infinite recursion might find a match in Oniguruma-To-ES since recursion is bounded (future versions will detect this and error at transpilation time).
2 changes: 1 addition & 1 deletion demo/demo.css
@@ -34,7 +34,7 @@ h2 {

code {
padding: 0 3px;
-background-color: #f6f6f6;
+background-color: #f0f0f0;
}

kbd {
2 changes: 1 addition & 1 deletion demo/demo.js
@@ -5,8 +5,8 @@ const state = {
x: getValue('flag-x'),
},
opts: {
+accuracy: getValue('option-accuracy'),
allowSubclassBasedEmulation: getValue('option-allowSubclassBasedEmulation'),
-emulation: getValue('option-emulation'),
global: getValue('option-global'),
hasIndices: getValue('option-hasIndices'),
maxRecursionDepth: getValue('option-maxRecursionDepth'),
4 changes: 2 additions & 2 deletions demo/index.html
@@ -44,8 +44,8 @@ <h2>Try it</h2>
<img src="https://upload.wikimedia.org/wikipedia/commons/9/99/Unofficial_JavaScript_logo_2.svg" width="15" height="15">
</label>
<label>
-<code>emulation</code>
-<select id="option-emulation" onchange="setOption('emulation', this.value)">
+<code>accuracy</code>
+<select id="option-accuracy" onchange="setOption('accuracy', this.value)">
<option value="strict">strict</option>
<option value="default" selected>default</option>
<option value="loose">loose</option>
2 changes: 2 additions & 0 deletions spec/helpers/features.js
@@ -17,11 +17,13 @@ const patternModsSupported = (() => {
return true;
})();
const maxTestTargetForPatternMods = patternModsSupported ? null : 'ES2024';
+const minTestTargetForPatternMods = patternModsSupported ? 'ESNext' : Infinity;

const minTestTargetForFlagV = 'ES2024';

export {
maxTestTargetForDuplicateNames,
maxTestTargetForPatternMods,
minTestTargetForFlagV,
+minTestTargetForPatternMods,
};
21 changes: 9 additions & 12 deletions spec/helpers/matchers.js
@@ -2,19 +2,16 @@ import {toRegExp} from '../../dist/index.mjs';
import {EsVersion} from '../../src/utils.js';

function getArgs(actual, expected) {
-const opts = {
-pattern: typeof expected === 'string' ? expected : expected.pattern,
-flags: expected.flags ?? '',
-maxTestTarget: expected.maxTestTarget ?? null,
-minTestTarget: expected.minTestTarget ?? null,
-};
+const max = expected.maxTestTarget;
+const min = expected.minTestTarget;
const targets = ['ES2018', 'ES2024', 'ESNext'];
const targeted = targets.
-filter(target => !opts.maxTestTarget || (EsVersion[target] <= EsVersion[opts.maxTestTarget])).
-filter(target => !opts.minTestTarget || (EsVersion[target] >= EsVersion[opts.minTestTarget]));
+filter(target => !max || EsVersion[target] <= EsVersion[max]).
+filter(target => !min || (min !== Infinity && EsVersion[target] >= EsVersion[min]));
return {
-pattern: opts.pattern,
-flags: opts.flags,
+pattern: typeof expected === 'string' ? expected : expected.pattern,
+flags: expected.flags ?? '',
+accuracy: expected.accuracy ?? 'default',
strings: Array.isArray(actual) ? actual : [actual],
targets: targeted,
};
@@ -27,9 +24,9 @@ function wasFullStrMatch(match, str) {
// Expects `negate` to be set by `negativeCompare` and doesn't rely on Jasmine's automatic matcher
// negation because when negated we don't want to early return `true` when looping over the array
// of strings and one is found to not match; they all need to not match
-function matchWithAllTargets({pattern, flags, strings, targets}, {exact, negate}) {
+function matchWithAllTargets({pattern, flags, strings, targets, accuracy}, {exact, negate}) {
for (const target of targets) {
-const re = toRegExp(pattern, flags, {target});
+const re = toRegExp(pattern, flags, {accuracy, target});
for (const str of strings) {
// In case `flags` includes `g` or `y`
re.lastIndex = 0;
4 changes: 2 additions & 2 deletions spec/match-assertion.spec.js
@@ -178,14 +178,14 @@ describe('Assertion', () => {
expect(() => compile(r`(?=ab\G)`)).toThrow();
});

-it('should allow unsupported forms if using loose emulation', () => {
+it('should allow unsupported forms if using loose accuracy', () => {
const patterns = [
r`a\G`,
r`\G|`,
];
patterns.forEach(pattern => {
expect(() => compile(pattern)).toThrow();
-expect(toRegExp(pattern, '', {emulation: 'loose'}).sticky).toBe(true);
+expect(toRegExp(pattern, '', {accuracy: 'loose'}).sticky).toBe(true);
});
});

31 changes: 28 additions & 3 deletions spec/match-backreference.spec.js
@@ -1,15 +1,13 @@
import {compile} from '../dist/index.mjs';
import {cp, r} from '../src/utils.js';
-import {maxTestTargetForDuplicateNames} from './helpers/features.js';
+import {maxTestTargetForDuplicateNames, maxTestTargetForPatternMods, minTestTargetForPatternMods} from './helpers/features.js';
import {matchers} from './helpers/matchers.js';

beforeEach(() => {
jasmine.addMatchers(matchers);
});

describe('Backreference', () => {
-// TODO: Test that case-insensitive backref to case-sensitive group requires `ESNext` or non-`strict` emulation
-
describe('numbered backref', () => {
it('should rematch the captured text', () => {
expect('aa').toExactlyMatch(r`(a)\1`);
@@ -366,4 +364,31 @@ describe('Backreference', () => {
expect(['aaba', 'bbab']).not.toFindMatch(r`(?<a>(?<b>\w)\k<b>)\g<a>`);
});
});
+
+it('should match case-insensitive backref to case-sensitive group', () => {
+// Real support with target ESNext
+expect(['aa', 'aA']).toExactlyMatch({
+pattern: r`(a)(?i)\1`,
+minTestTarget: minTestTargetForPatternMods,
+});
+expect(['Aa', 'AA']).not.toFindMatch({
+pattern: r`(a)(?i)\1`,
+minTestTarget: minTestTargetForPatternMods,
+});
+// Throw with strict `accuracy` if target not ESNext
+['ES2018', 'ES2024'].forEach(target => {
+expect(() => compile(r`(a)(?i)\1`, '', {
+accuracy: 'strict',
+target,
+})).toThrow();
+});
+// Matches same case as group with other `accuracy` values
+['default', 'loose'].forEach(accuracy => {
+expect('aa').toExactlyMatch({
+pattern: r`(a)(?i)\1`,
+accuracy,
+maxTestTarget: maxTestTargetForPatternMods,
+});
+});
+});
});
6 changes: 3 additions & 3 deletions spec/match-recursion.spec.js
@@ -7,9 +7,9 @@ beforeEach(() => {
});

describe('Recursion', () => {
-it('should throw if recursion used with strict emulation', () => {
-expect(() => compile(r`a\g<0>?`, '', {emulation: 'strict'})).toThrow();
-expect(() => compile('', '', {emulation: 'strict'})).not.toThrow();
+it('should throw if recursion used with strict accuracy', () => {
+expect(() => compile(r`a\g<0>?`, '', {accuracy: 'strict'})).toThrow();
+expect(() => compile('', '', {accuracy: 'strict'})).not.toThrow();
});

it('should throw if recursion used with null maxRecursionDepth', () => {
22 changes: 10 additions & 12 deletions src/compile.js
@@ -2,13 +2,13 @@ import {generate} from './generate.js';
import {parse} from './parse.js';
import {tokenize} from './tokenize.js';
import {transform} from './transform.js';
-import {EmulationMode, EsVersion, Target} from './utils.js';
+import {Accuracy, EsVersion, Target} from './utils.js';
import {atomic, possessive} from 'regex/atomic';
import {recursion} from 'regex-recursion';

/**
@typedef {{
-emulation?: keyof EmulationMode;
+accuracy?: keyof Accuracy;
global?: boolean;
hasIndices?: boolean;
maxRecursionDepth?: number | null;
@@ -56,8 +56,8 @@ function compileInternal(pattern, flags, options) {
skipBackrefValidation: opts.tmGrammar,
});
const regexAst = transform(onigurumaAst, {
+accuracy: opts.accuracy,
allowSubclassBasedEmulation: opts.allowSubclassBasedEmulation,
-emulation: opts.emulation,
bestEffortTarget: opts.target,
});
const generated = generate(regexAst, opts);
@@ -66,14 +66,14 @@ function compileInternal(pattern, flags, options) {
flags: `${opts.hasIndices ? 'd' : ''}${opts.global ? 'g' : ''}${generated.flags}${generated.options.disable.v ? 'u' : 'v'}`,
};
if (regexAst._strategy) {
-let emulationSubpattern = null;
+let subpattern = null;
result.pattern = result.pattern.replace(/\(\?:\\p{sc=<<}\|(.*?)\|\\p{sc=>>}\)/s, (_, sub) => {
-emulationSubpattern = sub;
+subpattern = sub;
return '';
});
result._internal = {
strategy: regexAst._strategy.name,
-subpattern: emulationSubpattern,
+subpattern,
};
}
return result;
@@ -90,19 +90,17 @@ function getOptions(options) {
}
// Set default values
return {
+// Sets the level of emulation rigor/strictness
+accuracy: 'default',
// Allows advanced emulation strategies that rely on returning a `RegExp` subclass with an
// overridden `exec` method. A subclass is only used if needed for the given pattern
allowSubclassBasedEmulation: false,
-// Sets the level of emulation strictness; `default` is best in most cases. If `strict`, throws
-// if the pattern can't be emulated with identical behavior (even in rare edge cases) for the
-// given target
-emulation: 'default',
// Include JS flag `g` in the result
global: false,
// Include JS flag `d` in the result
hasIndices: false,
-// If an integer between `2` and `100`, common recursion forms are supported and recurse up to
-// the specified depth limit. If set to `null`, any use of recursion results in an error
+// Specifies the recursion depth limit. Supported values are integers `2` to `100` and `null`.
+// If `null`, any use of recursion results in an error
maxRecursionDepth: 6,
// Simplify the generated pattern when it doesn't change the meaning
optimize: true,
12 changes: 6 additions & 6 deletions src/generate.js
@@ -56,13 +56,13 @@ function generate(ast, options) {
};
let lastNode = null;
const state = {
+accuracy: opts.accuracy,
appliedGlobalFlags,
captureFlagIMap: new Map(),
currentFlags: {
dotAll: ast.flags.dotAll,
ignoreCase: ast.flags.ignoreCase,
},
-emulation: opts.emulation,
groupNames: new Set(),
inCharClass: false,
lastNode,
@@ -226,11 +226,11 @@ function genBackreference({ref}, state) {
}
if (
!state.useFlagMods &&
-state.emulation === 'strict' &&
+state.accuracy === 'strict' &&
state.currentFlags.ignoreCase &&
!state.captureFlagIMap.get(ref)
) {
-throw new Error('Use of case-insensitive backref to case-sensitive group requires target ESNext or non-strict emulation');
+throw new Error('Use of case-insensitive backref to case-sensitive group requires target ESNext or non-strict accuracy');
}
return '\\' + ref;
}
@@ -342,7 +342,7 @@ function genCharacterSet({kind, negate, value, key}, state) {
UnicodePropertiesWithSpecificCase.has(value)
) {
// Support for this would require heavy Unicode data. Could change e.g. `\p{Lu}` to `\p{LC}`
-// if not using `strict` emulation (since it's close but not 100%), but this wouldn't work
+// if not using strict `accuracy` (since it's close but not 100%), but this wouldn't work
// for e.g. `\p{Lt}`, and in any case, it's probably user error if using these case-specific
// props case-insensitively
throw new Error(`Unicode property "${value}" can't be case-insensitive when other chars have specific case`);
@@ -393,8 +393,8 @@ function genRecursion({ref}, state) {
if (!rDepth) {
throw new Error('Use of recursion disabled');
}
-if (state.emulation === 'strict') {
-throw new Error('Use of recursion requires non-strict emulation');
+if (state.accuracy === 'strict') {
+throw new Error('Use of recursion requires non-strict accuracy due to depth limit');
}
// Using the syntax supported by `regex-recursion`
return ref === 0 ? `(?R=${rDepth})` : r`\g<${ref}&R=${rDepth}>`;