Commit

accuracy:'loose' to overrides.allowAllSearchStartAnchors
slevithan committed Nov 22, 2024
1 parent dc7f9e5 commit 0f39c04
Showing 8 changed files with 59 additions and 61 deletions.
37 changes: 12 additions & 25 deletions README.md
@@ -71,7 +71,7 @@ function toRegExp(

```ts
type OnigurumaToEsOptions = {
accuracy?: 'strict' | 'default' | 'loose';
accuracy?: 'default' | 'strict';
avoidSubclass?: boolean;
flags?: string;
global?: boolean;
@@ -139,15 +139,12 @@ The following options are shared by functions [`toRegExp`](#toregexp) and [`toDe

### `accuracy`

One of `'strict'`, `'default'` *(default)*, or `'loose'`.
One of `'default'` *(default)* or `'strict'`.

Sets the level of emulation rigor/strictness.

- **Default:** The best choice in most cases. Permits a few close approximations in order to support additional features.
- **Strict:** Throw if the pattern can't be emulated with identical behavior (even in rare edge cases) for the given `target`.
- **Default:** The best choice in most cases. Permits a few close approximations of Oniguruma in order to support additional features.
- **Loose:** Useful for non-critical matching like syntax highlighting where having some mismatches is better than not working.

Each level of increased accuracy supports a subset of patterns supported by lower accuracies. If a given pattern doesn't produce an error for a particular accuracy, its generated result will be identical with all lower levels of accuracy (given the same `target`).

<details>
<summary>More details</summary>
@@ -162,18 +159,11 @@ Supports all features of `strict`, plus the following additional features, depen

- All targets (`ES2025` and earlier):
- Enables use of `\X` using a close approximation of a Unicode extended grapheme cluster.
- Enables recursion (e.g. via `\g<0>`) with a depth limit specified by option `maxRecursionDepth`.
- Enables recursion (ex: `\g<0>`) with a depth limit specified by option `maxRecursionDepth`.
- `ES2024` and earlier:
- Enables use of case-insensitive backreferences to case-sensitive groups.
- `ES2018`:
- Enables use of POSIX classes `[:graph:]` and `[:print:]` using ASCII-based versions rather than the Unicode versions available for `ES2024` and later. Other POSIX classes are always based on Unicode.

#### `loose`

Supports all features of `default`, plus the following:

- Silences errors for unsupported uses of the search-start anchor `\G` (a flexible assertion that doesn’t have a direct equivalent in JavaScript).
- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using `loose` accuracy, if a `\G` assertion is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives.
- Enables use of POSIX classes `[:graph:]` and `[:print:]` using ASCII-based versions rather than the Unicode versions available for `ES2024` and later. Other POSIX classes are always Unicode-based.
</details>
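
As a rough illustration of the list above, the following self-contained sketch models how `accuracy` and `target` together decide whether a feature is allowed. The lookup table and the names `approximatedUpTo` and `assertSupported` are hypothetical and only mirror the README text; they are not the library's internals.

```js
// Hypothetical model of the README's `default` accuracy list; not the library's actual code.
// Each feature maps to the newest target year for which it is only approximated.
const approximatedUpTo = {
  graphemeCluster: 2025,        // `\X`: all targets
  recursion: 2025,              // e.g. `\g<0>`: all targets
  caseInsensitiveBackref: 2024, // `ES2024` and earlier
  asciiPosixGraphPrint: 2018,   // `ES2018` only
};

function assertSupported(feature, {accuracy = 'default', target = 'ES2025'} = {}) {
  // 'ES2024' -> 2024
  const year = Number(target.slice(2));
  const isApproximated = year <= approximatedUpTo[feature];
  if (isApproximated && accuracy === 'strict') {
    // `strict` refuses anything that can't be emulated with identical behavior
    throw new Error(`"${feature}" can't be emulated exactly for ${target}`);
  }
  return isApproximated ? 'approximated' : 'exact';
}
```

Under these assumptions, `assertSupported('caseInsensitiveBackref', {accuracy: 'strict', target: 'ES2024'})` throws, while the same call with `target: 'ES2025'` does not.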

### `avoidSubclass`
@@ -221,7 +211,9 @@ Using a high limit has a small impact on performance. Generally, this is only a

Advanced options that take precedence over standard error checking and flags.

- `allowOrphanBackrefs`: Useful with TextMate grammar processors that merge backreferences across `begin` and `end` patterns.
- `allowOrphanBackrefs`: Useful with TextMate grammars that merge backreferences across `begin` and `end` patterns.
- `allowAllSearchStartAnchors`: Silences errors for unsupported uses of the search-start anchor `\G`.
- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using this option, if a `\G` is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives, but is useful for non-critical matching like syntax highlighting when having some mismatches is better than not working.
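
The fallback described for `allowAllSearchStartAnchors` can be pictured with plain JavaScript. This is a minimal sketch that assumes a string-based rewrite; the helper `stripSearchStart` is hypothetical, and the library itself operates on an AST rather than on pattern strings.

```js
// Hypothetical stand-in for the fallback described above; not the library's implementation.
function stripSearchStart(pattern) {
  // Walk escapes pairwise so that `\\G` (escaped backslash + literal G) is left alone
  const source = pattern.replace(/\\([\s\S])/g, (m, ch) => (ch === 'G' ? '' : m));
  // The removed `\G` is replaced by the sticky flag
  return new RegExp(source, 'y');
}

const re = stripSearchStart(String.raw`a\G`);
// `re` now behaves like /a/y

re.lastIndex = 1;
const atOne = re.test('ba');  // sticky: only attempts a match at index 1

re.lastIndex = 0;
const atZero = re.test('ba'); // does not scan ahead from index 0
```

With flag `y`, a match must begin exactly at `lastIndex`. Because the original position of `\G` inside the pattern is discarded, the result can accept or reject strings differently from Oniguruma, which is the source of the false positives and negatives mentioned above.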

### `target`

@@ -938,18 +930,13 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu

The following features don't yet have any support, and throw errors. They're all uncommonly used, with most being *extremely* rare.

- ASCII mode for POSIX classes (flag <code>P</code>).
- Grapheme boundaries: <code>\y</code>, <code>\Y</code>.
- Grapheme boundary options (flags <code>y{g}</code>, <code>y{w}</code>).
- Whole-pattern options: don't capture <code>(?C)</code>, ignore-case is ASCII <code>(?I)</code>, find longest <code>(?L)</code>.
- Flags <code>P</code> (ASCII-only POSIX classes), <code>y{g}</code>/<code>y{w}</code> (grapheme boundary options).
- Whole-pattern modifiers: Don't capture <code>(?C)</code>, ignore-case is ASCII <code>(?I)</code>, find longest <code>(?L)</code>.
- Absent repeater <code>(?\~…)</code>, expression <code>(?\~|…|…)</code>, and range cutter <code>(?\~|…)</code>.
- Conditionals: <code>(?(…)…)</code>, <code>(?(…)…|…)</code>.
- If-then-else conditionals: <code>(?(…)…)</code>, <code>(?(…)…|…)</code>.
- Rarely used character specifiers: Non-A-Za-z with <code>\cx</code>, <code>\C-x</code>. Meta: <code>\M-x</code>, <code>\M-\C-x</code>. Bracketed octals: <code>\o{…}</code>. Octal UTF-8 encoded bytes (<code>\200</code>+).
- Code point sequences: <code>\x{H H …H}</code>, <code>\o{O O …O}</code>.
- Additional, extremely rare ways to specify characters.
- Non-A-Za-z with <code>\cx</code>, <code>\C-x</code>.
- Meta: <code>\M-x</code>, <code>\M-\C-x</code>.
- Octal code points: <code>\o{…}</code>.
- Octal UTF-8 encoded bytes (<code>\200</code>+).
- Callout functions: <code>(?{…})</code>, etc.

## ㊗️ Unicode / mixed case-sensitivity
5 changes: 3 additions & 2 deletions demo/demo.js
@@ -22,6 +22,7 @@ const state = {
maxRecursionDepth: getValue('option-maxRecursionDepth'),
overrides: {
allowOrphanBackrefs: getValue('option-allowOrphanBackrefs'),
allowAllSearchStartAnchors: getValue('option-allowAllSearchStartAnchors'),
},
target: getValue('option-target'),
verbose: getValue('option-verbose'),
@@ -116,7 +117,7 @@ function showTranspiled() {
}
ui.comparisonInfo.classList.remove('hidden');
const otherTargetAccuracyCombinations = ['ES2018', 'ES2024', 'ES2025'].flatMap(
t => ['loose', 'default', 'strict'].map(a => ({target: t, accuracy: a}))
t => ['default', 'strict'].map(a => ({target: t, accuracy: a}))
).filter(c => c.target !== options.target || c.accuracy !== options.accuracy);
const differents = [];
// Collect the different results, including differences in error status
@@ -136,7 +137,7 @@ function showTranspiled() {
}
}
// Compose and display message about differences or lack thereof
let str = 'Tested all 9 <code>target</code>/<code>accuracy</code> combinations.';
let str = 'Tested all 6 <code>target</code>/<code>accuracy</code> combinations.';
if (differents.length) {
const withError = [];
const withDiff = [];
24 changes: 16 additions & 8 deletions demo/index.html
@@ -65,9 +65,8 @@ <h2>Try it</h2>
</label>
<label>
<select id="option-accuracy" onchange="setOption('accuracy', this.value)">
<option value="strict">strict</option>
<option value="default" selected>default</option>
<option value="loose">loose</option>
<option value="strict">strict</option>
</select>
<code>accuracy</code>
<span class="tip tip-lg"><code>default</code> permits a few close approximations to support additional features</span>
@@ -111,16 +110,25 @@ <h2>Try it</h2>
<div>
<p>
<label>
<input type="checkbox" id="option-verbose" onchange="setOption('verbose', this.checked)">
<code>verbose</code>
<span class="tip tip-lg">Disables optimizations that simplify the pattern without changing the meaning</span>
<input type="checkbox" id="option-allowOrphanBackrefs" onchange="setOverride('allowOrphanBackrefs', this.checked)">
<code>allowOrphanBackrefs</code>
<span class="tip tip-xl">Useful with TextMate grammars that merge backrefs across <code>begin</code> and <code>end</code> patterns</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-allowOrphanBackrefs" onchange="setOverride('allowOrphanBackrefs', this.checked)">
<code>allowOrphanBackrefs</code>
<span class="tip tip-xl">Useful with TextMate grammar processors that merge backrefs across <code>begin</code> and <code>end</code> patterns</span>
<input type="checkbox" id="option-allowAllSearchStartAnchors" onchange="setOverride('allowAllSearchStartAnchors', this.checked)">
<code>allowAllSearchStartAnchors</code>
<span class="tip tip-lg">Silences errors for unsupported uses of <code>\G</code></span>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-verbose" onchange="setOption('verbose', this.checked)">
<code>verbose</code>
<span class="tip tip-lg">Disables optimizations that simplify the pattern without changing the meaning</span>
</label>
</p>
</div>
13 changes: 5 additions & 8 deletions spec/match-backreference.spec.js
@@ -437,20 +437,17 @@ describe('Backreference', () => {
pattern: r`(a)(?i)\1`,
minTestTarget: minTestTargetForFlagGroups,
});
// Throw with strict `accuracy` if `target` not ES2025
// Throw with strict `accuracy` if `target` below ES2025
['ES2018', 'ES2024'].forEach(target => {
expect(() => toDetails(r`(a)(?i)\1`, {
accuracy: 'strict',
target,
})).toThrow();
});
// Matches only the same case as the reffed case-sensitive group with other `accuracy` values
['default', 'loose'].forEach(accuracy => {
expect('aa').toExactlyMatch({
pattern: r`(a)(?i)\1`,
accuracy,
maxTestTarget: maxTestTargetForFlagGroups,
});
// With default `accuracy` and `target` below ES2025, matches only the same case as the reffed case-sensitive group
expect('aa').toExactlyMatch({
pattern: r`(a)(?i)\1`,
maxTestTarget: maxTestTargetForFlagGroups,
});
});
});
4 changes: 2 additions & 2 deletions spec/match-search-start.spec.js
@@ -128,14 +128,14 @@ describe('Assertion: Search start', () => {
expect(() => toDetails(r`(?=ab\G)`)).toThrow();
});

it('should allow unsupported forms if using loose accuracy', () => {
it('should allow unsupported forms if allowing all search start anchors', () => {
const patterns = [
r`a\G`,
r`\G|`,
];
patterns.forEach(pattern => {
expect(() => toDetails(pattern)).toThrow();
expect(toRegExp(pattern, {accuracy: 'loose'}).sticky).toBe(true);
expect(toRegExp(pattern, {overrides: {allowAllSearchStartAnchors: true}}).sticky).toBe(true);
});
});
});
2 changes: 2 additions & 0 deletions src/index.js
@@ -30,6 +30,7 @@ import {recursion} from 'regex-recursion';
maxRecursionDepth?: number | null;
overrides?: {
allowOrphanBackrefs?: boolean;
allowAllSearchStartAnchors?: boolean;
};
target?: keyof Target;
verbose?: boolean;
@@ -55,6 +56,7 @@ function toDetails(pattern, options) {
});
const regexAst = transform(onigurumaAst, {
accuracy: opts.accuracy,
allowAllSearchStartAnchors: opts.overrides.allowAllSearchStartAnchors,
avoidSubclass: opts.avoidSubclass,
bestEffortTarget: opts.target,
});
16 changes: 8 additions & 8 deletions src/options.js
@@ -1,22 +1,21 @@
import {envSupportsDuplicateNames, envSupportsFlagGroups, envSupportsFlagV} from './utils.js';

const Accuracy = /** @type {const} */ ({
strict: 'strict',
default: 'default',
loose: 'loose',
strict: 'strict',
});

const EsVersion = {
ES2018: 2018,
ES2024: 2024,
ES2025: 2025,
ES2024: 2024,
ES2018: 2018,
};

const Target = /** @type {const} */ ({
auto: 'auto',
ES2018: 'ES2018',
ES2024: 'ES2024',
ES2025: 'ES2025',
ES2024: 'ES2024',
ES2018: 'ES2018',
});

/**
@@ -54,9 +53,10 @@ function getOptions(options) {
...options,
// Advanced options that take precedence over standard error checking and flags.
overrides: {
// Useful with TextMate grammar processors that merge backreferences across `begin` and `end`
// patterns.
// Useful with TextMate grammars that merge backreferences across `begin` and `end` patterns.
allowOrphanBackrefs: false,
// Silences errors for unsupported uses of the search-start anchor `\G`.
allowAllSearchStartAnchors: false,
...(options?.overrides),
},
};
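
The nested spread above matters because a plain `...options` would replace the whole `overrides` object and drop any defaults the caller did not repeat. A minimal sketch of that merge follows, reduced to the `overrides` defaults only; the name `withOverrideDefaults` is hypothetical.

```js
// Hypothetical reduction of the merge shown in the hunk above; other options are not modeled.
function withOverrideDefaults(options) {
  return {
    ...options,
    overrides: {
      allowOrphanBackrefs: false,
      allowAllSearchStartAnchors: false,
      // Spreading `undefined` is a no-op, so callers may omit `overrides` entirely
      ...(options?.overrides),
    },
  };
}

// Setting one override keeps the default for the other
const opts = withOverrideDefaults({overrides: {allowAllSearchStartAnchors: true}});
```

Here `opts.overrides.allowAllSearchStartAnchors` is `true` while `opts.overrides.allowOrphanBackrefs` keeps its default of `false`.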
19 changes: 11 additions & 8 deletions src/transform.js
@@ -30,6 +30,7 @@ AST represents what's needed to precisely reproduce Oniguruma behavior using Reg
@param {import('./parse.js').OnigurumaAst} ast
@param {{
accuracy?: keyof Accuracy;
allowAllSearchStartAnchors?: boolean;
avoidSubclass?: boolean;
bestEffortTarget?: keyof Target;
}} [options]
@@ -40,10 +41,11 @@ function transform(ast, options) {
// A couple edge cases exist where options `accuracy` and `bestEffortTarget` are used:
// - `VariableLengthCharacterSet` kind `grapheme` (`\X`): An exact representation would require
// heavy Unicode data; a best-effort approximation requires knowing the target.
// - `CharacterSet` kind `posix` with values `graph` and `print`: Their complex exact
// representations are hard to change after the fact in the generator to a best-effort
// approximation based on the target, so produce the appropriate structure here.
// - `CharacterSet` kind `posix` with values `graph` and `print`: Their complex Unicode-based
// representations would be hard to change to ASCII-based after the fact in the generator
// based on `target`/`accuracy`, so produce the appropriate structure here.
accuracy: 'default',
allowAllSearchStartAnchors: false,
avoidSubclass: false,
bestEffortTarget: 'ES2025',
...options,
@@ -52,6 +54,7 @@ function transform(ast, options) {
const strategy = opts.avoidSubclass ? null : applySubclassStrategies(ast);
const firstPassState = {
accuracy: opts.accuracy,
allowAllSearchStartAnchors: opts.allowAllSearchStartAnchors,
flagDirectivesByAlt: new Map(),
minTargetEs2024: isMinTarget(opts.bestEffortTarget, 'ES2024'),
// Subroutines can appear before the groups they ref, so collect reffed nodes for a second pass
@@ -124,7 +127,7 @@ const FirstPassVisitor = {
},
},

Assertion({node, ast, remove, replaceWith}, {accuracy, supportedGNodes, wordIsAscii}) {
Assertion({node, ast, remove, replaceWith}, {allowAllSearchStartAnchors, supportedGNodes, wordIsAscii}) {
const {kind, negate} = node;
if (kind === AstAssertionKinds.line_end) {
// Onig's only line break char is line feed, unlike JS
@@ -133,7 +136,7 @@ const FirstPassVisitor = {
// Onig's only line break char is line feed, unlike JS
replaceWith(parseFragment(r`(?<=\A|\n)`));
} else if (kind === AstAssertionKinds.search_start) {
if (!supportedGNodes.has(node) && accuracy !== 'loose') {
if (!supportedGNodes.has(node) && !allowAllSearchStartAnchors) {
throw new Error(r`Uses "\G" in a way that's unsupported`);
}
ast.flags.sticky = true;
@@ -294,7 +297,7 @@ const FirstPassVisitor = {
!node.flags.enable && !node.flags.disable && delete node.flags;
},

Pattern({node}, {accuracy, supportedGNodes}) {
Pattern({node}, {allowAllSearchStartAnchors, supportedGNodes}) {
// For `\G` to be accurately emulatable using JS flag y, it must be at (and only at) the start
// of every top-level alternative (with complex rules for what determines being at the start).
// Additional `\G` error checking in `Assertion` visitor
@@ -312,10 +315,10 @@ const FirstPassVisitor = {
hasAltWithoutLeadG = true;
}
}
if (hasAltWithLeadG && hasAltWithoutLeadG && accuracy !== 'loose') {
if (hasAltWithLeadG && hasAltWithoutLeadG && !allowAllSearchStartAnchors) {
throw new Error(r`Uses "\G" in a way that's unsupported`);
}
// Supported `\G` nodes will be removed when traversed; others will error if not `loose`
// Supported `\G` nodes will be removed when traversed; others will error
leadingGs.forEach(g => supportedGNodes.add(g));
},
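
The rule enforced by the `Pattern` visitor (a leading `\G` must appear in every top-level alternative or in none) can be sketched in isolation. The version below is a deliberate simplification: each alternative is a hypothetical array of token strings, "leading" means the first token only, and `checkLeadingSearchStart` is not a function in the library.

```js
// Hypothetical simplification of the check above; the real rules for what counts as
// "at the start" of an alternative are more involved.
function checkLeadingSearchStart(alternatives, allowAllSearchStartAnchors = false) {
  const leads = alternatives.map(alt => alt[0] === '\\G');
  const hasAltWithLeadG = leads.some(Boolean);
  const hasAltWithoutLeadG = leads.some(lead => !lead);
  // A mix of alternatives with and without a leading `\G` can't be emulated with flag y alone
  if (hasAltWithLeadG && hasAltWithoutLeadG && !allowAllSearchStartAnchors) {
    throw new Error(String.raw`Uses "\G" in a way that's unsupported`);
  }
  // Reports whether the sticky flag would be needed
  return hasAltWithLeadG;
}
```

In this model, `[['\\G', 'a'], ['\\G', 'b']]` passes and `[['a'], ['b']]` passes, while `[['\\G', 'a'], ['b']]` throws unless the second argument is `true`.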

