Commit
Switch from allowBestEffort to 3 emulation modes
slevithan committed Nov 6, 2024
1 parent e199f30 commit d594d53
Showing 10 changed files with 154 additions and 111 deletions.
61 changes: 40 additions & 21 deletions README.md
@@ -13,7 +13,7 @@ Compared to running the actual [Oniguruma](https://github.com/kkos/oniguruma) C
### [Try the demo REPL](https://slevithan.github.io/oniguruma-to-es/demo/)

Oniguruma-To-ES deeply understands all of the hundreds of large and small differences in Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's *obsessive* about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have **exactly the same behavior**, even in extreme edge cases. And it's battle-tested on thousands of real-world Oniguruma regexes used in TextMate grammars (via the Shiki library). A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can disable the `allowBestEffort` option to throw for such patterns (see details below).
Oniguruma-To-ES deeply understands all of the hundreds of large and small differences in Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's *obsessive* about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have **exactly the same behavior**, even in extreme edge cases. And it's battle-tested on thousands of real-world Oniguruma regexes used in TextMate grammars (via the Shiki library). A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can set the `emulation` option to `strict` and throw for such patterns (see details below).

## 📜 Contents

@@ -83,7 +83,7 @@ A string with `i`, `m`, and `x` in any order (all optional).

```ts
type CompileOptions = {
allowBestEffort?: boolean;
emulation?: 'strict' | 'default' | 'loose';
global?: boolean;
hasIndices?: boolean;
maxRecursionDepth?: number | null;
@@ -139,63 +139,82 @@ function toRegexAst(

These options are shared by functions [`compile`](#compile) and [`toRegExp`](#toregexp).

### `allowBestEffort`
### `emulation`

Allows results that differ from Oniguruma in rare cases. If `false`, throws if the pattern can't be emulated with identical behavior for the given `target`.
One of `'strict'`, `'default'` *(default)*, or `'loose'`.

*Default: `true`.*
Sets the level of emulation strictness.

- **Strict:** Throw if the pattern can't be emulated with identical behavior (even in rare edge cases) for the given target.
- **Default:** The best choice in most cases. Permits a few close approximations of Oniguruma in order to support additional features.
- **Loose:** Useful for non-critical matching like syntax highlighting where having some mismatches is better than not working.

Each stricter emulation mode supports a subset of the patterns supported by less strict modes. If a given pattern doesn't produce an error in a particular emulation mode, its generated result is identical in all less strict modes (given the same `target`).
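
As a rough illustration of how the removed boolean option relates to the new modes, the sketch below maps a legacy `allowBestEffort` value to an `emulation` value and lists the modes that are less strict than a given one. The names `toEmulationMode`, `lessStrictModes`, and `LegacyOptions` are hypothetical and not part of Oniguruma-To-ES, and the mapping (`false` to `'strict'`, anything else to `'default'`) is an assumption based on the wording of this README change.

```ts
// Hypothetical helper types for illustration; not part of Oniguruma-To-ES.
type EmulationMode = 'strict' | 'default' | 'loose';
type LegacyOptions = {allowBestEffort?: boolean};

// Modes ordered from most strict to least strict.
const MODES: readonly EmulationMode[] = ['strict', 'default', 'loose'];

// Assumed mapping: the old default (`true` or unset) becomes 'default',
// and an explicit `false` becomes 'strict'.
function toEmulationMode(legacy: LegacyOptions): EmulationMode {
  return legacy.allowBestEffort === false ? 'strict' : 'default';
}

// Lists the modes that are less strict than `mode`. Per the README text,
// a pattern that doesn't error in `mode` should not error in any of these.
function lessStrictModes(mode: EmulationMode): EmulationMode[] {
  return MODES.slice(MODES.indexOf(mode) + 1);
}
```

Nothing in this commit maps to `'loose'` from the old option, so the sketch never returns it; that mode would have to be opted into explicitly.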

<details>
<summary>More details</summary>

Specifically, this option enables the following additional features, depending on `target`:
#### `default` mode

Supports all features of `strict` mode, plus the following additional features, depending on `target`:

- All targets (`ESNext` and earlier):
- Enables use of `\X` using a close approximation of a Unicode extended grapheme cluster.
- Enables recursion (e.g. via `\g<0>`) using a depth limit specified via option `maxRecursionDepth`.
- Enables recursion (e.g. via `\g<0>`) with a depth limit specified by option `maxRecursionDepth`.
- `ES2024` and earlier:
- Enables use of case-insensitive backreferences to case-sensitive groups.
- `ES2018`:
- Enables use of POSIX classes `[:graph:]` and `[:print:]` using ASCII-based versions rather than the Unicode versions available for `ES2024` and later. Other POSIX classes are always based on Unicode.

#### `loose` mode

Supports all features of `default`, plus the following:

- Silences errors for unsupported uses of the search-start anchor `\G` (a flexible assertion that doesn’t have a direct equivalent in JavaScript).
- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using `loose` mode, if a `\G` assertion is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives.
</details>

### `global`

Include JavaScript flag `g` (`global`) in the result.

*Default: `false`.*

### `hasIndices`
Include JavaScript flag `g` (`global`) in the result.

Include JavaScript flag `d` (`hasIndices`) in the result.
### `hasIndices`

*Default: `false`.*

### `maxRecursionDepth`
Include JavaScript flag `d` (`hasIndices`) in the result.

If `null`, any use of recursion throws. If an integer between `2` and `100` (and `allowBestEffort` is `true`), common recursion forms are supported and recurse up to the specified max depth.
### `maxRecursionDepth`

*Default: `6`.*

If an integer between `2` and `100`, common recursion forms are supported and recurse up to the specified depth limit. If set to `null`, any use of recursion results in an error.

Since recursion isn't infinite-depth like in Oniguruma, use of recursion also results in an error if the `emulation` option is set to `'strict'`.
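
The interaction between `maxRecursionDepth` and `emulation` described above can be sketched as a small check. This is not the library's implementation: `resolveRecursionDepth` and `RecursionOptions` are hypothetical names, and throwing for values outside `2` to `100` is an assumption, since the README only states which values are supported.

```ts
// Hypothetical types and helper for illustration; not part of Oniguruma-To-ES.
type EmulationMode = 'strict' | 'default' | 'loose';
type RecursionOptions = {
  emulation?: EmulationMode;
  maxRecursionDepth?: number | null;
};

// Returns the depth limit to use for a pattern that contains recursion,
// or throws if recursion isn't permitted under the given options.
function resolveRecursionDepth(opts: RecursionOptions = {}): number {
  const emulation = opts.emulation ?? 'default';
  // The README gives 6 as the default; an explicit `null` disables recursion.
  const depth = opts.maxRecursionDepth === undefined ? 6 : opts.maxRecursionDepth;
  if (depth === null) {
    throw new Error('Recursion is disabled (maxRecursionDepth is null)');
  }
  if (emulation === 'strict') {
    throw new Error('Bounded recursion is not allowed with strict emulation');
  }
  // Assumption: values outside the documented range are rejected.
  if (!Number.isInteger(depth) || depth < 2 || depth > 100) {
    throw new Error('maxRecursionDepth must be null or an integer from 2 to 100');
  }
  return depth;
}
```

The check only matters for patterns that actually use recursion; the spec file in this commit shows that an empty pattern compiles without error under either restriction.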

<details>
<summary>More details</summary>

Using a high limit is not a problem if needed. Although there can be a performance cost (minor unless it's exacerbating an existing issue with runaway backtracking), there is no effect on regexes that don't use recursion.
Using a high limit has a (usually tiny) impact on transpilation and regex performance. Generally, this is only a problem if the regex has an existing issue with runaway backtracking that recursion exacerbates.

Higher limits have no effect on regexes that don't use recursion, so you should feel free to increase this if helpful.
</details>

### `optimize`

Simplify the generated pattern when it doesn't change the meaning.

*Default: `true`.*

### `target`
Simplify the generated pattern when it doesn't change the meaning.

Sets the JavaScript language version for generated patterns and flags. Later targets allow faster processing, simpler generated source, and support for additional features.
### `target`

*Default: `'ES2024'`.*

<details open>
Sets the JavaScript language version for generated patterns and flags. Later targets allow faster processing, simpler generated source, and support for additional features.

<details>
<summary>More details</summary>

- `ES2018`: Uses JS flag `u`.
@@ -887,10 +906,10 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu

1. Target `ES2018` doesn't allow Unicode property names added in JavaScript specifications after ES2018 to be used.
2. Unicode blocks are easily emulatable but their character data would significantly increase library weight. They're also a deeply flawed and arguably-unuseful feature, given the ability to use Unicode scripts and other properties.
3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` are an error if option `allowBestEffort` is `false`, and they use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later.
3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` are an error if option `emulation` is `'strict'`, and they use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later.
4. Target `ES2018` doesn't support nested *negated* character classes.
5. It's not an error for *numbered* backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
6. The maximum recursion depth is specified by option `maxRecursionDepth`. Use of recursion results in an error if `maxRecursionDepth` is `null` or `allowBestEffort` is `false`. Some forms of recursion (recursion with backreferences, and multiple recursions in the same pattern) aren't yet supported. Note that, because recursion is bounded, patterns that fail due to infinite recursion in Oniguruma might find a match in Oniguruma-To-ES. Future versions will detect this and throw an error.
6. The recursion depth limit is specified by option `maxRecursionDepth`. Some forms of recursion (multiple recursions in the same pattern, and recursion with backreferences) aren't yet supported. Patterns that would error in Oniguruma due to triggering infinite recursion might find a match in Oniguruma-To-ES since recursion is bounded (future versions will detect this and error at transpilation time).

## ㊗️ Unicode / mixed case-sensitivity

42 changes: 26 additions & 16 deletions demo/demo.css
@@ -20,7 +20,7 @@ main {
border-radius: 0 0 15px 15px;
}

h1, h2, ul, p, pre, details, summary {
h1, h2, ul, p, pre, summary {
margin-bottom: 12px;
}

@@ -37,20 +37,28 @@ code {
background-color: #f6f6f6;
}

kbd {
padding: 0 3px;
}

small {
font-size: 0.8em;
}

td {
padding-right: 3vw;
.hidden {
display: none;
}

summary {
cursor: pointer;
}

label, .label {
margin-right: 0.4em;
label {
margin-right: 0.5em;
}

label img {
vertical-align: middle;
}

input[type='checkbox'] {
@@ -62,20 +70,18 @@ input[type='checkbox'] {
}

input[type='number'] {
width: 3.5em;
padding: 3px;
font-size: 0.9em;
border: 1px solid #bbb;
height: 1.6em;
border-radius: 4px;
padding-left: 4px;
width: 3.5em;
}

select {
padding: 4px 35px 4px 10px;
padding: 3px;
font-size: 0.9em;
border: 1px solid #bbb;
border-radius: 4px;
appearance: none;
background: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Unofficial_JavaScript_logo_2.svg) 96% / 15% no-repeat #f6f6f6;
}

textarea {
@@ -92,12 +98,20 @@ textarea:focus {
box-shadow: 0 0 8px #80c0ff;
}

pre, code, textarea {
pre, code, kbd, textarea {
font-family: Consolas, "Source Code Pro", Monospace;
font-size: 0.9em;
border-radius: 0.375em;
}

#more-options {
display: flex;
}

#more-options div {
margin-right: 3%;
}

#output, textarea {
padding: 0.6em;
white-space: pre-wrap;
@@ -133,7 +147,3 @@ pre, code, textarea {
margin-top: -12px;
padding: 0.6em;
}

.hidden {
display: none;
}
8 changes: 4 additions & 4 deletions demo/demo.js
@@ -5,11 +5,11 @@ const state = {
x: getValue('flag-x'),
},
opts: {
allowBestEffort: getValue('option-allow-best-effort'),
allowSubclassBasedEmulation: getValue('option-subclass'),
allowSubclassBasedEmulation: getValue('option-allowSubclassBasedEmulation'),
emulation: getValue('option-emulation'),
global: getValue('option-global'),
hasIndices: getValue('option-has-indices'),
maxRecursionDepth: getValue('option-max-recursion-depth'),
hasIndices: getValue('option-hasIndices'),
maxRecursionDepth: getValue('option-maxRecursionDepth'),
optimize: getValue('option-optimize'),
target: getValue('option-target'),
},
89 changes: 51 additions & 38 deletions demo/index.html
@@ -19,75 +19,88 @@ <h1>
<h2>Try it</h2>
<p><textarea id="input" spellcheck="false" oninput="autoGrow(this); showOutput(this)"></textarea></p>
<p>
<b class="label">Flags:</b>
<label><code>flags</code></label>
<label>
<input type="checkbox" id="flag-i" onchange="setFlag('i', this.checked)">
<code>i</code>
<kbd>i</kbd>
</label>
<label>
<input type="checkbox" id="flag-m" onchange="setFlag('m', this.checked)">
<code>m</code> <small>(JS flag <code>s</code>)</small>
<kbd>m</kbd> <small>(JS flag <kbd>s</kbd>)</small>
</label>
<label>
<input type="checkbox" id="flag-x" onchange="setFlag('x', this.checked)">
<code>x</code>
<kbd>x</kbd>
</label>
</p>
<p>
<b class="label"><code>target</code>:</b>
<select id="option-target" onchange="setOption('target', this.value)">
<option value="ES2018">ES2018</option>
<option value="ES2024" selected>ES2024</option>
<option value="ESNext">ESNext</option>
</select>
<label>
<code>target</code>
<select id="option-target" onchange="setOption('target', this.value)">
<option value="ES2018">ES2018</option>
<option value="ES2024" selected>ES2024</option>
<option value="ESNext">ESNext</option>
</select>
<img src="https://upload.wikimedia.org/wikipedia/commons/9/99/Unofficial_JavaScript_logo_2.svg" width="15" height="15">
</label>
<label>
<code>emulation</code>
<select id="option-emulation" onchange="setOption('emulation', this.value)">
<option value="strict">strict</option>
<option value="default" selected>default</option>
<option value="loose">loose</option>
</select>
</label>
</p>
<details>
<summary>More options</summary>
<table id="more-options">
<tr>
<td>
<label>
<input type="checkbox" id="option-allow-best-effort" checked onchange="setOption('allowBestEffort', this.checked)">
<code>allowBestEffort</code>
</label>
</td>
<td>
<section id="more-options">
<div>
<p>
<label>
<input type="checkbox" id="option-global" onchange="setOption('global', this.checked)">
<code>global</code>
</label>
</td>
<td>
</p>
<p>
<label>
<input type="number" id="option-max-recursion-depth" value="6" min="2" max="100" onchange="setOption('maxRecursionDepth', this.value)" onkeyup="setOption('maxRecursionDepth', this.value)">
<code>maxRecursionDepth</code>
<input type="checkbox" id="option-hasIndices" onchange="setOption('hasIndices', this.checked)">
<code>hasIndices</code>
</label>
</td>
</tr>
<tr>
<td>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-subclass" onchange="setOption('allowSubclassBasedEmulation', this.checked)">
<input type="checkbox" id="option-allowSubclassBasedEmulation" onchange="setOption('allowSubclassBasedEmulation', this.checked)">
<code>allowSubclassBasedEmulation</code>
</label>
</td>
<td>
</p>
<p>
<label>
<input type="checkbox" id="option-has-indices" onchange="setOption('hasIndices', this.checked)">
<code>hasIndices</code>
<input type="number" id="option-maxRecursionDepth" value="6" min="2" max="100" onchange="setOption('maxRecursionDepth', this.value)" onkeyup="setOption('maxRecursionDepth', this.value)">
<code>maxRecursionDepth</code>
</label>
</td>
<td>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-optimize" checked onchange="setOption('optimize', this.checked)">
<code>optimize</code>
</label>
</td>
</tr>
</table>
</p>
<p>
<label>
<input type="checkbox" id="option-tmGrammar" onchange="setOption('tmGrammar', this.checked)">
<code>tmGrammar</code>
</label>
</p>
</div>
</section>
</details>
<pre id="output"></pre>
<div id="info" class="hidden"><p>This regex is emulated through the combination of changes in the pattern and the use of a <code>RegExp</code> subclass with custom logic.</p></div>
<div id="info" class="hidden"><p>A <code>RegExp</code> subclass instance (with a custom execution strategy) is returned for this pattern. It remains a native JavaScript regex and works the same as <code>RegExp</code> in all contexts.</p></div>
<p>The output shows the result of calling <code>toRegExp</code>. Oniguruma-To-ES includes functions to generate additional formats: <code>compile</code>, <code>toOnigurumaAst</code>, and <code>toRegexAst</code> (for an AST based on <a href="https://github.com/slevithan/regex"><code>regex</code></a>). You can run all of these from the console on this page, and you can pretty-print AST results by passing them to <code>printAst</code>.</p>
</main>

2 changes: 1 addition & 1 deletion spec/match-backreference.spec.js
@@ -8,7 +8,7 @@ beforeEach(() => {
});

describe('Backreference', () => {
// TODO: Test that case-insensitive backref to case-sensitive group requires allowBestEffort or ESNext
// TODO: Test that case-insensitive backref to case-sensitive group requires `ESNext` or non-`strict` emulation

describe('numbered backref', () => {
it('should rematch the captured text', () => {
8 changes: 4 additions & 4 deletions spec/match-recursion.spec.js
@@ -7,12 +7,12 @@ beforeEach(() => {
});

describe('Recursion', () => {
it('should throw if recursion used with allowBestEffort false', () => {
expect(() => compile(r`a\g<0>?`, '', {allowBestEffort: false})).toThrow();
expect(() => compile('', '', {allowBestEffort: false})).not.toThrow();
it('should throw if recursion used with strict emulation', () => {
expect(() => compile(r`a\g<0>?`, '', {emulation: 'strict'})).toThrow();
expect(() => compile('', '', {emulation: 'strict'})).not.toThrow();
});

it('should throw if recursion used with maxRecursionDepth null', () => {
it('should throw if recursion used with null maxRecursionDepth', () => {
expect(() => compile(r`a\g<0>?`, '', {maxRecursionDepth: null})).toThrow();
expect(() => compile('', '', {maxRecursionDepth: null})).not.toThrow();
});