[Feature] Use library regex to offer extended features #97

Closed
slevithan opened this issue Aug 12, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@slevithan

slevithan commented Aug 12, 2024

Thoughts on using the regex library under the hood?

This would allow offering some unique/powerful features, including atomic groups and subroutines.

No worries if you think it's not the right fit. But otherwise, this could show up, for example, as:

  • subroutine(name)
    • Or you could use your own name for it like refSubpattern(name).
    • You could go all out and offer a way to create subroutine definition groups, and then use their patterns by reference.
  • atomic(…)
    • atomic could also be a boolean option for choiceOf(…).
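For illustration only, here is a minimal sketch of what an atomic(…) construct could emit if it were emulated in native JS with the common lookahead-plus-backreference technique. The helper name, the generated group names, and the string-based API are assumptions made for this example; this is not how TS Regex Builder or the regex library is implemented.

```typescript
// Hypothetical helper: emulates the atomic group (?>...) in native JS.
// A lookahead captures what `source` matches; JS never backtracks into a
// completed lookahead, and the backreference then consumes the captured text.
let atomicCounter = 0;

function atomic(source: string): string {
  // Unique group name (an assumption of this sketch) so helpers can be nested or repeated.
  const name = `atomic${atomicCounter++}`;
  return `(?=(?<${name}>${source}))\\k<${name}>`;
}

// Without atomicity, `a+` can give back one "a" so the trailing `a` matches.
const plain = new RegExp('^a+a$');
// With the emulated atomic group, `a+` keeps everything it matched.
const locked = new RegExp(`^${atomic('a+')}a$`);

console.log(plain.test('aaa'));  // true
console.log(locked.test('aaa')); // false
```

One caveat of this sketch: it adds a capturing group, so numbered backreferences elsewhere in the pattern would shift.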

Additionally, since regex supports possessive quantifiers (from PCRE, Perl, Java, Ruby, Python, etc.), you could easily offer them. E.g., all the quantifier functions could have their options changed to replace the greedy property with type (with values: 'greedy', 'lazy' (or 'nongreedy' if you prefer), and 'possessive'). You could alternatively include a new possessive boolean option in addition to greedy, but I wouldn't recommend that since there is no precedent in existing regex flavors for lazy+possessive quantifiers (for good reason, since this would effectively mean to just always use the lower bound of any quantifier).
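A rough sketch of the type option described above, using made-up names (QuantifierType, zeroOrMoreSource) rather than TS Regex Builder's real API. It only builds the pattern text; the possessive form uses the PCRE-style suffix, which native JS RegExp does not accept, so it would still need to be transpiled (for example by the regex library).

```typescript
// Hypothetical option shape; a single union instead of separate booleans,
// so a lazy + possessive combination cannot be expressed.
type QuantifierType = 'greedy' | 'lazy' | 'possessive';

// Suffix appended after the base quantifier in flavors that support all three.
const QUANTIFIER_SUFFIX: Record<QuantifierType, string> = {
  greedy: '',
  lazy: '?',
  possessive: '+',
};

// Illustrative stand-in for a quantifier function; returns pattern source text.
function zeroOrMoreSource(source: string, type: QuantifierType = 'greedy'): string {
  return `(?:${source})*${QUANTIFIER_SUFFIX[type]}`;
}

console.log(zeroOrMoreSource('ab'));               // (?:ab)*
console.log(zeroOrMoreSource('ab', 'lazy'));       // (?:ab)*?
console.log(zeroOrMoreSource('ab', 'possessive')); // (?:ab)*+
```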

With the addition of atomic groups and/or possessive quantifiers, you could rightly describe TS Regex Builder as a great way to avoid ReDoS / catastrophic backtracking.

Introducing regex would also mean being able to improve any regexes within TS Regex Builder's source for readability, etc., and would be particularly beneficial if you start offering a library of common patterns (#73). In source, you'd get the full benefits of the regex library including free spacing and comments, context-aware interpolation, etc.

@slevithan added the enhancement label on Aug 12, 2024
@mdjastrzebski
Member

Hi @slevithan, sorry for the late reply. Could you explain in simple terms the concepts you are proposing? What are the corresponding JS regex patterns for subroutine and atomic?

@slevithan
Author

> What are the corresponding JS regex patterns

Here's the syntax for these features in the regex flavors that support them (PCRE, Perl, etc., as well as the regex library, which adds support for them to native JS):

  • Atomic groups: (?>...).
  • Possessive quantifiers: ?+, *+, ++, {n}+, {n,}+, {n,m}+.
  • Subroutines: \g<name>, where name refers to a named group.

These are powerful features not supported by native JS regexes, except when using the regex library. I can't directly/fully show what they're transpiled to for JS regexes because these are nontrivial features whose translation depends on context. But you can see results by playing with patterns and seeing transpiled output in regex's Babel plugin demo REPL.
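To give a feel for what a subroutine means, the sketch below naively expands each \g<name> into a non-capturing copy of the named group's pattern. The function name is hypothetical, and this is a deliberately simplified textual substitution under stated assumptions (no nested parentheses inside the named group, no recursion); it is not how the regex library performs its context-dependent transpilation.

```typescript
// Hypothetical, simplified expansion of \g<name> subroutine calls.
// Assumes named groups contain no nested or escaped parentheses.
function expandSubroutines(source: string): string {
  // Collect the pattern text of each named group (?<name>...).
  const definitions = new Map<string, string>();
  for (const m of source.matchAll(/\(\?<([A-Za-z_]\w*)>([^()]*)\)/g)) {
    definitions.set(m[1], m[2]);
  }
  // Replace each \g<name> with a non-capturing copy of that pattern.
  return source.replace(/\\g<([A-Za-z_]\w*)>/g, (_match: string, name: string) => {
    const body = definitions.get(name);
    if (body === undefined) throw new Error(`Unknown group: ${name}`);
    return `(?:${body})`;
  });
}

// The byte pattern is written once and reused three more times by reference.
const ipv4Source = expandSubroutines(
  String.raw`^(?<byte>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.\g<byte>){3}$`
);
const ipv4 = new RegExp(ipv4Source);

console.log(ipv4.test('192.168.0.1')); // true
console.log(ipv4.test('256.1.1.1'));   // false
```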

> Could you explain in simple terms the concepts you are proposing?

I'd recommend reading the corresponding sections in the regex documentation where I explain them with examples. See: atomic groups, possessive quantifiers, subroutines. Atomic groups and possessive quantifiers are primarily used for performance and to avoid runaway backtracking. Subroutines are primarily about reusing subpatterns and building up complex patterns through composition.

I'd be happy to answer any further questions to help clarify!

@mdjastrzebski
Member

OK, so to clarify: your idea is to implement these more advanced regex features and integrate them as regex construct functions (e.g. atomic(...)) or as options to existing functions (like zeroOrMore(..., { mode: 'possessive' }))?

Regarding particular features:

  • atomic sounds interesting; the Swift Regex Builder we are modelled after has the same (?) feature under the Local function for perf optimization
  • possessive quantifiers also sound interesting; they might be useful when hitting perf issues
  • subroutines: when it comes to composability, I think we have covered that with being able to re-use pattern fragments in a more readable way (not sure about perf differences here).

I am against taking dependencies on other packages, as that would significantly increase bundle size. Regex Builder is designed to have a minimal (reasonable) bundle size, so that it is feasible to use in web apps with minimal perf impact. That's why it's fully tree-shakable, etc. There is a trade-off between a small bundle size and more advanced features. Given that trade-off, I would rather focus on the 80% of users who use the most common 80% of features than on being the most comprehensive regex library out there.

That being said, it would be possible to add these more advanced features in the following ways:

  • import (some or all of) them directly into TS Regex Builder, so they are tree-shakable and do not impact bundle size when not used
  • provide a separate package like regex/ts-regex-builder or ts-regex-builder-advanced so that the 20% of more advanced users could opt in to these features.

@slevithan wdyt?

@slevithan
Author

slevithan commented Sep 8, 2024

regex is also concerned about bundle size, so it’s reasonably small. But no worries at all if that nevertheless makes it not the right fit, or means that it would have to be relegated to a ts-regex-builder-advanced.

regex doesn’t currently offer exports of its internals that can do rewrites for only specific features (although it does offer an options API for controlling which features are applied), because that would impose different tradeoffs. regex’s extended syntax and implicit flags, due to the complexity of emulating them, work best when they can depend on being composed in the right sequence, share certain data, and rely on not being transpiled in isolation (forward and backward context is needed).

My subjective opinion though is that it’s possible to overly focus on size in a library like this. Many people significantly concerned about bundle size would likely skip ts-regex-builder entirely or pre-run their regexes through it and copy/paste the output into their code. So it might be more common for this library to be used in Node.js, build steps, and other situations where bundle size is less critical.

> the Swift Regex Builder we are modelled after has the same (?) feature under local function for perf optimization

Yes, based on the linked docs page, Local creates an atomic group under the hood.

In any case, feel free to close this if you don't think this is something you'll pursue.

@mdjastrzebski closed this as not planned on Nov 8, 2024