Skip to content
This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

lookaround support? #12

Open
wcaneira opened this issue Jun 24, 2020 · 6 comments
Open

lookaround support? #12

wcaneira opened this issue Jun 24, 2020 · 6 comments

Comments

@wcaneira
Copy link

Thank you very much for contributing this and regex.rip.

I was wondering if it supports lookarounds (e.g. negative lookahead)?

I get an error message Nfa.UnsupportedGroupingConstruct

image

@ConradIrwin
Copy link
Contributor

Not yet. I've started working on porting the logic to javascript (on the js branch) to make it easier to make changes like this, though it may be possible for someone who knows more Ocaml than me to add it to the original.

I think there's also an open research question which is what is the impact of negative lookahead on exponentially backtracking?

@UziTech
Copy link

UziTech commented Feb 4, 2021

Anything specific you need help on the js branch?

I am a mantainer of markedjs which heavily uses regular expressions and I have been looking for a way to reliably check for ReDoS in our CI environment. This looks very promising but we would need support for look arounds.

Also fixing whatever is wrong with this regexp

/^ {0,3}(`{3,}(?=[^`\n]*\n)|~{3,})([^\n]*)\n(?:|([\s\S]*?)\n)(?: {0,3}\1[~`]* *(?:\n+|$)|$)/

image

@ConradIrwin
Copy link
Contributor

@UziTech I've not been able to get to it for a while, so if you are enthusiastic please feel free to jump in (though as I was trying to figure out the theory based on the paper and the ocaml, it may not be the easiest place to start!). I'd also take PRs to the original OCAML if you want to give that a try (though running it in CI was the use-case I wanted for which javascript would be much better, though you're welcome to hit the regex.rip API from your CI)

As I mentioned above, there is still an open question that may make this change more involved than you hope: I do not know what impact look-around assertions have on the performance characteristics fo matching. I assume that:

  1. They don't make the performance of the full expression worse
  2. Except that they may themselves contain patterns that could be vulnerable to ReDoS
    But I'm not sure... it may be ok to assume the first of those and just strip out the assertions, and then run the assertion clauses through the checker separately.

One thing to note, is that v8 (the javascript engine in Chrome and chromium-based browsers) is fixing this problem properly: https://v8.dev/blog/non-backtracking-regexp but (sadly) only for regexes that don't include look-around. This is due to another open research question: can look-ahead and look-behind be implemented in a regex engine that does a single pass of the string and maintains only state in proportion to the size of the regex. (I'm pretty sure that positive and negative look-behind can be, but I haven't yet figured out a way to support look-ahead without storing more state, which breaks the runtime guarantees that re2 and thus v8 are aiming for)

@UziTech
Copy link

UziTech commented Feb 5, 2021

it may be ok to assume the first of those and just strip out the assertions, and then run the assertion clauses through the checker separately.

Just running them separately isn't going to work.

For example:

  • /(a*)*/ is not backtracking
  • and /b/ is not backtracking
  • but /(a*)*(?=b)/ is backtracking

I will have a look at this soon and see what I can find. Thanks

@ConradIrwin
Copy link
Contributor

Great counter-example, thank you!

@UziTech
Copy link

UziTech commented Feb 6, 2021

After a bit of research here is what I found:

  1. Look arounds are atomic so they can't backtrack once complete. (i.e. /(?=(a*)*)b/ does not backtrack)
  2. They are regular expressions so /(?=(a*)*b)/ is backtracking. (Believe me, I crashed this tab testing it in the console 🤣)
  3. Some languages reverse lookbehinds and the string so they don't need to be fixed width. (i.e. /(?<=b(a*)*)/ which is not backtracking becomes /(?=(a*)*b)/ which is backtracking and vice versa)
  4. As my example in the comment above shows look arounds can be a cause of backtracking in a non-backtracking regular expression. (i.e. /(a*)*(?=b)/)

So if we are going to check look arounds it seems we need to:

  1. Test them separately
  2. Turn them around (🤷‍♂️) and test them separately
  3. Test them as a way to start backtracking the previous regex

References:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants