Multiline Option with ^ and $ anchors #57

Open
kmalski opened this issue Jan 12, 2022 · 10 comments

Comments

kmalski commented Jan 12, 2022

Hi,

I am struggling to configure the Option passed to the search method correctly when using Syntax.ECMAScript. I would expect that, with Option.DEFAULT / Option.NONE, a regex that uses the ^ and $ anchors and contains no explicit newline would fail to match input containing a newline character. For example:

byte[] pattern = "^[a-z]{1,10}$".getBytes();
byte[] str = "a\nb".getBytes();

Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Matcher matcher = regex.matcher(str);
int result = matcher.search(0, str.length, Option.DEFAULT);

should result in -1 but currently results in 0. Even passing Option.SINGLELINE does not change it. What I did to make this work was to pass the negated Option.MULTILINE:

int result = matcher.search(0, str.length, -Option.MULTILINE);
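
A note on this workaround: for any single-bit flag, arithmetic negation does not clear the bit. In two's complement, the negated value keeps that bit set and also sets every higher bit, so the mask passed to search likely has many unrelated option bits turned on. The sketch below shows this with a hypothetical flag (`FLAG = 1 << 2` is an assumption for illustration and may not be Joni's actual constant); whether the extra bits explain the result reported here has not been verified.

```java
// Sketch: negating a bit flag is not the same as clearing it.
// FLAG is a hypothetical single-bit option; its value is an assumption.
class FlagNegationDemo {
    static final int FLAG = 1 << 2;

    // Clears FLAG from an option mask and leaves every other bit untouched.
    static int clear(int options) {
        return options & ~FLAG;
    }

    public static void main(String[] args) {
        int negated = -FLAG;
        // 30 ones followed by 00: FLAG and all higher bits are set.
        System.out.println(Integer.toBinaryString(negated));
        System.out.println((negated & FLAG) != 0);                 // true
        System.out.println(Integer.toBinaryString(clear(0b1111))); // 1011
    }
}
```

If the intent is to switch a flag off, masking with `& ~FLAG` is the usual idiom.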

I have tested this case with multiple online regex tools and with the JavaScript regex implementation in my browser, and this example always gives no match (as I expect). Only adding the multiline option gives a result similar to the one from the Joni library.

Setting the syntax to Java works as expected and gives a result similar to this snippet, which uses the built-in Java regex:

String pattern = "^[a-z]{1,10}$";
String str = "a\nb";

Pattern p = Pattern.compile(pattern);
java.util.regex.Matcher m = p.matcher(str);
boolean result = m.find();

Is the MULTILINE option the default for the library's ECMAScript syntax, and should it be? I was digging into ECMAScript and it looks like multiline = false is the default (the user has to explicitly pass the m flag).
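
For comparison, the expected behavior can be reproduced with `java.util.regex` alone, where multiline mode stays off unless `Pattern.MULTILINE` is passed. This is a minimal sketch of the semantics described above, not of Joni's internals; the class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagDemo {
    // Returns the start index of the first match, or -1 if there is none,
    // mirroring the search result convention used in this report.
    static int search(String regex, String input, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "a\nb";
        System.out.println(search(regex, input, 0));                 // -1
        System.out.println(search(regex, input, Pattern.MULTILINE)); // 0
    }
}
```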

Author

kmalski commented Jan 13, 2022

One more note, in this example

        byte[] pattern = "^[a-z]{1,10}$".getBytes();
        byte[] str = "ab\nab\n".getBytes();

        Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, -Option.MULTILINE);

the result is equal to 3. I think this should also be equal to -1 (not found).
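
Inputs that end with a newline are where engines differ even without multiline mode. In `java.util.regex`, `$` may also match just before a final line terminator, while `\z` matches only at the very end of input, which is how ECMAScript treats `$` when the `m` flag is absent. The sketch below (class and method names are made up) runs both forms on the input from this comment and on a shorter one:

```java
import java.util.regex.Pattern;

class StrictAnchorDemo {
    // True if the regex matches anywhere in the input, with no flags set.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String lenient = "^[a-z]{1,10}$";     // $ tolerates one final line terminator
        String strict = "\\A[a-z]{1,10}\\z";  // \z requires the true end of input

        System.out.println(found(lenient, "ab\nab\n")); // false
        System.out.println(found(strict, "ab\nab\n"));  // false
        System.out.println(found(lenient, "ab\n"));     // true
        System.out.println(found(strict, "ab\n"));      // false
        System.out.println(found(strict, "ab"));        // true
    }
}
```

Neither form matches "ab\nab\n", which is consistent with the expectation of -1 here.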

Member

headius commented Jan 18, 2022

I'm not familiar with the differences in the ECMAScript support in Joni but perhaps @lopex will have something more to say?

It might be worth us digging up some ECMAScript regex tests to verify whether this mode is working as it should.

Author

kmalski commented Jan 19, 2022

What I found are the official ECMAScript (ECMA-262) test cases, test262, but I did not find them really useful.

Much more readable are the V8 tests (V8 is the JavaScript engine of Chrome; search for files named .*regexp.*js). For example, there are test cases for the multiline flag.
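
Cases like these could be ported as a small table of (pattern, input, multiline, expected index) rows and run against any engine. The sketch below is only an assumption about how such a harness might look: `Engine`, `Case`, and the rows are hypothetical and written by hand from the behavior described in this issue, not copied from V8 or test262. `java.util.regex` stands in as the reference engine; a Joni-backed `Engine` would be plugged in the same way.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorCases {
    // Hypothetical engine abstraction: returns the first match index, or -1.
    interface Engine {
        int search(String regex, String input, boolean multiline);
    }

    // One hand-written expectation row.
    static final class Case {
        final String regex;
        final String input;
        final boolean multiline;
        final int expected;

        Case(String regex, String input, boolean multiline, int expected) {
            this.regex = regex;
            this.input = input;
            this.multiline = multiline;
            this.expected = expected;
        }
    }

    static final List<Case> CASES = List.of(
        new Case("^[a-z]{1,10}$", "a\nb", false, -1),
        new Case("^[a-z]{1,10}$", "a\nb", true, 0),
        new Case("^b", "a\nb", false, -1),
        new Case("^b", "a\nb", true, 2),
        new Case("a$", "a\nb", false, -1),
        new Case("a$", "a\nb", true, 0));

    // Reference engine backed by java.util.regex.
    static final Engine JDK = (regex, input, multiline) -> {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.start() : -1;
    };

    // Counts the rows for which the engine disagrees with the expectation.
    static int countFailures(Engine engine) {
        int failures = 0;
        for (Case c : CASES) {
            if (engine.search(c.regex, c.input, c.multiline) != c.expected) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures(JDK));
    }
}
```

The rows deliberately avoid inputs with a trailing newline, since `java.util.regex` and ECMAScript disagree on `$` there.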

Author

kmalski commented May 22, 2023

Hi, have you had a chance to look at this issue? I would like to revive this thread.

Contributor

lopex commented May 22, 2023

Maybe the syntax settings just need fixing?

Member

enebo commented May 23, 2023

@kmalski, @lopex and I realized the other week that the ECMA settings were made during the development of the now-dead DynJS project and were not sourced from oniguruma. So it could very well be that the Syntax for that mode is just not quite right. Not being JS devs, we don't really know.

Author

kmalski commented May 23, 2023

I have checked the oniguruma project and could not find a syntax for ECMA (I believe there is none). There are a lot of different options in this project; do you have any suggestions on the best approach to preparing a proper config for ECMA?

Member

enebo commented May 23, 2023

@kmalski I can see it is marked OP2_OPTION_PERL and that when it sees '^' it will set multi to true and single to false. I'm not completely sure of the direction here, but ECMA OR'ing with PERL gives a bunch of default option twiddling in Parser.parseEnclose (look for syntax.op2OptionPerl()).

Member

enebo commented May 23, 2023

@kmalski I think the long-term solution would be to remove OP2_OPTION_PERL from the ECMA Syntax, but this is more complicated since in Parser#parseEnclose we get a lot of behavior from it. As an intermediate step you can update case '^': to twiddle options by adding some logic for syntax.op3OptionECMAScript(). Notice it toggles 5 things. It appears 2 of those you do not want (e.g. your issue), but what about the other 3? I have no idea.
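
To make the intermediate step concrete, here is a rough sketch of the kind of branch being suggested. Everything in it is hypothetical: the flag values, the `set` helper, and `onCaret` are assumptions for illustration and are not taken from Joni's source. It models only the two toggles named in this thread (multiline on, single-line off) and leaves out the other three.

```java
// Hypothetical sketch only: flag values and method names are assumptions,
// not Joni source code.
class CaretOptionSketch {
    static final int MULTILINE = 1 << 0;   // assumed value
    static final int SINGLELINE = 1 << 1;  // assumed value

    // Sets or clears one flag in an option mask.
    static int set(int options, int flag, boolean on) {
        return on ? (options | flag) : (options & ~flag);
    }

    // Models the '^' handling described above: the Perl-style path turns
    // multiline on and single-line off, while the proposed ECMAScript path
    // leaves both flags exactly as they were.
    static int onCaret(int options, boolean ecmaScript) {
        if (ecmaScript) {
            return options;
        }
        options = set(options, MULTILINE, true);
        return set(options, SINGLELINE, false);
    }

    public static void main(String[] args) {
        int start = SINGLELINE;
        System.out.println(onCaret(start, false) == MULTILINE);  // true
        System.out.println(onCaret(start, true) == SINGLELINE);  // true
    }
}
```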

Member

enebo commented May 23, 2023

You could also just try removing OP2_OPTION_PERL and see whether anything breaks. I suspect something will, but _RUBY does not set it and they have many similar features.
