TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start #244

Open
sdheh opened this issue Jun 13, 2024 · 1 comment

sdheh commented Jun 13, 2024

Version 1.1.11
Example 1: greedy regex after an optional section:

String licenseText = "ab cd text";
String licenseTemplate = "<<beginOptional>>cd<<endOptional>> <<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(\Qcd\E\s*)?(.{0,5000})\Qtext\E\s*
start index found: 0

but the start index should be 3, the position of "cd" in the license text.
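The result can be reproduced with plain `java.util.regex` by compiling the start regex printed above, without going through the library. This is a minimal sketch: the class `GreedyStartDemo` and its helper are made up for illustration and are not part of the library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the library.
final class GreedyStartDemo {
    // The start regex printed in Example 1, copied verbatim.
    static final String START_REGEX =
            "(?im)(\\Qcd\\E\\s*)?(.{0,5000})\\Qtext\\E\\s*";

    // Returns {match start, group 1, group 2} for the first match,
    // or null when the regex does not match at all.
    static String[] describeFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { String.valueOf(m.start()), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] r = describeFirstMatch(START_REGEX, "ab cd text");
        System.out.println("start index: " + r[0]);
        System.out.println("optional group: " + r[1]);
        System.out.println("var group: \"" + r[2] + "\"");
    }
}
```

For "ab cd text" the optional group does not participate in the match (it is null) and the variable group captures "ab cd ", so the match is anchored at index 0 rather than at "cd".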

Example 2: greedy regex at the start:

String licenseText = "abtext";
String licenseTemplate = "<<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(.?{0,5000})\Qtext\E\s*
start index found: 1

but the start index should be 2, the position of "text".
.?{0,5000} doesn't seem to work as expected. It is an unusual regex that some online regex testers report as invalid (https://regex101.com/r/l3810b/1, regexr.com/81kfo), whereas https://www.freeformatter.com/java-regex-tester.html says the regular expression is valid.

I think one way to fix this would be to offer a method that returns a regex for finding the beginning of the non-optional part. Otherwise, changing the regular expressions in these two cases to something like the following could work:

(?im)((\Qcd\E\s*)(.{0,5000})\Qtext\E\s*)|(\Qtext\E\s*)
(?im)\Qtext\E\s*

In the first case, if there were multiple optional parts, it would get even more complicated to do this correctly.
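As a quick sanity check, the two suggested regexes above can be run against the license texts from Examples 1 and 2 using only `java.util.regex`. This is a sketch: the class `ProposedRegexCheck` and the helper `firstMatchStart` are made up for illustration, and the check covers only these two inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check class, not part of the library.
final class ProposedRegexCheck {
    // Suggested replacement for the Example 1 start regex.
    static final String PROPOSED_1 =
            "(?im)((\\Qcd\\E\\s*)(.{0,5000})\\Qtext\\E\\s*)|(\\Qtext\\E\\s*)";
    // Suggested replacement for the Example 2 start regex.
    static final String PROPOSED_2 = "(?im)\\Qtext\\E\\s*";

    // Returns the start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Example 1 license text: prints 3
        System.out.println(firstMatchStart(PROPOSED_1, "ab cd text"));
        // Example 2 license text: prints 2
        System.out.println(firstMatchStart(PROPOSED_2, "abtext"));
    }
}
```

Both calls return the start indices expected in the examples (3 and 2). It does not show how the approach would generalize to templates with several optional parts.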

Member

goneall commented Jun 15, 2024

Now that I have reviewed this, I tend to agree that it is a problem when the method is used without following on with the template matcher.

@sdheh - It looks like you have a pretty good handle on approaches to fix this. Can you create a pull request?
