Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate Generation of QueryParser to C# #268

Open
clambertus opened this issue Aug 16, 2019 · 6 comments
Open

Automate Generation of QueryParser to C# #268

clambertus opened this issue Aug 16, 2019 · 6 comments
Assignees
Labels
up-for-grabs This issue is open to be worked on by anyone
Milestone

Comments

@clambertus
Copy link

The Lucene team is using a tool called javacc to generate the main business logic behind the query parsers. If we had a similar tool it could help:

  • Speed up the process of porting/upgrading QueryParser
  • Reduce the number of bugs in these modules caused by doing it manually
  • Most importantly, QueryParser could potentially be generated without using exceptions for control flow

The javacc tool uses a configuration file as input and creates java code as output. Here are some examples of those configuration files:

This has not been fully researched, but there are at least 2 potential ways we could approach this:

  1. Find a similar tool to javacc in .NET that supports similar options that were used in javacc, and create a converter tool to change the javacc configuration into a configuration that the .NET tool supports.
  2. Do a direct port of javacc to C#, and fix its logic to use a more efficient control flow mechanism than exceptions (perhaps goto would be the most direct replacement).

It seems according to this document that using a port of javacc should be our first choice because of the performance benchmarks of the resultant code. And certainly that would eliminate the risk of having a .NET tool not support an option that we need either now or for some future version of Lucene.

JIRA link - [LUCENENET-620] created by nightowl888
@clambertus clambertus added Lucene.Net.QueryParser up-for-grabs This issue is open to be worked on by anyone labels May 5, 2020
@NightOwl888 NightOwl888 added this to the Future milestone May 7, 2020
@paulirwin paulirwin self-assigned this Oct 21, 2024
@paulirwin paulirwin modified the milestones: Future, 4.8.0, 4.8.0-beta00018 Oct 21, 2024
@paulirwin
Copy link
Contributor

I looked into this a bit last night, and the situation is less than ideal. To augment/update the original two possibilities in the original issue from 2019 with some other ideas (items 3+ below):

  1. Find a similar tool to javacc in .NET that supports similar options that were used in javacc, and create a converter tool to change the javacc configuration into a configuration that the .NET tool supports. This does not seem to exist, as far as I can tell.
  2. Do a direct port of javacc to C#, and fix its logic to use a more efficient control flow mechanism than exceptions (perhaps goto would be the most direct replacement). While possible, this is quite a large undertaking, as javacc even builds its own parser itself. I do not think this amount of effort is worth it, especially given that the queryparser jj files have only seen a few minor updates in the last 10ish years.
  3. Convert the javacc syntax to ANTLRv4, like I did with Expressions.JS in Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996. This is a significant undertaking though, as javacc syntax is not directly comparable to ANTLR. It's also fairly risky, as there might be subtle incompatibilities in the parsing that are not caught by our tests. It also has the potential for having worse performance (or better performance, who knows) - but we won't know until the work is done. And then it would be a divergent syntax that would be difficult to port future fixes to.
  4. An idea I had last night is that I could add a build task library to my JavaToCSharp project, and instead of trying to automate building the jj file (or porting to ANTLR), we could automate converting the generated Java code from Lucene into C#, treating Java as an intermediate language for this purpose. I tested that the 4.8 generated QueryParser.java file converts successfully, although it has a few warnings I'd need to look into. If we could automate this as part of the build via a msbuild task (like the Antlr4BuildTasks project), that could be a good middle ground. I'll spend a few mental cycles on seeing if this is feasible and worth doing. One possible downside of this is we'd lose possible opportunity for optimizations of the code, unless someone wants to contribute the ability to add rewriting plugins to JavaToCSharp.
  5. Leave it as-is and keep porting the generated code by hand (or converting one-off as needed with JavaToCSharp).

At this point, I don't know if it makes sense to try to attempt this for 4.8, and that porting work is already done. I think we can revisit this for the next post-4.8 release, so I'm moving this back to the Future milestone for now.

@paulirwin paulirwin modified the milestones: 4.8.0-beta00018, Future Oct 29, 2024
@NightOwl888
Copy link
Contributor

NightOwl888 commented Oct 29, 2024

At this point, I don't know if it makes sense to try to attempt this for 4.8, and that porting work is already done. I think we can revisit this for the next post-4.8 release, so I'm moving this back to the Future milestone for now.

@paulirwin - Actually, out of all of the packages, this one is the least "done".

  1. There are still many culture aware APIs that we are calling that we probably shouldn't be.
  2. There is no way to localize the error messages like there was in Lucene. We had that, but I ripped out the NLS error message resources long time ago. But I realized that some of those errors are being thrown from inside static methods the user cannot override, so this approach is incomplete. Maybe it is time to have a look at using localized satellite assemblies, which might be a better option in .NET.

Having a TryParse() could potentially allow users to localize error messages based on some error code rather than letting the Lucene.Net code throw (or bubble up) an exception.

While having no TryParse and having it throw exceptions to loop back to the prior position in a string are also pretty bad, query strings don't tend to be too long, so it is probably okay for this use case.

But fixing it so it doesn't throw exceptions just to loop back would be much easier if we generated the code for the query string grammar rather than hand porting it like this.

@paulirwin
Copy link
Contributor

But fixing it ... would be much easier if we generated the code for the query string grammar rather than hand porting it like this.

I wanted this to be feasible, as I think it's a good goal. And I wish that were "easier," but see my comment above. None of these automated approaches are particularly easy. The generated files are only ~750 lines of Java each, compared to how many thousands of lines and hours of porting javacc, or having ANTLR syntax that does not match upstream, or having to add rewriting and build task support to JavaToCSharp. javacc is not a trivial syntax to implement a generator for, unfortunately. I could envision easily spending hundreds of hours on this problem, just to save myself one or two fewer orders of magnitude by just updating the generated code by hand (including even getting it up to date as of 10.0 for the next release). This is compounded by the chicken-and-egg problem of needing to know what our desired generated code would look like with those exception and i18n changes you mention above, before we know how to write a generator for it, at which point we might as well hand-port the first version of that anyways, regardless of if we automate it or not.

Of note, the Lucene team commits their generated Java code to the repo, so it will be easy to do diffs to see what changed, if we want to continue to hand-port it in the future.

If we decide to update the generated code manually to factor in the requested changes above, I think we can split that off as its own issue, since that would be separate from "Automate Generation of QueryParser to C#." And that seems like a reasonable thing to include in 4.8.

@NightOwl888
Copy link
Contributor

The requested changes weren't part of the generated code, so we are talking about 2 different things. Given your input, I think that doing this at a later point makes sense if we even do it at all.

After working with the implementation of the BCL parsers, it seems like a far better approach to use spans for something like this. Basically, go back to requirements and change the whole approach on how it is done. But that is a lot of work and we should probably wait until later.

@rclabo
Copy link
Contributor

rclabo commented Oct 29, 2024

Regarding error message localization, honestly, in this day and age, if one really wanted them translated, they could catch the exception and then use an LLM to translate the message to the language(s) of their choice. That may be too pragmatic of a solution, but it would work. :-)

@NightOwl888
Copy link
Contributor

NightOwl888 commented Oct 29, 2024

Well, that is the problem. Catching the exceptions are required. The only way to display them to a user is to first catch them and then show the message. .NET parsers usually have a TryParse so you don't have to catch anything - you just put it into your decision logic. If there were some type of error code returned that the user could make sense of, the user could then hook into their own localization code rather than having it baked into the design in some obscure way.

But, it doesn't have that. I see your point though, Ron. As long as we keep doing things the wrong way and throwing the exception, there is not much point in doing the localization because the user has the chance to intervene (since they are forced to catch an exception anyway). But the exception is a problem carried over from Java where it is considered acceptable to catch an exception for control flow, which is not the case in .NET.

Even if we did a cheesy hack and caught the exceptions for the user just to give them a TryParse method, it would at least fix the API and take the responsibility of catching the exceptions away from them. We could then clean it up later without breaking the API. By that, I mean creating an extra method named TryParse and leaving Parse alone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
up-for-grabs This issue is open to be worked on by anyone
Projects
None yet
Development

No branches or pull requests

4 participants