Query parser support for wildcards in phrase queries #14168

Open
aliciavargas opened this issue Jan 23, 2025 · 1 comment

Description

The Lucene docs specify that wildcard search is supported only for single terms, not phrases (link). We'd like to support wildcard search in phrase queries in our Lucene-based service. In our research we came across PhraseWildcardQuery, which we see is marked as experimental.

PhraseWildcardQuery can support our needs, and we tested it in a proof of concept, but it is not used by any of the existing Lucene query parsers. We see two options for query parsing that supports phrase wildcards: (1) the StandardQueryParser could be modified to optionally allow phrases with wildcards and use PhraseWildcardQuery for them, or (2) there could be a new query parser that specifically supports phrases with wildcards.

So we have two questions:

  1. What would it take to make PhraseWildcardQuery fully supported (no longer marked experimental)?
  2. Can Lucene provide a query parser that utilizes the PhraseWildcardQuery?

For reference, this is how we added to the StandardQueryParser to support wildcards in phrases in our proof of concept:

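// Standard flexible query parser; analyzer and operator come from our service configuration.
luceneParser =
    new org.apache.lucene.queryparser.flexible.standard.StandardQueryParser(analyzer);
luceneParser.setDefaultOperator(operator);

// Register our builder so a PhraseWildcardQueryNode is turned into a PhraseWildcardQuery.
StandardQueryTreeBuilder builder = new StandardQueryTreeBuilder();
builder.setBuilder(PhraseWildcardQueryNode.class, new PhraseWildcardQueryNodeBuilder());
luceneParser.setQueryBuilder(builder);

// Append our processor to the end of the standard pipeline so it sees phrase nodes
// after the standard processors have run.
StandardQueryNodeProcessorPipeline processor =
    new StandardQueryNodeProcessorPipeline(luceneParser.getQueryConfigHandler());
processor.add(new PhraseWildcardQueryNodeProcessor(luceneParser.getQueryConfigHandler()));
luceneParser.setQueryNodeProcessor(processor);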

Using the following new components:

  • PhraseWildcardQueryNode - represents a query node for a phrase query with wildcards
  • PhraseWildcardQueryNodeProcessor - a query node processor that can be added to the end of the StandardQueryNodeProcessorPipeline. If it receives a phrase query node, it will go through its child nodes and build a new list of child nodes that are processed by the WildcardQueryNodeProcessor. If any of those new children have wildcards, it will replace the original phrase query node with a PhraseWildcardQueryNode with the new list of children.
    • This could be turned on or off by a config parameter
  • PhraseWildcardQueryNodeBuilder - a query builder that converts a PhraseWildcardQueryNode into a PhraseWildcardQuery using the PhraseWildcardQuery.Builder class. It iterates over the children of the PhraseWildcardQueryNode and for each child:
    • If it's a WildcardQueryNode, then we add it as a MultiTermQuery to the PhraseWildcardQuery.Builder
    • Else we add the single term to the PhraseWildcardQuery.Builder
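
To make the per-child decision in the builder concrete, here is a minimal JDK-only sketch of the classification step. It does not use Lucene types: `PhraseWildcardSketch` and `Part` are hypothetical stand-ins for the node classes above, and splitting on whitespace stands in for the analyzer. In the real builder, a part flagged as a wildcard would be wrapped in a MultiTermQuery (for example a WildcardQuery) and the others added as single terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the builder's per-child decision; not a Lucene class. */
final class PhraseWildcardSketch {

    /** One position of the phrase and whether it needs multi-term expansion. */
    static final class Part {
        final String text;
        final boolean wildcard;

        Part(String text, boolean wildcard) {
            this.text = text;
            this.wildcard = wildcard;
        }
    }

    /** Returns true if the token contains an unescaped '*' or '?'. */
    static boolean hasWildcard(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\\') {
                i++; // skip the character escaped by the backslash
            } else if (c == '*' || c == '?') {
                return true;
            }
        }
        return false;
    }

    /** Classifies each whitespace-separated token of the phrase, preserving order. */
    static List<Part> classify(String phrase) {
        List<Part> parts = new ArrayList<>();
        for (String token : phrase.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                parts.add(new Part(token, hasWildcard(token)));
            }
        }
        return parts;
    }
}
```

For the phrase `quick bro* fox`, only the middle part is flagged, so only that position would go through multi-term expansion.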

jpountz (Contributor) commented Jan 27, 2025

PhraseWildcardQuery is appealing, but its implementation makes trade-offs to work around the fact that it is inefficient when any of the wildcards expands to many terms. If you have a low-cardinality vocabulary, this is probably fine, but otherwise (e.g. English content), your queries may either be extremely costly if maxMultiTermExpansions is high, or miss matches (possibly all of them) if maxMultiTermExpansions is low. This makes me a bit uneasy about exposing it out of the box, as it could take users by surprise.
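
A toy model of that trade-off, as a JDK-only sketch: it expands a prefix wildcard against a sorted vocabulary and stops at a cap. This is a deliberately simplified assumption about how a budget like maxMultiTermExpansions behaves, not the actual PhraseWildcardQuery algorithm, and `ExpansionCapSketch` and `expandPrefix` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified model of a capped wildcard expansion; not the Lucene implementation. */
final class ExpansionCapSketch {

    /**
     * Expands a prefix wildcard such as "inter*" (passed here as "inter") against a
     * sorted vocabulary, keeping at most maxExpansions terms in sort order.
     */
    static List<String> expandPrefix(TreeSet<String> vocabulary, String prefix, int maxExpansions) {
        List<String> expanded = new ArrayList<>();
        // All terms sharing the prefix are contiguous, starting at the prefix itself.
        for (String term : vocabulary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || expanded.size() >= maxExpansions) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }
}
```

With the vocabulary {interest, interface, internal, internet, interval} and a cap of 2, the expansion of `inter*` stops after `interest` and `interface`, so in this model a document containing "internet protocol" would be missed by the phrase `inter* protocol`. Raising the cap recovers the match, at the price of evaluating every expanded term.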

For reference, there are other approaches to wildcard search that have different trade-offs. One is indexing (edge) n-grams, so that your wildcard expressions can actually be indexed and searched as simple terms (what Elasticsearch does when you configure index_prefixes on text fields). Another is indexing n-grams (with n=3 typically) for the whole input, using the n-grams to find a superset of the matches, and then verifying the wildcard phrase against the raw data of this superset of matches (what the Elasticsearch wildcard field does under the hood).
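
Both ideas can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not how Elasticsearch implements either feature: `NGramWildcardSketch` and its methods are hypothetical, a linear scan over an in-memory list stands in for an inverted-index lookup, backslash escapes are ignored, and the pattern must match the whole stored value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Toy versions of the two n-gram approaches; not Elasticsearch or Lucene code. */
final class NGramWildcardSketch {

    /** Edge n-grams: the prefixes of a term with lengths minChars..maxChars. */
    static List<String> edgeNGrams(String term, int minChars, int maxChars) {
        List<String> grams = new ArrayList<>();
        for (int n = minChars; n <= Math.min(maxChars, term.length()); n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    /** All character trigrams of the text. */
    static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    /** Step 1: keep docs containing every trigram of the pattern's literal runs (a superset). */
    static List<String> candidates(List<String> docs, String pattern) {
        Set<String> required = new HashSet<>();
        for (String literal : pattern.split("[*?]")) {
            required.addAll(trigrams(literal));
        }
        List<String> result = new ArrayList<>();
        for (String doc : docs) {
            if (trigrams(doc).containsAll(required)) {
                result.add(doc);
            }
        }
        return result;
    }

    /** Step 2: verify each candidate against the wildcard pattern on the raw value. */
    static List<String> search(List<String> docs, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern compiled = Pattern.compile(regex.toString());
        List<String> result = new ArrayList<>();
        for (String doc : candidates(docs, pattern)) {
            if (compiled.matcher(doc).matches()) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

With edge n-grams, indexing `brown` with lengths 2 to 4 also indexes `br`, `bro` and `brow`, so a prefix wildcard like `bro*` becomes a plain term lookup. With trigrams, the pattern `quick br* fox` over "quick brown fox", "quick brave fox", "slow brown fox" and "red fox quick brown" keeps three candidates, and the verification step then drops "red fox quick brown", which contains every required trigram but in the wrong order.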
