Query parser support for wildcards in phrase queries #14168

aliciavargas · 2025-01-23T20:33:32Z

Description

The Lucene docs specify that wildcard search is only supported for single terms but not phrases (link). We’d like to support wildcard search in phrase queries in our Lucene-based service and in our research we came across the PhraseWildcardQuery, which we see is marked as experimental.

The PhraseWildcardQuery can support our needs, and we tested it out in a proof of concept, but it is not actually used by any of the existing Lucene query parsers. We see two options for query parsing that supports phrase wildcards: (1) the StandardQueryParser could be modified to optionally allow phrases with wildcards and use the PhraseWildcardQuery for them, or (2) there could be a new query parser that specifically supports phrases with wildcards.

So we have two questions:

What would it take to make PhraseWildcardQuery fully supported (no longer marked experimental)?
Can Lucene provide a query parser that utilizes the PhraseWildcardQuery?

For reference, this is how we added to the StandardQueryParser to support wildcards in phrases in our proof of concept:

luceneParser =
    new org.apache.lucene.queryparser.flexible.standard.StandardQueryParser(analyzer);
luceneParser.setDefaultOperator(operator);

// Register our builder so PhraseWildcardQueryNode instances become PhraseWildcardQuery objects.
StandardQueryTreeBuilder builder = new StandardQueryTreeBuilder();
builder.setBuilder(PhraseWildcardQueryNode.class, new PhraseWildcardQueryNodeBuilder());
luceneParser.setQueryBuilder(builder);

// Append our processor to the end of the standard processor pipeline.
StandardQueryNodeProcessorPipeline processor =
    new StandardQueryNodeProcessorPipeline(luceneParser.getQueryConfigHandler());
processor.add(new PhraseWildcardQueryNodeProcessor(luceneParser.getQueryConfigHandler()));
luceneParser.setQueryNodeProcessor(processor);

Using the following new components:

- PhraseWildcardQueryNode: a query node that represents a phrase query with wildcards.
- PhraseWildcardQueryNodeProcessor: a query node processor that can be added to the end of the StandardQueryNodeProcessorPipeline. If it receives a phrase query node, it goes through the child nodes and builds a new list of child nodes that are processed by the WildcardQueryNodeProcessor. If any of those new children have wildcards, it replaces the original phrase query node with a PhraseWildcardQueryNode that holds the new list of children.
  - This could be turned on or off by a config parameter.
- PhraseWildcardQueryNodeBuilder: a query builder that converts a PhraseWildcardQueryNode into a PhraseWildcardQuery using the PhraseWildcardQuery.Builder class. It iterates over the children of the PhraseWildcardQueryNode and for each child:
  - If it is a WildcardQueryNode, it is added as a MultiTermQuery to the PhraseWildcardQuery.Builder.
  - Otherwise, the single term is added to the PhraseWildcardQuery.Builder.
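
The decision logic described above can be sketched without any Lucene dependency. This is only a toy model under stated assumptions: PhraseWildcardSketch, Slot, process and plan are hypothetical names that stand in for the node processor and node builder, and the strings returned by plan merely name the PhraseWildcardQuery.Builder calls (addTerm, addMultiTerm) that each child is expected to map to. It is not the proof-of-concept code itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of the processor/builder decisions; none of these types are Lucene classes. */
final class PhraseWildcardSketch {

    /** One position in the phrase: either a literal term or a wildcard pattern. */
    static final class Slot {
        final String text;
        final boolean wildcard;

        Slot(String text, boolean wildcard) {
            this.text = text;
            this.wildcard = wildcard;
        }
    }

    /** True if the token contains an unescaped '*' or '?'. */
    static boolean hasWildcard(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '*' || c == '?') {
                return true;
            }
        }
        return false;
    }

    /**
     * Processor step: returns the slots for a wildcard phrase, or empty when no
     * token has a wildcard (the original phrase node would then be kept as is).
     */
    static Optional<List<Slot>> process(List<String> phraseTokens) {
        List<Slot> slots = new ArrayList<>();
        boolean anyWildcard = false;
        for (String token : phraseTokens) {
            boolean wildcard = hasWildcard(token);
            anyWildcard |= wildcard;
            slots.add(new Slot(token, wildcard));
        }
        return anyWildcard ? Optional.of(slots) : Optional.empty();
    }

    /** Builder step: names the builder call each slot would map to. */
    static List<String> plan(List<Slot> slots) {
        List<String> calls = new ArrayList<>();
        for (Slot slot : slots) {
            calls.add((slot.wildcard ? "addMultiTerm(" : "addTerm(") + slot.text + ")");
        }
        return calls;
    }
}
```

For the tokens quick, bro*, fox this plans addTerm(quick), addMultiTerm(bro*), addTerm(fox), while a phrase without wildcards yields an empty Optional and would stay an ordinary phrase query.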


jpountz · 2025-01-27T13:20:41Z

PhraseWildcardQuery is appealing, but its implementation makes trade-offs to work around the fact that it is not efficient when any of the wildcards expands to many terms. If you have a low-cardinality vocabulary, this is probably fine. Otherwise (e.g. English content), your queries may either be extremely costly if maxMultiTermExpansions is high, or miss matches (possibly all of them) if maxMultiTermExpansions is low. This makes me a bit uneasy about exposing it out of the box, as it could take users by surprise.
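
To make the trade-off concrete, here is a toy model in plain Java. It is not Lucene's actual expansion logic, and ExpansionCapSketch, expandPrefix and matches are made-up names for illustration. The assumption is simply that a prefix wildcard is expanded against a sorted vocabulary and truncated once the cap is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of capped wildcard expansion; not Lucene's actual algorithm. */
final class ExpansionCapSketch {

    /** Expands a prefix wildcard against the vocabulary, keeping at most maxExpansions terms. */
    static List<String> expandPrefix(TreeSet<String> vocabulary, String prefix, int maxExpansions) {
        List<String> expanded = new ArrayList<>();
        // Terms sharing the prefix are contiguous in sorted order, starting at the prefix itself.
        for (String term : vocabulary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || expanded.size() >= maxExpansions) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }

    /** True if the two-word document matches the phrase "first prefix*" under the cap. */
    static boolean matches(String[] doc, String first, String prefix,
                           TreeSet<String> vocabulary, int maxExpansions) {
        List<String> expanded = expandPrefix(vocabulary, prefix, maxExpansions);
        return doc.length == 2 && doc[0].equals(first) && expanded.contains(doc[1]);
    }
}
```

With the vocabulary search, searchable, searched, searcher, searching, a cap of 2 expands search* to only search and searchable, so the document "fast searching" is missed. A cap of 10 finds it, at the cost of checking every expansion.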

For reference, there are other approaches to wildcard search that have different trade-offs. One is indexing edge n-grams, so that prefix wildcard expressions can be searched as simple terms (what Elasticsearch does when you configure text fields with index_prefixes). Another is indexing n-grams (typically with n=3) for the whole input, using the n-grams to find a superset of the matches, and then verifying the wildcard phrase against the raw data of this superset (what the Elasticsearch wildcard field does under the hood).
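
Both ideas can be sketched in a few lines of plain Java. This is a simplified illustration, not how Elasticsearch or Lucene implement them: NgramSketch and all of its methods are hypothetical, documents are treated as raw strings, and the wildcard is matched against the whole string rather than term by term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Toy versions of edge n-gram indexing and trigram filtering with verification. */
final class NgramSketch {

    /** Edge n-grams of a term, so a prefix wildcard becomes an exact term lookup. */
    static List<String> edgeNgrams(String term, int minLen, int maxLen) {
        List<String> grams = new ArrayList<>();
        for (int len = minLen; len <= Math.min(maxLen, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    /** All distinct trigrams of the text. */
    static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    /** Builds trigram -> document id postings over the raw documents. */
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String gram : trigrams(docs.get(id))) {
                postings.computeIfAbsent(gram, g -> new HashSet<>()).add(id);
            }
        }
        return postings;
    }

    /** Translates a wildcard pattern ('*' and '?') into a regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Phase 1 narrows to a superset via trigrams; phase 2 verifies against the raw text. */
    static List<Integer> search(List<String> docs, Map<String, Set<Integer>> postings, String wildcard) {
        Set<Integer> candidates = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            candidates.add(id);
        }
        // Every trigram of every literal piece of the pattern must occur in a candidate.
        for (String literal : wildcard.split("[*?]")) {
            for (String gram : trigrams(literal)) {
                candidates.retainAll(postings.getOrDefault(gram, Set.of()));
            }
        }
        Pattern regex = toRegex(wildcard);
        List<Integer> hits = new ArrayList<>();
        for (int id : candidates) {
            if (regex.matcher(docs.get(id)).find()) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

For example, edgeNgrams("fox", 1, 5) yields f, fo, fox. In the two-phase search for "quick bro* fox", a document such as "a fox and a quick brother" survives the trigram filter because it contains all the required trigrams, and is only rejected by the verification step, which is why the candidate set is a superset rather than the final answer.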

aliciavargas added the type:enhancement label Jan 23, 2025
