The Lucene docs specify that wildcard search is only supported for single terms but not phrases (link). We’d like to support wildcard search in phrase queries in our Lucene-based service and in our research we came across the PhraseWildcardQuery, which we see is marked as experimental.
The PhraseWildcardQuery can support our needs, and we tested it out in a proof of concept, but it is not actually used by any of the existing Lucene query parsers. We see two options for implementing query parsing that supports phrase wildcards: (1) the StandardQueryParser could be modified to optionally allow phrases with wildcards and use the PhraseWildcardQuery for them, or (2) there could be a new query parser that specifically supports phrases with wildcards.
So we have two questions:
What would it take to make PhraseWildcardQuery fully supported (no longer marked experimental)?
Can Lucene provide a query parser that utilizes the PhraseWildcardQuery?
For reference, this is how we extended the StandardQueryParser to support wildcards in phrases in our proof of concept:
// Start from the stock flexible query parser.
luceneParser =
    new org.apache.lucene.queryparser.flexible.standard.StandardQueryParser(analyzer);
luceneParser.setDefaultOperator(operator);
// Register our custom builder for the new node type.
StandardQueryTreeBuilder builder = new StandardQueryTreeBuilder();
builder.setBuilder(PhraseWildcardQueryNode.class, new PhraseWildcardQueryNodeBuilder());
luceneParser.setQueryBuilder(builder);
// Append our custom processor to the end of the standard pipeline.
StandardQueryNodeProcessorPipeline processor =
    new StandardQueryNodeProcessorPipeline(luceneParser.getQueryConfigHandler());
processor.add(new PhraseWildcardQueryNodeProcessor(luceneParser.getQueryConfigHandler()));
luceneParser.setQueryNodeProcessor(processor);
The proof of concept uses the following new components:
- PhraseWildcardQueryNode - represents a query node for a phrase query with wildcards.
- PhraseWildcardQueryNodeProcessor - a query node processor that can be added to the end of the StandardQueryNodeProcessorPipeline. If it receives a phrase query node, it goes through its child nodes and builds a new list of child nodes that are processed by the WildcardQueryNodeProcessor. If any of those new children have wildcards, it replaces the original phrase query node with a PhraseWildcardQueryNode holding the new list of children.
  - This could be turned on or off by a config parameter.
- PhraseWildcardQueryNodeBuilder - a query builder that converts a PhraseWildcardQueryNode into a PhraseWildcardQuery using the PhraseWildcardQuery.Builder class. It iterates over the children of the PhraseWildcardQueryNode and for each child:
  - If it's a WildcardQueryNode, we add it as a MultiTermQuery to the PhraseWildcardQuery.Builder.
  - Otherwise, we add the single term to the PhraseWildcardQuery.Builder.
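To make the processor and builder steps above easier to follow, here is a small Lucene-free model of the same decision logic. It is only a sketch: the class and method names (PhraseWildcardSketch, classify, needsPhraseWildcardQuery, plan) are hypothetical, tokens are split on whitespace rather than by an analyzer, and escaped wildcard characters are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Lucene-free model of the phrase-with-wildcards dispatch. All names are hypothetical. */
final class PhraseWildcardSketch {

    /** One position in the phrase: a literal term or a wildcard pattern. */
    static final class Part {
        final String text;
        final boolean wildcard;

        Part(String text, boolean wildcard) {
            this.text = text;
            this.wildcard = wildcard;
        }
    }

    /** Models the processor step: classify each whitespace-separated token. */
    static List<Part> classify(String phrase) {
        List<Part> parts = new ArrayList<>();
        for (String token : phrase.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty phrase yields no parts
            }
            boolean wildcard = token.indexOf('*') >= 0 || token.indexOf('?') >= 0;
            parts.add(new Part(token, wildcard));
        }
        return parts;
    }

    /** The phrase node is only replaced when at least one child has a wildcard. */
    static boolean needsPhraseWildcardQuery(List<Part> parts) {
        for (Part part : parts) {
            if (part.wildcard) {
                return true;
            }
        }
        return false;
    }

    /** Models the builder step: wildcard children become multi-terms, the rest single terms. */
    static List<String> plan(List<Part> parts) {
        List<String> steps = new ArrayList<>();
        for (Part part : parts) {
            steps.add((part.wildcard ? "multi-term: " : "single term: ") + part.text);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<Part> parts = classify("quick bro* fox");
        System.out.println(needsPhraseWildcardQuery(parts));
        System.out.println(plan(parts));
    }
}
```

In this model a phrase without any wildcard token is left alone, which corresponds to keeping the original phrase query node.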
PhraseWildcardQuery is appealing, but its implementation makes trade-offs to work around the fact that it doesn't work efficiently if any of the wildcards expands to many terms. If you have a low-cardinality vocabulary, this is probably fine. Otherwise (e.g. English content), your queries may either be extremely costly if maxMultiTermExpansions is high, or miss matches (possibly all of them) if maxMultiTermExpansions is low. This makes me a bit uneasy about exposing it out of the box, as it could take users by surprise.
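The missed-match side of that trade-off can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how PhraseWildcardQuery is implemented: the class ExpansionCapSketch and its methods are hypothetical, only trailing-star patterns are handled, and the cap applies to a single wildcard taken in sorted term order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of capping wildcard expansions. All names are hypothetical. */
final class ExpansionCapSketch {

    /** Expands a trailing-star pattern such as "bro*", keeping at most maxExpansions terms. */
    static List<String> expand(TreeSet<String> vocabulary, String pattern, int maxExpansions) {
        String prefix = pattern.substring(0, pattern.length() - 1);
        List<String> expansions = new ArrayList<>();
        // Terms sharing the prefix are contiguous in sorted order, starting at the prefix.
        for (String term : vocabulary.tailSet(prefix)) {
            if (!term.startsWith(prefix) || expansions.size() >= maxExpansions) {
                break;
            }
            expansions.add(term);
        }
        return expansions;
    }

    /** True if the document contains `first` immediately followed by one of the expansions. */
    static boolean matches(String doc, String first, List<String> expansions) {
        String[] tokens = doc.split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(first) && expansions.contains(tokens[i + 1])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TreeSet<String> vocabulary =
            new TreeSet<>(Arrays.asList("broken", "broth", "brother", "brown", "quick"));
        String doc = "the quick brown fox";
        // Low cap: "brown" is cut off, so the phrase "quick bro*" misses the document.
        System.out.println(matches(doc, "quick", expand(vocabulary, "bro*", 2)));
        // High cap: every expansion is considered, at a higher cost.
        System.out.println(matches(doc, "quick", expand(vocabulary, "bro*", 10)));
    }
}
```

With a cap of 2, only "broken" and "broth" are kept, so the document containing "quick brown" is not found. Raising the cap finds it, but the work grows with the number of expanded terms.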
For reference, there are other approaches to wildcard search that have different trade-offs. One is indexing (edge) n-grams, so that your wildcard expressions can actually be indexed and searched as simple terms (what Elasticsearch does when you configure text fields with index_prefixes: true). Another is indexing n-grams (typically with n=3) for the whole input, using the n-grams to find a superset of the matches, and then verifying the wildcard phrase against the raw data of this superset (what the Elasticsearch wildcard field does under the hood).
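The second approach (n-grams to find a superset, then verification against the raw data) can be sketched in plain Java. This is an illustrative model only and not how Elasticsearch implements it: the class TrigramWildcardSketch is hypothetical, the index lives in memory, and `*` and `?` are assumed to match non-whitespace characters within a single word.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** In-memory trigram index with a verification pass. All names are hypothetical. */
final class TrigramWildcardSketch {
    private final List<String> docs = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes every character trigram of the raw text, spaces included. */
    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String gram : trigrams(text)) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Step 1: documents containing every trigram of every literal fragment (a superset). */
    Set<Integer> candidates(String pattern) {
        Set<Integer> result = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            result.add(id);
        }
        for (String literal : pattern.split("[*?]+")) {
            for (String gram : trigrams(literal)) {
                result.retainAll(postings.getOrDefault(gram, Collections.<Integer>emptySet()));
            }
        }
        return result;
    }

    /** Step 2: verify each candidate against its raw text. */
    List<String> search(String pattern) {
        Pattern regex = toRegex(pattern);
        List<String> hits = new ArrayList<>();
        for (int id : candidates(pattern)) {
            if (regex.matcher(docs.get(id)).find()) {
                hits.add(docs.get(id));
            }
        }
        return hits;
    }

    /** Translates the wildcard phrase into a regex aligned to word boundaries. */
    static Pattern toRegex(String pattern) {
        StringBuilder sb = new StringBuilder("(?<!\\S)");
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                sb.append("\\S*");
            } else if (c == '?') {
                sb.append("\\S");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.append("(?!\\S)").toString());
    }
}
```

For example, after adding "the quick brown fox" and "brown squid bread", the pattern "qui* bro*" yields both documents as candidates because each contains the trigrams "qui", " br" and "bro", but only the first survives verification. Query cost then depends on the size of the candidate set rather than on how many terms a wildcard expands to.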