Skip to content

Port Of MultiTermQuery

rdelbru edited this page Sep 29, 2011 · 7 revisions

Port Of MultiTermQuery

This document covers the tasks that have to be done to port the MultiTermQuery features of Lucene (e.g., FuzzyQuery, WildcardQuery, etc.) to SIREn, and contains notes about problems encounter during the development, as well as the solutions we chose.

Task Overview

Rewrite SirenMultiTermQuery

  • SirenConstantScoreQuery
  • SirenMultiTermQueryWrapperFilter
  • SirenConstantScoreAutoRewrite
  • SirenFuzzyQuery
  • SirenNumericRangeQuery
  • SirenPrefixQuery
  • SirenTermRangeQuery
  • SirenWildcardQuery

SirenMultiTermQuery

Change MultiTermQuery into SirenMultiTermQuery Change ConstantScoreQuery into SirenConstantScoreQuery Change MultiTermQueryWrapperFilter into SirenMultiTermQueryWrapperFilter Change TopTermsScoringBooleanQueryRewrite into TopTermsScoringSirenBooleanQueryRewrite Change TopTermsBoostOnlyBooleanQueryRewrite into TopTermsBoostOnlySirenBooleanQueryRewrite Change ConstantScoreAutoRewrite into SirenConstantScoreAutoRewrite

  • SirenScoringRewrite had to be copied from ScoringRewrite in order to use SirenBooleanQuery
  • because of that, SirenTermCollectingRewrite had to be copied from TermCollectingRewrite, because TermCollectingRewrite overwrite the rewrite method in (Siren)MultiTermQuery
  • to implement a SirenFuzzyQuery, a call to SirenMultiTermQuery.TopTermsScoringSirenBooleanQueryRewrite has to be done. Since this class extends TopTermsRewrite, which itself extends TermCollectingRewrite, the SirenTopTermsRewrite had to be copied from TopTermsRewrite in order to extends SirenTermCollectingRewrite
  • SirenConstantScoreAutoRewrite had to be copied from ConstantScoreAutoRewrite because it has the default controlling access. Look at TestMultiTermConstantScore, and maybe convert the test

SirenConstantScoreAutoRewrite

Change BooleanQuery to SirenBooleanQuery Change BooleanClause to SirenBooleanClause Change TermQuery to SirenTermQuery Change MultiTermQuery to SirenMultiTermQuery

SirenFuzzyQuery

Need to rewrite MultiTermQuery#TopTermsScoringBooleanQueryRewrite into SirenMultiTermQuery#TopTermsScoringSirenBooleanQueryRewrite MultiTermQuery#TopTermsBoostOnlyBooleanQueryRewrite into SirenMultiTermQuery#TopTermsBoostOnlySirenBooleanQueryRewrite Change in FuzzyQuery#constructor replace MultiTermQuery#TopTermsScoringBooleanQueryRewrite by SirenMultiTermQuery#TopTermsScoringSirenBooleanQueryRewrite

Convert TestFuzzyQuery

SirenPrefixQuery

The PrefixQuery relies on a Filter approach when "the number of matched terms or matched documents is non trivial". In this case, the query rewriting is delegated to the rewrite method MultiTermQuery#CONSTANT_SCORE_FILTER_REWRITE. The decision is taken by ConstantScoreAutoRewrite and is based on a "term count cutoff" and "document count percent".

A filter is created by aggregating the document ids of the enumeration of TermDocs into a bitset by MultiTermQueryWrapperFilter.

However, in our case, we do not have only the document ids, but also the tuple and cell ids. Given the fact that the range of tuple and cell ids within a document is generally relatively small, it makes perfectly sense to use also additional bitset to store them. In this case, the Filter will be composed of three elements:

  • OpenBitSet for storing document ids
  • OpenBitSet[] for storing tuple ids, with one OpenBitSet per document id stored.
  • OpenBitSet[] for storing cell ids, with one OpenBitSet per document id stored.

During query processing, the bitset for documents ids will be used until a candidate document is found. When iterating over document ids, we will increment a pointer to keep track of the position of the related bitset in the the OpenBitSet arrays. This approach looks ideal and quite optimised.

However, the problem is how to create such filter object based on multiple TermDocs. We haven't found a proper solution. Any approaches will require either too much memory or are too slow (require multiple passes to create the OpenBitSet arrays) which annihilate the gain provided by the bitset data structure approach.

Another solution was to read the multiple TermDocs in parallel, in order to create the bitset arrays in one pass. However, again this approach will generate too many disk seeks and might slow down the creation of the filter. This might be investigated in the future (with SSD drives this might not be a problem).

Therefore, since there is not a proper solution to this problem, the filter-based query rewriting method will not be ported. This means that certain queries (generating a large number of terms) will be very slow.

SirenNumericRangeQuery

NumericRange relies on filter-based rewrite method for big precision steps. For the same reasons than PrefixQuery, we had to disable the filter-based rewrite optimisation.

The performance is expected to be good for 32 bit (int/float) ranges with precisionStep <8 and 64 bit (long/double) ranges with precisionStep <6. In the other cases, bad performance (similar to SirenTermRangeQuery) has to be expected as the number of terms is likely to be high.

SirenNumericRangeQuery, SirenWildcardQuery, SirenTermRangeQuery

  • Change super class reference MultiTermQuery to SirenMultiTermQuery
  • Rewrite part of the unit tests for each of them