IndexOutOfRangeException when searching #296

Closed
jregnier opened this issue Jun 16, 2020 · 22 comments · Fixed by #386

Comments

@jregnier

Hello, I'm getting an IndexOutOfRangeException when searching in some cases. It happens maybe 10% of the time, so I'm unsure what is causing it. See below for the search code and stack trace. I suspect it might be something to do with the query, but that's more of a guess. Any guidance on this would be very much appreciated.

var sort = new Sort(new SortField(null, SortFieldType.DOC));
return _searcher.Search(Query, _reader.NumDocs, sort);

FATAL Update Data Set System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Lucene.Net.Store.ByteArrayDataInput.ReadVInt32()
   at Lucene.Net.Codecs.BlockTreeTermsReader.FieldReader.IntersectEnum.Frame.NextLeaf()
   at Lucene.Net.Codecs.BlockTreeTermsReader.FieldReader.IntersectEnum.Next()
   at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
   at Lucene.Net.Search.ConstantScoreAutoRewrite.Rewrite(IndexReader reader, MultiTermQuery query)
   at Lucene.Net.Join.ToParentBlockJoinQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
   at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
   at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n, Sort sort)

@NightOwl888
Contributor

Thanks for the report. The stack trace is helpful as it indicates an index read failure, but could you provide more sample setup code? It would be helpful if you could provide the following:

  1. Which Lucene version compatibility setting you are using
  2. Sample code to create an index (including field/analyzer setup)
  3. Sample query code to read the index
  4. Some sample data

It is much more likely we will solve this if we have code that can be run to duplicate the conditions at the time of the exception, either as a standalone console app or a test.

I suspect there may be a mismatch between the BlockTreeTermsWriter and the BlockTreeTermsReader. It may be unrelated, but there is a comment in the code in the BlockTreeTermsWriter that indicates an index out of range exception when asserting the "floor blocks" data. Floor blocks are used if you have more than 48 terms in a block.

@jregnier
Author

Thanks for the quick response. I can't really supply sample data, since the data is very diverse and the failure could involve many things. Hopefully, the breakdown of my setup will be enough.

// The analyzer uses a CharTokenizer with a lowercase filter

var dir = FSDirectory.Open(indexFolderPath);
var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, {analyzer});
_writer = new IndexWriter(dir, indexConfig);

var parentDocument = new Document();
parentDocument.Add({BinaryDocValuesField});
parentDocument.Add({StringField});
parentDocument.Add({StringField});
parentDocument.Add({StringField});

var childDocument = new Document();
childDocument.Add({StringField});
childDocument.Add({StringField});
childDocument.Add({TextField}); // not stored
childDocument.Add({StringField}); // only some documents will have this

// we are creating a parent child relationship with this list of documents
_writer.AddDocuments(documentList);

_reader = DirectoryReader.Open(FSDirectory.Open(indexFolderPath));
_searcher = new IndexSearcher(_reader);
BooleanQuery.MaxClauseCount = int.MaxValue;

var searchString = "value:test search string";
var terms = new SpanMultiTermQueryWrapper(new WildcardQuery(new Term(fieldName, word))); // terms is a list of these for each word
var childQuery = new SpanNearQuery(terms, 0, true);

var parentFilter = new FixedBitSetCachingWrapperFilter(
new QueryWrapperFilter(
new TermQuery(
new Term(fieldName, value))));

var query = new ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Max);

var sort = new Sort(new SortField(null, SortFieldType.DOC));
return _searcher.Search(query, _reader.NumDocs, sort);

@jregnier
Author

Any ideas on this?

@NightOwl888
Contributor

I traced an issue that was causing another IndexOutOfRangeException in the ThaiTokenizer to an invalid cast from int to char that was causing it to filter out surrogate pairs when it shouldn't have been. This is the second such issue I found this week, and searching through the analyzers for the string (char), this appears to be a problem that affects several of them. This is definitely a bug that we will need to address.
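
To illustrate the kind of bug described above, here is a minimal C# sketch. It is not the actual ThaiTokenizer code, and the class and method names are hypothetical. It shows how casting an `int` code point to `char` (in the default unchecked context) keeps only the low 16 bits, so characters above U+FFFF are silently corrupted, whereas `char.ConvertFromUtf32` preserves them as a surrogate pair.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class CodePointDemo
{
    // Buggy pattern: the cast keeps only the low 16 bits of the code point,
    // so any character above U+FFFF turns into an unrelated char.
    public static string NarrowingCast(int codePoint)
    {
        return ((char)codePoint).ToString();
    }

    // Safer pattern: produces a surrogate pair for code points above U+FFFF.
    public static string FromCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }
}
```

For example, `CodePointDemo.NarrowingCast(0x1F600)` yields the single char U+F600, while `CodePointDemo.FromCodePoint(0x1F600)` yields a two-char string holding a valid surrogate pair.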

It might also be useful to know whether the problem you are seeing is happening in all cultures. In Java, none of the methods are culture-sensitive, so to match the behavior we should be using the invariant culture. .NET has several methods that are culture-sensitive by default. While we have gone through to ensure we are not calling any of them in places where we shouldn't be, there could be a case or two that were missed or were recently added. If you switch the current thread to the invariant culture, does it cause the problem to go away?
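
For anyone who wants to try the invariant-culture experiment, one possible way to do it is sketched below. The `InvariantCultureScope` helper is hypothetical (it is not a Lucene.NET API); it temporarily sets the current thread's culture to `CultureInfo.InvariantCulture`, runs a delegate, and then restores the original culture.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class InvariantCultureScope
{
    // Runs 'action' with the current thread's culture set to the invariant
    // culture, restoring the previous culture afterwards.
    public static T Run<T>(Func<T> action)
    {
        CultureInfo original = CultureInfo.CurrentCulture;
        CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
        try
        {
            return action();
        }
        finally
        {
            CultureInfo.CurrentCulture = original;
        }
    }
}
```

A search could then be wrapped as `InvariantCultureScope.Run(() => _searcher.Search(query, n, sort))`, assuming the searcher and query variables from the code posted earlier in this thread. Indexing would need the same treatment for the experiment to be meaningful.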

@mlaufer

mlaufer commented Jul 30, 2020

Hi @NightOwl888,

I can basically confirm the behavior described here when using FuzzyQuery. Most of the time it works, but sometimes searches fail with a very similar exception:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32)
   at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify)
   at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance)
   at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm)
   at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init)
   at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions)
   at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts)
   at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
   at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
   at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
   at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n, Sort sort)

We are using Lucene 4.8. For now, we are "solving" this by wrapping Search() in a try/catch and retrying the search when the exception is caught, which greatly reduces the number of failed searches.
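
A retry workaround of this kind might look roughly like the sketch below. The `SearchRetry` class, its `Execute` method, and the default of three attempts are assumptions made for illustration, not the code actually used here. It only masks the symptom and does not address the underlying bug.

```csharp
using System;

// Hypothetical workaround helper for illustration only; not part of Lucene.NET.
public static class SearchRetry
{
    // Invokes 'search' up to 'maxAttempts' times, retrying only when an
    // IndexOutOfRangeException is thrown. The exception from the final
    // attempt is allowed to propagate to the caller.
    public static T Execute<T>(Func<T> search, int maxAttempts = 3)
    {
        if (search == null) throw new ArgumentNullException(nameof(search));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return search();
            }
            catch (IndexOutOfRangeException) when (attempt < maxAttempts)
            {
                // Swallow the intermittent failure and try again.
            }
        }
    }
}
```

Usage would be along the lines of `SearchRetry.Execute(() => searcher.Search(query, n, sort))`, where `searcher`, `query`, `n`, and `sort` stand for whatever the calling code already has.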

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Aug 2, 2020
@NightOwl888
Contributor

Okay, I have fixed some culture sensitivity issues with the analyzers that could be leading to this. Could someone please check the packages in the nuget artifact here to see whether the IndexOutOfRangeException still exists?

@mlaufer

mlaufer commented Aug 3, 2020

I installed the NuGet artifact locally and the error seems to occur less often than before, but I'm still able to reproduce it:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32)
   at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify)
   at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance)
   at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm)
   at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init)
   at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions)
   at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts)
   at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
   at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
   at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
   at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
   at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n, Sort sort)

With the added retry functionality, I wasn't able to produce the error twice in a row using the same FuzzyQuery. So it seems to be nearly fixed, with only a small bug remaining.

@NightOwl888
Contributor

NightOwl888 commented Aug 3, 2020

Thanks for the info. I suspect this is a different issue than the one from the OP.

Can you tell me which version the error first appeared in? There have been some recent changes to both Automaton and FuzzyTermsEnum to improve performance and I am sure it can be narrowed to a few suspect commits pretty easily.

@mlaufer

mlaufer commented Aug 3, 2020

We just recently implemented the FuzzyQuery on 4.8.0-beta00008 and updated to 4.8.0-beta00011, so the error could have happened in an earlier version. I will try a downgrade to an older version and check if the error still occurs and get back to you.

@mlaufer

mlaufer commented Aug 3, 2020

I'm unable to reproduce the bug on 4.8.0-beta00006, I will try 4.8.0-beta00007 next. Hope this helps.

I was also unable to reproduce on 4.8.0-beta00007.

@jregnier
Author

jregnier commented Aug 7, 2020

Sorry, I've been off for a few days. Unfortunately, I'm not able to repro it on my side, so I can't really test it out.

@NightOwl888
Contributor

@mlaufer - Since you are able to reliably reproduce this, is it possible you can submit a PR with a test that fails (no matter how rarely) with this problem?

@epDugas

epDugas commented Sep 16, 2020

If it helps any:
I get the same exception, with the same stack trace, when using WildcardQuery on a particular StringField (the field contains a string of ints). If I wrap the WildcardQuery in a single-item BooleanQuery, I do not experience the issue. This seems to happen when I add a StringField that is the reverse of another.

@NightOwl888
Contributor

Could someone please provide a minimal example I can run? Even with the descriptions here, there is not enough info to piece together both the code and the data to reproduce this without research and trial and error. There is probably a test in TestWildcard that is similar enough to what you are doing to use as a starting point. Just modify it accordingly and post it here so we can run it. If you need to, use the [Repeat] attribute to run it multiple times to force a failure.

Also, what platform is this happening on and is this x86 or x64?

Note there are now 8 known failing tests on .NET Framework under x86 in 4.8.0-beta00011 and prior, several of which relate to FuzzyTermsEnum and TopTermsRewrite. These test failures go away with optimizations disabled, indicating they are likely JIT optimization bugs of some kind. Even in 4.8.0-beta00012 there are still 4 tests failing, and it will be difficult to pin these down because the failures are not happening in debug mode. These tests do not fail on .NET Core/x86 or on .NET Framework/x64.

4.8.0-beta00012 can be downloaded at https://dist.apache.org/repos/dist/dev/lucenenet/ (it is currently pending the release vote, which takes 72 hours). Could someone please confirm this problem still exists on 4.8.0-beta00012?

@epDugas

epDugas commented Sep 21, 2020

Platform = x64.
Build = 4.8.0-beta00012
Example (Note: before adding reversed fields, the issue does not present itself):

Analyzer: KeywordAnalyzer

Field Definitions:
new StringField("Address", string.Empty, Field.Store.YES)
new StringField("Address" + "_Reversed", string.Empty, Field.Store.YES)
new StringField("Zip", string.Empty, Field.Store.YES)
new StringField("Zip" + "_Reversed", string.Empty, Field.Store.YES)

Query:
var query = new BooleanQuery
{
    { new WildcardQuery(new Term("Address", "hwy")), Occur.MUST },
    { new WildcardQuery(new Term("Zip", "06")), Occur.MUST },
};

indexSearcher.Search(query, 10);

NOTE: If I name the "reversed" columns "_Reversed" + Name, the issue goes away.

I apologize, but I no longer have the data to reproduce the exception. I rebuilt the index with a different name for the reversed columns, and the issue seems to have gone away. Rebuilding with the problematic field names would take a long time.

@willson556

willson556 commented Oct 14, 2020

I am able to reliably reproduce with one of my datasets but I'm not sure if I could write a test to fail. I'm running on .NET Core/x64 with 4.8.0-beta00012.

Similar stack trace to everyone after OP:

   at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32) 
   at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify) 
   at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance) 
   at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm) 
   at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init) 
   at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions) 
   at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts) 
   at Lucene.Net.Search.MultiTermQuery.RewriteMethod.GetTermsEnum(MultiTermQuery query, Terms terms, AttributeSource atts) 
   at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector) 
   at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query) 
   at Lucene.Net.Search.MultiTermQuery.Rewrite(IndexReader reader) 
   at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader) 
   at Lucene.Net.Search.IndexSearcher.Rewrite(Query original) 
   at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query) 
   at Lucene.Net.Search.IndexSearcher.Search(Query query, Filter filter, Int32 n) 
   at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n) 

Using this analyzer (I'm just starting to come up to speed with Lucene so I'm not sure the arrangement of filters actually makes any sense):

public class NGramAnalyzer : Analyzer
{
    private readonly LuceneVersion version;
    private readonly int minGram;
    private readonly int maxGram;

    public NGramAnalyzer(LuceneVersion version, int minGram = 2, int maxGram = 8)
    {
        this.version = version;
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    /// <inheritdoc />
    protected override TextReader InitReader(string fieldName, TextReader reader)
    {
        var charMap = new NormalizeCharMap.Builder();
        charMap.Add("_", " ");
        return new MappingCharFilter(charMap.Build(), reader);
    }

    /// <inheritdoc />
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Splits words at punctuation characters, removing punctuation.
        // Splits words at hyphens, unless there's a number in the token...
        // Recognizes email addresses and internet hostnames as one token.
        var tokenizer = new StandardTokenizer(version, reader);

        TokenStream filter = new StandardFilter(version, tokenizer);

        // Normalizes token text to lower case.
        filter = new LowerCaseFilter(version, filter);

        // Removes stop words from a token stream.
        filter = new StopFilter(version, filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        filter = new EnglishMinimalStemFilter(filter);

        filter = new NGramTokenFilter(version, filter, minGram, maxGram);
        return new TokenStreamComponents(tokenizer, filter);
    }
}

Setup is then:

var indexStore = new RAMDirectory();
var indexConfig = new IndexWriterConfig(Version, Analyzer);
indexWriter = new IndexWriter(indexStore, indexConfig);
initialIndexingTask = Task.Run(() =>
{
    var stopwatch = Stopwatch.StartNew();
    indexWriter.AddDocuments(collection.Select(GetAndSubscribeToDocument));
    indexWriter.Commit();
    Debug.WriteLine(@$"{typeof(TDocument)} Indexing: {stopwatch.ElapsedMilliseconds}ms");
});

Searching after initial indexing is complete is done with:

using var reader = DirectoryReader.Open(indexWriter.Directory);
var searcher = new IndexSearcher(reader);

Query? parsedQuery;
try
{
    var queryParser = new MultiFieldQueryParser(Version, DefaultSearchFields, Analyzer);
    var terms = new HashSet<Term>();
    queryParser.Parse(query).Rewrite(reader).ExtractTerms(terms);

    var boolQuery = new BooleanQuery();
    terms.ForEach(t =>
    {
        boolQuery.Add(new FuzzyQuery(t), Occur.SHOULD);
        boolQuery.Add(new WildcardQuery(t), Occur.SHOULD);
    });

    parsedQuery = boolQuery;
}
catch (Exception)
{
    // TODO: User feedback
    return new (TDocument doc, float score)[0];
}

var hits = searcher.Search(parsedQuery, resultLimit);

I've archived off the dataset and code so that I can hopefully go back and gather more data to help troubleshoot. It's worth noting that in my current repro case, I have 4 separate instances of this (RAMDirectory, IndexWriter, and Reader+Searcher) all running at the same time (and with nearly identical datasets). A quick look through the code up and down the stack trace didn't show me anything in Lucene that was obviously shared between those instances that could be the culprit.

@NightOwl888
Contributor

@willson556

Thanks for the info.

If a test is too much to ask, could you distill this down to a console app using the failing data set and put it in a repo to share?

If the data is sensitive, do note that both Azure DevOps and BitBucket allow you to create free private repos that you can then share by invitation. Just use the email address in my GitHub profile.

@willson556

willson556 commented Oct 15, 2020

> If a test is too much to ask, could you distill this down to a console app using the failing data set and put it in a repo to share?

Yeah, I should be able to get that to you by the end of the week. Thanks for the prompt response!

@willson556

@NightOwl888 Repo is posted and I just invited you to it. The console app prompts you to enter a query. The suggested query provided in the prompt fails nearly every time for me.

Thanks again!

@AntonOttoW

AntonOttoW commented Nov 4, 2020

In my case, when doing a fuzzy search using 4.8.0-beta00012 and doing load testing I get the IndexOutOfRangeException with the following stack trace:

at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32)
at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify)
at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance)
at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm)
at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init)
at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions)
at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts)
at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query)
at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
at Lucene.Net.Search.FilteredQuery.Rewrite(IndexReader reader)
at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
at Lucene.Net.Search.IndexSearcher.Search(Query query, Filter filter, Int32 n, Sort sort)

I'm running a thousand requests, ramped up over 60 seconds, and I get an error rate of about 20 to 30 percent.

I then added a retry whenever I catch this exception, which brought the error rate down to 1 to 2 percent. (I don't count errors that succeed on retry, only requests that still failed after 3 attempts.)

Interestingly, when I removed the fuzzy search, I was able to complete 1,000 requests successfully with no issues.

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Nov 5, 2020
…lity checking, as implementing Equals() to compare other than reference equality causes IndexOutOfRangeException to randomly occur when using FuzzyTermsEnum. Fixes apache#296.
NightOwl888 added a commit that referenced this issue Nov 5, 2020
…lity checking, as implementing Equals() to compare other than reference equality causes IndexOutOfRangeException to randomly occur when using FuzzyTermsEnum. Fixes #296.
@NightOwl888
Contributor

@willson556

Thanks for submitting a working failure case. I was able to use it to create a test project that contained a failing test. From there, I was able to confirm that the error was introduced between 4.8.0-beta00007 and 4.8.0-beta00008, and by checking out commits in git's detached HEAD mode the issue was traced to commit 0eaf765. Unfortunately, I had to start all over again at that point, since it was a merge of 60 commits, but eventually I ended up here: e1ead06.

It turned out to be a simple misinterpretation that id means "unique", when in fact the object reference was the unique identifier that should be used in the Equals() implementation.
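
As a rough illustration of why this matters, the sketch below contrasts the two approaches. It uses hypothetical stand-in classes, not Lucene.NET's actual types, and assumes (as described above) that ids are not guaranteed to be unique. When Equals() compares ids, two distinct objects that happen to share an id collapse into one entry in a hash-based collection. With the default reference equality they stay distinct.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in types for illustration only; not Lucene.NET's classes.
public sealed class IdEqualState
{
    public int Id;
    public IdEqualState(int id) { Id = id; }

    // Problematic pattern: treats the id as if it were a unique identity.
    public override bool Equals(object obj) => obj is IdEqualState other && other.Id == Id;
    public override int GetHashCode() => Id;
}

public sealed class RefEqualState
{
    public int Id;
    public RefEqualState(int id) { Id = id; }
    // No Equals/GetHashCode override: the object reference is the identity.
}

public static class StateEqualityDemo
{
    // Two distinct objects with the same id are merged into a single entry.
    public static int CountWithIdEquality()
    {
        var set = new HashSet<IdEqualState> { new IdEqualState(0), new IdEqualState(0) };
        return set.Count;
    }

    // Two distinct objects with the same id remain two entries.
    public static int CountWithReferenceEquality()
    {
        var set = new HashSet<RefEqualState> { new RefEqualState(0), new RefEqualState(0) };
        return set.Count;
    }
}
```

Here `CountWithIdEquality()` returns 1 and `CountWithReferenceEquality()` returns 2. Code that sizes an array from one count and indexes it using the other assumption could plausibly run past the end of the array, which is consistent with the intermittent IndexOutOfRangeException reported in this thread.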

@NightOwl888
Contributor

BTW - if anyone wants to try out these changes before they are rolled into a release to confirm it is a complete fix, the NuGet packages can be downloaded from the nuget artifact here: https://dev.azure.com/LuceneNET-Temp/Lucene.NET/_build/results?buildId=1171&view=artifacts&type=publishedArtifacts
