IndexOutOfRangeException when searching #296
Comments
Thanks for the report. The stack trace is helpful as it indicates an index read failure, but could you provide more sample setup code? It would be helpful if you could provide the following:
It is much more likely we will solve this if we have code that can be run to duplicate the conditions at the time of the exception, either as a standalone console app or a test. I suspect there may be a mismatch between the |
Thanks for the quick response, I can't really supply the sample data since it could be many things. The data is very diverse. Hopefully, the breakdown of my setup will be enough.
`analyzer uses a chartokenizer with a lowercase filter
var dir = FSDirectory.Open(indexFolderPath);
var parentDocument = new Document();
var childDocument = new Document();
// we are creating a parent child relationship with this list of documents
_reader = DirectoryReader.Open(FSDirectory.Open(indexFolderPath));
var searchString = "value:test search string"
var parentFilter = new FixedBitSetCachingWrapperFilter(
var query = ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Max);
var sort = new Sort(new SortField(null, SortFieldType.DOC)); |
any ideas on this??? |
I traced an issue that was causing another It might also be useful to know whether the problem you are seeing is happening in all cultures. In Java, none of the methods are culture-sensitive, so to match the behavior we should be using the invariant culture. .NET has several methods that are culture-sensitive by default. While we have gone through to ensure we are not calling any of them in places where we shouldn't be, there could be a case or two that were missed or were recently added. If you switch the current thread to the invariant culture, does it cause the problem to go away? |
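For anyone wanting to try the invariant-culture check suggested above, here is a minimal sketch. `CultureScope` and `RunWithCulture` are hypothetical helper names, not part of Lucene.NET; the idea is to wrap the indexing or search call in the delegate and see whether the exception still occurs. The demo uses `string.ToUpper()` only because it is a well-known culture-sensitive call.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of Lucene.NET): runs a delegate with the
// current thread's culture temporarily switched, then restores it.
public static class CultureScope
{
    public static T RunWithCulture<T>(CultureInfo culture, Func<T> action)
    {
        CultureInfo previous = CultureInfo.CurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = culture;
            return action();
        }
        finally
        {
            // Always restore the original culture, even if action() throws.
            CultureInfo.CurrentCulture = previous;
        }
    }

    public static void Main()
    {
        // Invariant culture: casing follows culture-independent rules.
        Console.WriteLine(RunWithCulture(CultureInfo.InvariantCulture, () => "title".ToUpper()));

        // Turkish culture: on a runtime with full globalization data, lowercase 'i'
        // upper-cases to a dotted capital I, so the result can differ from the line above.
        Console.WriteLine(RunWithCulture(new CultureInfo("tr-TR"), () => "title".ToUpper()));
    }
}
```

If the exception disappears when the search runs inside `RunWithCulture(CultureInfo.InvariantCulture, ...)`, that would point toward a culture-sensitive call; if it persists, the cause is probably elsewhere.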
Hi @NightOwl888, I can basically confirm the behavior described here when using FuzzyQuery. Most of the time it works, but sometimes searches fail with a pretty similar exception:
We are using Lucene 4.8. For now, we are "solving" this by wrapping the Search() call in a try/catch and retrying the search when it fails, which greatly reduces the number of failed searches. |
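The retry workaround described above could look roughly like the sketch below. `SearchRetry.WithRetry` is a hypothetical helper, and the default of three attempts is an arbitrary choice. This only masks the symptom and is not a fix for the underlying bug.

```csharp
using System;

// Hypothetical helper: retries a search delegate when it throws
// IndexOutOfRangeException, up to maxAttempts attempts in total.
public static class SearchRetry
{
    public static T WithRetry<T>(Func<T> search, int maxAttempts = 3)
    {
        if (search == null) throw new ArgumentNullException(nameof(search));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return search();
            }
            catch (IndexOutOfRangeException) when (attempt < maxAttempts)
            {
                // Swallow and retry. On the final attempt the filter is false,
                // so the exception propagates to the caller.
            }
        }
    }

    public static void Main()
    {
        // Simulated flaky search: fails on the first two calls, then succeeds.
        int calls = 0;
        string result = WithRetry(() =>
        {
            calls++;
            if (calls < 3) throw new IndexOutOfRangeException();
            return "hits";
        });
        Console.WriteLine(result + " after " + calls + " calls");
    }
}
```

With a real searcher, the call site would be something like `SearchRetry.WithRetry(() => searcher.Search(query, 10))`.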
Okay, I have fixed some culture sensitivity issues with the analyzers that could be leading to this. Could someone please check the packages in the nuget artifact here to see whether the |
I installed the NuGet artifact locally and the error seems to occur less often than before, but I'm still able to reproduce it:
With the added retry functionality, I wasn't able to produce the errors two times in a row using the same FuzzyQuery. So it seems to be nearly fixed, with only a small bug remaining |
Thanks for the info. I suspect this is a different issue than the one from the OP. Can you tell me which version the error first appeared in? There have been some recent changes to both |
We just recently implemented the FuzzyQuery on 4.8.0-beta00008 and updated to 4.8.0-beta00011, so the error could have happened in an earlier version. I will try a downgrade to an older version and check if the error still occurs and get back to you. |
I'm unable to reproduce the bug on 4.8.0-beta00006, I will try 4.8.0-beta00007 next. Hope this helps. I was also unable to reproduce on 4.8.0-beta00007. |
Sorry been off for a few days, I'm unfortunately not able to repro it on my side so I can't really test it out |
@mlaufer - Since you are able to reliably reproduce this, is it possible you can submit a PR with a test that fails (no matter how rarely) with this problem? |
If it helps any: |
Could someone please provide a minimal example I can run? Even with the descriptions here, there is not enough info to piece together both the code and the data to reproduce this without research and trial and error. There is probably a test that is similar enough to what you are doing in Also, what platform is this happening on and is this x86 or x64? Note there are now 8 known failing tests on .NET Framework under x86 in 4.8.0-beta00011 and prior, several of which relate to 4.8.0-beta00012 can be downloaded at https://dist.apache.org/repos/dist/dev/lucenenet/ (it is currently pending the release vote, which takes 72 hours). Could someone please confirm this problem still exists on 4.8.0-beta00012? |
Platform = x64.
Analyzer: KeywordAnalyzer
Field Definitions:
Query: indexSearcher.Search(query, 10)
NOTE: If I name the "reversed" columns "_Reversed" + Name, the issue goes away. I apologize, I don't have the data to reproduce the exception any longer, as I rebuilt the index with a different name for the reversed columns, and the issue seems to have gone away, and to rebuild with the field names that were problematic takes a long time... |
I am able to reliably reproduce with one of my datasets but I'm not sure if I could write a test to fail. I'm running on .NET Core/x64 with 4.8.0-beta00012. Similar stack trace to everyone after OP:
Using this analyzer (I'm just starting to come up to speed with Lucene so I'm not sure the arrangement of filters actually makes any sense):
public class NGramAnalyzer : Analyzer
{
    private readonly LuceneVersion version;
    private readonly int minGram;
    private readonly int maxGram;

    public NGramAnalyzer(LuceneVersion version, int minGram = 2, int maxGram = 8)
    {
        this.version = version;
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    /// <inheritdoc />
    protected override TextReader InitReader(string fieldName, TextReader reader)
    {
        var charMap = new NormalizeCharMap.Builder();
        charMap.Add("_", " ");
        return new MappingCharFilter(charMap.Build(), reader);
    }

    /// <inheritdoc />
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Splits words at punctuation characters, removing punctuation.
        // Splits words at hyphens, unless there's a number in the token...
        // Recognizes email addresses and internet hostnames as one token.
        var tokenizer = new StandardTokenizer(version, reader);
        TokenStream filter = new StandardFilter(version, tokenizer);
        // Normalizes token text to lower case.
        filter = new LowerCaseFilter(version, filter);
        // Removes stop words from a token stream.
        filter = new StopFilter(version, filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        filter = new EnglishMinimalStemFilter(filter);
        filter = new NGramTokenFilter(version, filter, minGram, maxGram);
        return new TokenStreamComponents(tokenizer, filter);
    }
}
Setup is then:
var indexStore = new RAMDirectory();
var indexConfig = new IndexWriterConfig(Version, Analyzer);
indexWriter = new IndexWriter(indexStore, indexConfig);
initialIndexingTask = Task.Run(() =>
{
    var stopwatch = Stopwatch.StartNew();
    indexWriter.AddDocuments(collection.Select(GetAndSubscribeToDocument));
    indexWriter.Commit();
    Debug.WriteLine(@$"{typeof(TDocument)} Indexing: {stopwatch.ElapsedMilliseconds}ms");
});
Searching after initial indexing is complete is done with:
using var reader = DirectoryReader.Open(indexWriter.Directory);
var searcher = new IndexSearcher(reader);
Query? parsedQuery;
try
{
    var queryParser = new MultiFieldQueryParser(Version, DefaultSearchFields, Analyzer);
    var terms = new HashSet<Term>();
    queryParser.Parse(query).Rewrite(reader).ExtractTerms(terms);
    var boolQuery = new BooleanQuery();
    terms.ForEach(t =>
    {
        boolQuery.Add(new FuzzyQuery(t), Occur.SHOULD);
        boolQuery.Add(new WildcardQuery(t), Occur.SHOULD);
    });
    parsedQuery = boolQuery;
}
catch (Exception)
{
    // TODO: User feedback
    return new (TDocument doc, float score)[0];
}
var hits = searcher.Search(parsedQuery, resultLimit);
I've archived off the dataset and code so that I can hopefully go back and gather more data to help troubleshoot. It's worth noting that in my current repro case, I have 4 separate instances of this (RAMDirectory, IndexWriter, and Reader+Searcher) all running at the same time (and with nearly identical datasets). A quick look through the code up and down the stack trace didn't show me anything in Lucene that was obviously shared between those instances that could be the culprit. |
Thanks for the info. If a test is too much to ask, could you distill this down to a console app using the failing data set and put it in a repo to share? If the data is sensitive, do note that both Azure DevOps and BitBucket allow you to create free private repos that you can then share by invitation. Just use the email address in my GitHub profile. |
Yeah, I should be able to get that to you by the end of the week. Thanks for the prompt response! |
@NightOwl888 Repo is posted and I just invited you to it. The console app prompts you to enter a query. The suggested query provided in the prompt fails nearly every time for me. Thanks again! |
In my case, when doing a fuzzy search using 4.8.0-beta00012 under load testing, I get the IndexOutOfRangeException with the following stack trace:
at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32)
I'm running a thousand requests ramped up over 60 seconds, and I get an error rate of about 20 to 30 percent. After adding a retry whenever I catch this exception, the error rate dropped to 1 to 2 percent. (I only count requests that still failed after 3 attempts, not the errors within the retries.) Interestingly, when I removed the fuzzy search, I was able to do 1000 successful requests with no issues. |
…lity checking, as implementing Equals() to compare other than reference equality causes IndexOutOfRangeException to randomly occur when using FuzzyTermsEnum. Fixes #296.
Thanks for submitting a working failure case. I was able to use it to create a test project that contained a failing test. From there, I was able to confirm that the error was introduced between 4.8.0-beta00007 and 4.8.0-beta00008 and by using git's detached mode the issue was traced to commit 0eaf765. Unfortunately, I had to start all over again at that point, since it was a merge of 60 commits, but eventually I ended up here: e1ead06. It turned out to be a simple misinterpretation that |
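The commit message for this fix says that implementing Equals() as anything other than reference equality caused the exception. The class involved is not shown here, so the following is only an illustrative sketch with a hypothetical `ValueNode` type. It shows how a value-based Equals() override makes a hash-based collection merge objects that are distinct instances, while a reference-equality comparer keeps them apart.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical type (not a Lucene.NET class) whose Equals() compares by value.
public sealed class ValueNode
{
    public int Label { get; }
    public ValueNode(int label) { Label = label; }

    public override bool Equals(object obj) => obj is ValueNode other && other.Label == Label;
    public override int GetHashCode() => Label;
}

// Comparer that ignores any Equals() override and compares object identity.
public sealed class ReferenceComparer<T> : IEqualityComparer<T> where T : class
{
    public bool Equals(T x, T y) => ReferenceEquals(x, y);
    public int GetHashCode(T obj) => RuntimeHelpers.GetHashCode(obj);
}

public static class EqualityDemo
{
    // Counts how many entries survive in a HashSet under the given comparer.
    public static int CountDistinct(IEnumerable<ValueNode> nodes, IEqualityComparer<ValueNode> comparer)
    {
        return new HashSet<ValueNode>(nodes, comparer).Count;
    }

    public static void Main()
    {
        // Two separate instances that happen to carry the same label.
        var nodes = new[] { new ValueNode(7), new ValueNode(7) };

        // Value equality: the set treats both instances as the same element.
        Console.WriteLine(CountDistinct(nodes, EqualityComparer<ValueNode>.Default));

        // Reference equality: both instances are kept.
        Console.WriteLine(CountDistinct(nodes, new ReferenceComparer<ValueNode>()));
    }
}
```

Code ported from Java that relies on identity semantics can silently lose entries this way, which is one plausible route to the kind of intermittent failure reported in this thread.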
BTW - if anyone wants to try out these changes before they are rolled into a release to confirm it is a complete fix, the NuGet packages can be downloaded from the |
Hello, I'm getting an IndexOutOfRangeException when searching in some cases. It's happening maybe 10% of the time so I'm unsure what is causing this. See below for the search code and stack trace. I feel like it might be something with the query but it's more of a guess. Any guidance on this would be very much appreciated.
var sort = new Sort(new SortField(null, SortFieldType.DOC));
return _searcher.Search(Query, _reader.NumDocs, sort);
FATAL Update Data Set System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Lucene.Net.Store.ByteArrayDataInput.ReadVInt32()
at Lucene.Net.Codecs.BlockTreeTermsReader.FieldReader.IntersectEnum.Frame.NextLeaf()
at Lucene.Net.Codecs.BlockTreeTermsReader.FieldReader.IntersectEnum.Next()
at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
at Lucene.Net.Search.ConstantScoreAutoRewrite.Rewrite(IndexReader reader, MultiTermQuery query)
at Lucene.Net.Join.ToParentBlockJoinQuery.Rewrite(IndexReader reader)
at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n, Sort sort)