-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore fsync behavior in FSDirectory via P/Invoke, #933 #938
Conversation
This restores the commented-out fsync behavior in FSDirectory to help mitigate a performance regression in .NET 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. This looks great. There are a few rough edges to iron out, but the implementation of Fsync
looks pretty good.
However, in addition to my inline comments, we should remove the AwaitsFix
attribute from Lucene.Net.Index.TestIndexWriterOnJRECrash::TestNRTThreads_Mem()
. No doubt this PR should also close #894, as that directly tests this functionality. However, I noticed it was failing to produce the file to transfer the process ID back to the test that kills it when we added the .NET 8 tests, so there may be another problem there to investigate that .NET 8 broke.
After those changes are implemented, I will run this in Azure DevOps to see if we can get it to pass consistently on 60 or so runs.
…oved AwaitsFix attribute. The FSync implementation should fix this test.
I am going to do a check with the TestIndexWriterOnJRECrash test to try to confirm this is a fix for it. There is an issue with the test itself (something about the system property with the temp file name not being passed through) that I am working on a fix for now. |
Looks like it still fails on Windows sometimes. I think it is more robust than it was, though. I had to try with TestNRTThreads_Mem
Source: TestIndexWriterOnJRECrash.cs line 51
Duration: 34.3 sec
Message:
Lucene.Net.Util.LuceneSystemException : CheckIndex failed
To reproduce this test result:
Option 1:
Apply the following assembly-level attributes:
[assembly: Lucene.Net.Util.RandomSeed("0x58551ebf9b5e10f6")]
[assembly: NUnit.Framework.SetCulture("en-AG")]
Option 2:
Use the following .runsettings file:
<RunSettings>
<TestRunParameters>
<Parameter name="tests:seed" value="0x58551ebf9b5e10f6" />
<Parameter name="tests:culture" value="en-AG" />
</TestRunParameters>
</RunSettings>
See the .runsettings documentation at: https://docs.microsoft.com/en-us/visualstudio/test/configure-unit-tests-by-using-a-dot-runsettings-file.
Stack Trace:
TestUtil.CheckIndex(Directory dir, Boolean crossCheckTermVectors) line 177
TestUtil.CheckIndex(Directory dir) line 162
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 322
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.TestNRTThreads_Mem() line 88
InvokeStub_TestIndexWriterOnJRECrash.TestNRTThreads_Mem(Object, Object, IntPtr*)
MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
Standard Output:
CheckIndex failed
Segments file=segments_2 numSegments=3 version=4.8 format=
1 of 3: name=_9 docCount=162
codec=Lucene46
compound=True
numFiles=4
size (MB)=0.34093189239501953
diagnostics = {source=merge, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:05 am, mergeMaxNumSegments=-1, mergeFactor=5}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [32 deleted docs]
test: fields..............OK [26 fields]
test: field norms.........OK [21 fields]
test: terms, freq, prox...OK [6532 terms; 9562 terms/docs pairs; 12115 tokens]
test (ignoring deletes): terms, freq, prox...OK [7799 terms; 11823 terms/docs pairs; 14862 tokens]
test: stored fields.......OK [660 total field count; avg 5.076923 fields per doc]
test: term vectors........OK [265 total vector count; avg 2.0384614 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
2 of 3: name=_8 docCount=56
codec=Lucene46
compound=False
numFiles=33
size (MB)=0.18136310577392578
diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [2 deleted docs]
test: fields..............OK [13 fields]
test: field norms.........OK [8 fields]
test: terms, freq, prox...OK [3962 terms; 4916 terms/docs pairs; 7363 tokens]
test (ignoring deletes): terms, freq, prox...OK [4035 terms; 5024 terms/docs pairs; 7480 tokens]
test: stored fields.......OK [282 total field count; avg 5.2222223 fields per doc]
test: term vectors........OK [112 total vector count; avg 2.074074 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
3 of 3: name=_7 docCount=44
codec=Lucene46
compound=False
numFiles=31
size (MB)=0.5951318740844727
diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [1 deleted docs]
test: fields..............OK [18 fields]
test: field norms.........OK [13 fields]
test: terms, freq, prox...ERROR: Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
test: stored fields.......OK [221 total field count; avg 5.139535 fields per doc]
test: term vectors........ERROR [Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.]
Lucene.Net.Diagnostics.AssertionException: Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.
at Lucene.Net.Codecs.IntBlock.FixedInt32BlockIndexInput.Index.Read(DataInput indexIn, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\FixedIntBlockIndexInput.cs:line 177
at Lucene.Net.Codecs.Sep.SepPostingsReader.DecodeTerm(Int64[] empty, DataInput input, FieldInfo fieldInfo, BlockTermState termState, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 202
at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DecodeMetaData() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 986
at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse, DocsAndPositionsFlags flags) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 798
at Lucene.Net.Index.TermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse) in F:\Projects\lucenenet\src\Lucene.Net\Index\TermsEnum.cs:line 242
at Lucene.Net.Index.CheckIndex.TestTermVectors(AtomicReader reader, TextWriter infoStream, Boolean verbose, Boolean crossCheckTermVectors) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 2146
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
Lucene.Net.Util.LuceneSystemException: Term Index test failed
at Lucene.Net.Index.CheckIndex.DoCheckIndex(IList`1 onlySegments) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 848
WARNING: 1 broken segments (containing 43 documents) detected |
…ed [Repeat(25)] attribute.
…ile to pass the process ID to kill back to the original test process, open a socket and listen for the process ID to be written.
Well, the good news is that this seems to be 100% stable on .NET Framework. The bad news is that it is not performing as well on .NET Core. I attempted to swap the
This is still a huge improvement over 4.8.0-beta00016, which would get a corrupt index about once every ~30 power failures. Here are the results of the runs with the swapped hashset implementation with https://dev.azure.com/shad0962/Experiments/_build/results?buildId=2343&view=results Note that I left the commit with the I also improved the |
Very nice! Do you think that maybe the thread-unsafety of FSDirectory (see my comment above about the race conditions) could be causing that? And maybe if .NET 5+ performs better than .NET FX, that is exacerbating it? I'll look at improving the thread safety here next and see if that can have better results. Because one very real possibility in this code (and AFAICT Java Lucene 4.8 too) under heavy load is that an fsync/FlushFileBuffers gets missed because of thread races. |
Now that you mention it, it is possible that is the cause, (or a contributor) since we could be Fsyncing a partially written buffer to disk. While building a production quality implementation may take some time, if we locked all access to write, flush, and sync on all files, it would make testing the scenario pretty simple. I will set that up and run the check. |
This is necessary to prevent race conditions, even though this code is not in the upstream Java code. A thread could try to add an item to the collection after it has been synced in `Sync` but before it is removed from the collection, then the file is removed from the collection, resulting in a missed sync.
@NightOwl888 I just pushed up a change that synchronizes access to the HashSet, let me know what you think. |
The performance is even better with this new approach. Of course some heavily multi-threaded writes/syncs might see a hit, but at least it's more thread-safe now.
|
…nd refactored to ensure the TcpListener and Process are cleaned up at the end of each test iteration. This makes it run ~20% faster.
@paulirwin @NightOwl888 Is this moving forward? Do we need to wait for this to get merged in order to use .NET 8 as per #933? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! This is good to merge.
This restores the commented-out fsync behavior in FSDirectory to help mitigate a performance regression in .NET 8.
Fixes #933
Description
See 933 for more details. Perf benchmark before:
After: