Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore fsync behavior in FSDirectory via P/Invoke, #933 #938

Merged
merged 17 commits into from
Aug 12, 2024

Conversation

paulirwin
Copy link
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable. (Existing unit tests)
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

This restores the commented-out fsync behavior in FSDirectory to help mitigate a performance regression in .NET 8.

Fixes #933

Description

See 933 for more details. Perf benchmark before:

| Method                  | Job      | Runtime  | IterationCount | LaunchCount | WarmupCount | N  | Mean      | Error      | StdDev   | Median    | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |----------:|-----------:|---------:|----------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default        | Default     | Default     | 10 |  44.11 ms |   3.756 ms | 11.02 ms |  40.34 ms |  1.00 |    0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default        | Default     | Default     | 10 | 843.36 ms |  16.485 ms | 24.16 ms | 843.93 ms | 19.11 |    4.86 |

After:

| Method                  | Job      | Runtime  | IterationCount | LaunchCount | WarmupCount | N  | Mean     | Error      | StdDev    | Median   | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |---------:|-----------:|----------:|---------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default        | Default     | Default     | 10 | 50.15 ms |   5.044 ms | 14.874 ms | 54.12 ms |  1.00 |    0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default        | Default     | Default     | 10 | 42.88 ms |   4.736 ms | 13.964 ms | 38.17 ms |  0.95 |    0.50 |

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. This looks great. There are a few rough edges to iron out, but the implementation of Fsync looks pretty good.

However, in addition to my inline comments, we should remove the AwaitsFix attribute from Lucene.Net.Index.TestIndexWriterOnJRECrash::TestNRTThreads_Mem(). No doubt this PR should also close #894, as that directly tests this functionality. However, I noticed it was failing to produce the file to transfer the process ID back to the test that kills it when we added the .NET 8 tests, so there may be another problem there to investigate that .NET 8 broke.

After those changes are implemented, I will run this in Azure DevOps to see if we can get it to pass consistently on 60 or so runs.

src/Lucene.Net/Util/IOUtils.cs Outdated Show resolved Hide resolved
src/Lucene.Net/Util/IOUtils.cs Outdated Show resolved Hide resolved
src/Lucene.Net/Util/IOUtils.cs Outdated Show resolved Hide resolved
src/Lucene.Net/Support/ConcurrentHashSet.cs Outdated Show resolved Hide resolved
src/Lucene.Net/Support/ConcurrentHashSet.cs Outdated Show resolved Hide resolved
@NightOwl888
Copy link
Contributor

I am going to do a check with the TestIndexWriterOnJRECrash test to try to confirm this is a fix for it. There is an issue with the test itself (something about the system property with the temp file name not being passed through) that I am working on a fix for now.

@NightOwl888
Copy link
Contributor

Looks like it still fails on Windows sometimes. I think it is more robust than it was, though. I had to try with [Repeat(25)] to get it to fail. I will try in Azure DevOps with the attribute to see whether it will fail on Linux or macOS.

 TestNRTThreads_Mem
   Source: TestIndexWriterOnJRECrash.cs line 51
   Duration: 34.3 sec

  Message: 
Lucene.Net.Util.LuceneSystemException : CheckIndex failed

To reproduce this test result:

Option 1:

 Apply the following assembly-level attributes:

[assembly: Lucene.Net.Util.RandomSeed("0x58551ebf9b5e10f6")]
[assembly: NUnit.Framework.SetCulture("en-AG")]

Option 2:

 Use the following .runsettings file:

<RunSettings>
  <TestRunParameters>
    <Parameter name="tests:seed" value="0x58551ebf9b5e10f6" />
    <Parameter name="tests:culture" value="en-AG" />
  </TestRunParameters>
</RunSettings>

See the .runsettings documentation at: https://docs.microsoft.com/en-us/visualstudio/test/configure-unit-tests-by-using-a-dot-runsettings-file.

  Stack Trace: 
TestUtil.CheckIndex(Directory dir, Boolean crossCheckTermVectors) line 177
TestUtil.CheckIndex(Directory dir) line 162
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 322
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.TestNRTThreads_Mem() line 88
InvokeStub_TestIndexWriterOnJRECrash.TestNRTThreads_Mem(Object, Object, IntPtr*)
MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

  Standard Output: 
CheckIndex failed
Segments file=segments_2 numSegments=3 version=4.8 format=
  1 of 3: name=_9 docCount=162
    codec=Lucene46
    compound=True
    numFiles=4
    size (MB)=0.34093189239501953
    diagnostics = {source=merge, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:05 am, mergeMaxNumSegments=-1, mergeFactor=5}
    has deletions [delGen=1]
    test: open reader.........OK
    test: check integrity.....OK
    test: check live docs.....OK [32 deleted docs]
    test: fields..............OK [26 fields]
    test: field norms.........OK [21 fields]
    test: terms, freq, prox...OK [6532 terms; 9562 terms/docs pairs; 12115 tokens]
    test (ignoring deletes): terms, freq, prox...OK [7799 terms; 11823 terms/docs pairs; 14862 tokens]
    test: stored fields.......OK [660 total field count; avg 5.076923 fields per doc]
    test: term vectors........OK [265 total vector count; avg 2.0384614 term/freq vector fields per doc]
    test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]

  2 of 3: name=_8 docCount=56
    codec=Lucene46
    compound=False
    numFiles=33
    size (MB)=0.18136310577392578
    diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
    has deletions [delGen=1]
    test: open reader.........OK
    test: check integrity.....OK
    test: check live docs.....OK [2 deleted docs]
    test: fields..............OK [13 fields]
    test: field norms.........OK [8 fields]
    test: terms, freq, prox...OK [3962 terms; 4916 terms/docs pairs; 7363 tokens]
    test (ignoring deletes): terms, freq, prox...OK [4035 terms; 5024 terms/docs pairs; 7480 tokens]
    test: stored fields.......OK [282 total field count; avg 5.2222223 fields per doc]
    test: term vectors........OK [112 total vector count; avg 2.074074 term/freq vector fields per doc]
    test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]

  3 of 3: name=_7 docCount=44
    codec=Lucene46
    compound=False
    numFiles=31
    size (MB)=0.5951318740844727
    diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
    has deletions [delGen=1]
    test: open reader.........OK
    test: check integrity.....OK
    test: check live docs.....OK [1 deleted docs]
    test: fields..............OK [18 fields]
    test: field norms.........OK [13 fields]
    test: terms, freq, prox...ERROR: Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
   at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
   at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
   at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
   at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
   at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
   at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
   at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
   at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
   at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
   at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
   at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
   at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
    test: stored fields.......OK [221 total field count; avg 5.139535 fields per doc]
    test: term vectors........ERROR [Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.]
Lucene.Net.Diagnostics.AssertionException: Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.
   at Lucene.Net.Codecs.IntBlock.FixedInt32BlockIndexInput.Index.Read(DataInput indexIn, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\FixedIntBlockIndexInput.cs:line 177
   at Lucene.Net.Codecs.Sep.SepPostingsReader.DecodeTerm(Int64[] empty, DataInput input, FieldInfo fieldInfo, BlockTermState termState, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 202
   at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DecodeMetaData() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 986
   at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse, DocsAndPositionsFlags flags) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 798
   at Lucene.Net.Index.TermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse) in F:\Projects\lucenenet\src\Lucene.Net\Index\TermsEnum.cs:line 242
   at Lucene.Net.Index.CheckIndex.TestTermVectors(AtomicReader reader, TextWriter infoStream, Boolean verbose, Boolean crossCheckTermVectors) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 2146
    test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
Lucene.Net.Util.LuceneSystemException: Term Index test failed
   at Lucene.Net.Index.CheckIndex.DoCheckIndex(IList`1 onlySegments) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 848

WARNING: 1 broken segments (containing 43 documents) detected

@NightOwl888
Copy link
Contributor

Well, the good news is that this seems to be 100% stable on .NET Framework.

The bad news is that it is not performing as well on .NET Core. I attempted to swap the m_staleFiles set with new JCG.HashSet<T>().AsConcurrent() and it seemed to make Windows worse, but Linux and macOS better (NOTE: I rolled that change back). They still all randomly fail, but it is on the order of:

  • Windows: Once every ~150 power failures
  • Linux: Once every ~350 power failures
  • macOS: Once every ~500 power failures

This is still a huge improvement over 4.8.0-beta00016, which would get a corrupt index about once every ~30 power failures.

Here are the results of the runs with the swapped hashset implementation with [Repeat(25)] (which runs the test 250 times, since the original test runs it 10 times):

https://dev.azure.com/shad0962/Experiments/_build/results?buildId=2343&view=results
https://dev.azure.com/LuceneNET-Temp/Lucene.NET/_build/results?buildId=2257&view=results
https://dev.azure.com/lucene-net-temp2/Lucene.NET/_build/results?buildId=1014&view=results
https://dev.azure.com/lucene-net-temp3/Lucene.NET/_build/results?buildId=850&view=results
https://dev.azure.com/lucene-net-temp4/Main/_build/results?buildId=273&view=results

Note that I left the commit with the [Repeat(25)] attribute on this branch. It should be reverted before merging to master to restore the original behavior.


I also improved the TestIndexWriterOnJRECrash test by using a socket instead of a temp file to transfer the process id to kill from the forked test to back to the original process. The original Java test crashed the JRE from the inside using a low level function, but .NET doesn't allow killing the current process from the inside so we needed 2-way communication between the 2 test processes.

@paulirwin
Copy link
Contributor Author

Very nice! Do you think that maybe the thread-unsafety of FSDirectory (see my comment above about the race conditions) could be causing that? And maybe if .NET 5+ performs better than .NET FX, that is exacerbating it? I'll look at improving the thread safety here next and see if that can have better results. Because one very real possibility in this code (and AFAICT Java Lucene 4.8 too) under heavy load is that an fsync/FlushFileBuffers gets missed because of thread races.

@NightOwl888
Copy link
Contributor

Very nice! Do you think that maybe the thread-unsafety of FSDirectory (see my comment above about the race conditions) could be causing that? And maybe if .NET 5+ performs better than .NET FX, that is exacerbating it? I'll look at improving the thread safety here next and see if that can have better results. Because one very real possibility in this code (and AFAICT Java Lucene 4.8 too) under heavy load is that an fsync/FlushFileBuffers gets missed because of thread races.

Now that you mention it, it is possible that is the cause, (or a contributor) since we could be Fsyncing a partially written buffer to disk. While building a production quality implementation may take some time, if we locked all access to write, flush, and sync on all files, it would make testing the scenario pretty simple. I will set that up and run the check.

paulirwin added 2 commits May 15, 2024 20:21
This is necessary to prevent race conditions, even though this code is
not in the upstream Java code. A thread could try to add an item to the
collection after it has been synced in `Sync` but before it is removed
from the collection, then the file is removed from the collection,
resulting in a missed sync.
@paulirwin
Copy link
Contributor Author

@NightOwl888 I just pushed up a change that synchronizes access to the HashSet, let me know what you think.

@paulirwin
Copy link
Contributor Author

The performance is even better with this new approach. Of course some heavily multi-threaded writes/syncs might see a hit, but at least it's more thread-safe now.

| Method                  | Job      | Runtime  | IterationCount | LaunchCount | WarmupCount | N  | Mean     | Error     | StdDev   | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |---------:|----------:|---------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default        | Default     | Default     | 10 | 32.06 ms |  0.771 ms | 2.135 ms |  1.00 |    0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default        | Default     | Default     | 10 | 31.57 ms |  0.781 ms | 2.189 ms |  0.99 |    0.09 |

…nd refactored to ensure the TcpListener and Process are cleaned up at the end of each test iteration. This makes it run ~20% faster.
@turowicz
Copy link

turowicz commented Jul 24, 2024

@paulirwin @NightOwl888 Is this moving forward? Do we need to wait for this to get merged in order to use .NET 8 as per #933?

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! This is good to merge.

@paulirwin paulirwin merged commit 0cf7218 into apache:master Aug 12, 2024
199 checks passed
@paulirwin paulirwin deleted the issue/933 branch August 12, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance decrease 30x when running on .NET 8
3 participants