Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate Failing Test: Lucene.Net.Index.TestIndexWriterOnJRECrash::TestNRTThreads_Mem() #894

Open
1 task done
NightOwl888 opened this issue Jan 14, 2024 · 5 comments
Labels
is:bug is:task A chore to be done pri:high test-failure up-for-grabs This issue is open to be worked on by anyone
Milestone

Comments

@NightOwl888
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Task description

This test was ported and added in #786 (to close #768).

Unfortunately, it fails intermittently. A user shared in Lucene.net corrupted index (segments.gen) on StackOverflow that this appears to be due to a real problem that happens in production.

The failure may or may not be related to the deprecation of thread interrupts (#555) in Lucene.NET which were supported in the Java version. We make a best effort to support them using UninterrruptableMonitor instead of lock statements, but since a lock may be taken in any library we depend on that may throw ThreadInterruptedException from the action of taking a lock, we cannot 100% guarantee that we can catch every one of these exceptions in order to rollback an in-process commit. In Java, taking a lock does not throw an exception in any case.

@NightOwl888 NightOwl888 added up-for-grabs This issue is open to be worked on by anyone is:bug test-failure pri:normal is:task A chore to be done labels Jan 14, 2024
NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Jan 14, 2024
…ed [AwaitsFix] attribute because this test fails intermittently (see apache#894).
NightOwl888 added a commit that referenced this issue Jan 14, 2024
…ed [AwaitsFix] attribute because this test fails intermittently (see #894).
@NightOwl888 NightOwl888 added this to the 4.8.0 milestone Oct 24, 2024
@NightOwl888
Copy link
Contributor Author

I am raising priority on this and adding it as a production blocker because we are missing durability (ACID). A commit done simultaneously with a power outage will result in a corrupt index about 1 time out of 200. This only occurs on .NET Core, not on .NET Framework.

We looked at this a bit already. Now that we have completed fsync support, we expected this to automatically work. Since it doesn't fail on .NET Framework, it seems probable there is something going on at a very deep level inside of .NET Core where the behavior has changed.

@RockNHawk
Copy link

RockNHawk commented Oct 30, 2024

If this is only happening on .NET Core and not .NET Framework, and you talked about #555, then the key may be that Thread.Abort() was removed and the . NET Core has changed behavior.
This means that the application would have to change its design to use CancellationToken to cancel the operation and we could either catch OperactionCancelledException or use another lock that supports CancellationToken instead of Monitor.

With CancellationToken support, the application can shut down gracefully and all threads can receive the CancellationToken notification and stop working gracefully.
However, if the application quits unexpectedly (Taskmgr.exe forces a shutdown or unplugs), CancellationToken can do nothing.

So, if this is the case:

We're using Lucene.NET library (4.8.0-beta000015) for our search engine. Our server was shut down unexpectedly and ever since then we're getting this error

the lock does not cause database corruption.

@RockNHawk
Copy link

RockNHawk commented Oct 30, 2024

We looked at this a bit already. Now that we have completed fsync support, we expected this to automatically work. Since it doesn't fail on .NET Framework, it seems probable there is something going on at a very deep level inside of .NET Core where the behavior has changed.

In this case, could we use an OS-level disk monitoring and analysis tool to document the differences between . NET Framework and .NET Core, then try to modify the code to make behavior consistently?

@NightOwl888
Copy link
Contributor Author

Thanks for putting some thought into this. Indeed these sound like 2 good places to start.

I looked at ThreadAbortException a bit. We don't catch it directly, but it subclasses SystemException. This corresponds roughly to RuntimeException in Java. There are only a few cases where we use the .IsRuntimeException() extension method in the core assembly, and in all of those cases we are rethrowing the exception. So this does not seem likely to be the problem.

I checked with ChatGPT to get some high-level analysis about how Lucene 4.8.0 manages transactional behavior:

Apache Lucene 4.8.0 uses a combination of write-ahead logging and checkpointing, implemented through its own data structures and mechanisms rather than traditional database methods, to maintain consistency and ensure durability in the event of a crash.

Here’s how Lucene 4.8.0 specifically achieves this:

Segment-Based Architecture: Lucene writes data in immutable segments, which helps it maintain consistency. Each time data is added, modified, or deleted, Lucene creates a new segment rather than modifying existing ones. This approach minimizes the risk of data corruption since it doesn’t overwrite existing data but rather appends new changes.

Commit Points: Lucene uses commit points to mark stable, consistent snapshots of the index. During indexing, changes are stored in memory and periodically flushed to disk as new segments. When a commit is requested, Lucene creates a commit point by writing a new segments_N file, which serves as a reference to the current state of the index. The segments_N file acts as a checkpoint that points to the set of segments that make up a consistent view of the index.

Write-Ahead Log (WAL) with Transaction Log File: For any changes made to the index, Lucene uses a transaction log file to record pending changes before they are committed. This file logs operations that are in progress, providing a record in case of a crash. Upon restart, Lucene can check this transaction log to replay any operations that were interrupted and restore the index to a consistent state.

Two-Phase Commit (Optional): While Lucene 4.8.0 does not use a full distributed two-phase commit protocol, it does have an optional two-phase commit feature for situations where multiple processes might be writing to the index concurrently (e.g., when using a directory like NFS). In such cases, this feature can help coordinate commits across multiple nodes to prevent partial writes.

Crash Recovery: If there’s a sudden failure, Lucene relies on the segments_N file to determine the last consistent commit point. Since Lucene only updates the segments_N file after writing all necessary data to disk, the system can revert to the last valid snapshot by loading the most recent segments_N file.

Through these approaches, Lucene 4.8.0 achieves consistency by committing data in a safe manner and relying on immutable segment files and a clear commit structure, which protect the index from corruption even in case of power loss or unexpected crashes.

I then went a little bit further to try to get some clues as to specifically what is failing. It gave us a few things we can experiment with to determine if they help to solve the problem.

We ported the transaction code in Lucene.NET 4.8.0. It is behaving correctly on .NET Framework (that is, the TestIndexWriterOnJRECrash test is passing). However, on .NET Core we sometimes get a failure during a simulated power outage. What difference between .NET Framework and .NET Core is likely to be causing this?

ChatGPT said:

If Lucene.NET 4.8.0's TestIndexWriterOnJRECrash test is failing on .NET Core but not on .NET Framework, this might stem from differences in file handling and caching behaviors between the two runtimes, especially regarding how they handle file system flushes and durability guarantees.

Here are some likely causes:

File System Flushing and Durability Guarantees:

.NET Framework and .NET Core differ in how they handle file system writes, particularly around ensuring data is flushed to disk. In .NET Framework, file writes may be more aggressive in flushing data to disk when using FileStream.Flush(true) (with true forcing a flush to disk). However, .NET Core on certain platforms, particularly on Linux, might treat flushes differently, not always ensuring data reaches persistent storage immediately.
Lucene relies on calling fsync (or Flush(true)) to ensure that all data is written to disk before committing the segments_N file. However, on .NET Core, especially on non-Windows platforms, there may be differences in how this flush is handled by the OS, potentially causing unflushed data to remain in memory when a crash is simulated.

File Locking Behavior Differences:

Lucene’s commit process relies on exclusive file locks to protect the segments_N file and other index files during updates. .NET Core and .NET Framework have some differences in file locking implementations and behaviors, especially when running on Linux. On Linux, .NET Core uses advisory locking rather than mandatory locking, which might not always prevent other processes or threads from accessing the file.

Platform-Specific File Caching:

.NET Core, especially on non-Windows platforms, may defer certain file operations to in-memory cache layers longer than .NET Framework does, potentially leading to timing issues in crash simulation tests. If the simulated crash occurs before the OS writes the data from cache to disk, the index could be left in an inconsistent state.

Differences in FileOptions.WriteThrough:

On .NET Core, FileOptions.WriteThrough can be specified for file operations to bypass OS caches and write directly to disk. However, this behavior is more consistent on Windows than on other platforms, where it might not have the same impact. Using FileOptions.WriteThrough could potentially improve durability on .NET Core by ensuring data reaches disk more reliably.

Potential Fixes

To improve consistency in this crash simulation test, try the following adjustments:

Explicit Flush and File Sync: Ensure that FileStream.Flush(true) is used for every critical file operation and commit phase in Lucene.NET on .NET Core, as this may improve data durability.

Use FileOptions.WriteThrough: If running on .NET Core and supported by the environment, open file streams with FileOptions.WriteThrough to minimize caching issues and enforce writes directly to disk.

Platform-Specific Configuration: If these changes do not stabilize the test, it may be necessary to adjust Lucene.NET’s file handling configuration based on platform detection (e.g., applying extra flushes or locks on .NET Core for Linux vs. Windows).
By accounting for these differences in file handling between .NET Framework and .NET Core, especially around data flushing and durability, you can improve the reliability of the simulated crash tests on both platforms.

FileStream.Flush(true) was the thing that was killing performance on .NET 8.0, which is why we put the fsync call out of band like it was in Java, so it isn't likely to be the cause. However, it wouldn't hurt to review when we are calling Flush() to make sure it is happening everywhere it should be.

FileOptions.WriteThrough seems like something we should investigate. In fact, it might not hurt to review all of the options that we are passing to FileStream instances, since this differs so greatly from Java.

Using OS-level disk monitoring might also help to narrow down the problem, but it does seem like it might be a shortcut to try these other suggestions first.

Would you like to give it a shot to see whether one of these suggestions fixes the test?

Wow - ChatGPT makes this analysis work so much easier!

@RockNHawk
Copy link

Yes, but not until I'm done with the tricky stuff at hand, I haven't touched code in almost 1 year, but happy to contribute any help to lucene.net as soon as I can on my end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is:bug is:task A chore to be done pri:high test-failure up-for-grabs This issue is open to be worked on by anyone
Projects
None yet
Development

No branches or pull requests

2 participants