-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poor multi-threaded indexing performance #935
Comments
I would say this is very much not expected behavior. Having your benchmark code would be very helpful! Thanks for the report. |
this sounds too much difference between java / c#. a test program would be great to analyze this. |
Concurrent document writing support is currently broken due to a bug mentioned in #886 (comment). No objections if someone wants to contribute a solution to that issue, but this isn't considered a blocker for the Lucene.NET 4.8.0 release. |
@NightOwl888 I remember this was an old bug relevant only to older dotnet core. |
@NightOwl888 If you apply the changes to DocumentsWriterFlushControl from that branch on top of the latest can you still reproduce it? I have no luck on 5.0, 6.0 or 7.0 - but asking because maybe I am doing something wrong >.< (I have not run the entire test suite yet, there is allot of test and I have problems compiling the IKVM stuff) @superkelvint Sharing the source code would certainly be relevant, Parallel.ForEach would not be my first choice here. |
Yes, I just got approval to do so, and will be providing that shortly. |
Here is the MultithreadedIndexer.cs file. I'm not a C# programmer so perhaps I am doing something silly that's slowing things down. |
Nope. This happens on all operating systems and .NET versions. I got it to fail on .NET 7 after that post. We cannot rule out that the problem lies in the test or the test framework, but it almost certainly is in our codebase. We have a configuration that never existed in Lucene. We are somewhere between 4.8.0 and 4.8.1 on the documents writer stuff. We should go through the JIRA issues in the Changes.txt to determine if we have all of the changes that are relevant to the documents writer. That may or may not be contributing to this.
IKVM can be tricky to get to compile the first time because it doesn't properly clean the build. I suggest closing the IDE and then doing the following:
If upon reopening you still have issues compiling, report the issue to ikvm-maven. The team is pretty responsive. On Azure DevOps, I was able to reproduce the test failure, but this time it took 60 runs and it only happened once: https://dev.azure.com/lucene-net-temp4/Main/_build/results?buildId=266&view=ms.vss-test-web.build-test-results-tab&runId=191370&resultId=100202&paneView=debug I was also able to get it to fail locally with the seed from #886. Using the I just put the attribute in the It also failed on the first run with the seed from the above Azure DevOps failure on [assembly: Lucene.Net.Util.RandomSeed("0x9f28125320b5e6c3")]
[assembly: NUnit.Framework.SetCulture("lrc-IR")] I also got it to fail on |
@jeme thoughts on the included code within MultithreadedIndexer.cs file? |
@pc-LMS I have actually not looked further into it as I have used the time I have for Lucene (here and there) to attempt to reproduce the problem linked by @NightOwl888 locally (as that would ease things hugely for correcting it) or just trying to unravel the code in an attempt to get a theoretical idea of what might be wrong, so far no luck on reproducing it though. |
@jeme - I am a bit perplexed why you cannot reproduce this. While it is not trivial to do, it also is not that hard and I have reproduced it on multiple machines. It is possible that if the problem is a timing issue, perhaps the particular machine that you are experimenting with is somehow perfectly synchronized to miss the error, too fast/slow, or the timing accuracy of the underlying OS is sufficient to where it doesn't happen. Have you attempted to reproduce on more than one machine and/or OS? |
@NightOwl888 No, I have not tried other machines or OS'es for now, I have tried running the tests under various .NET versions with the seeds provided, I would have to install VM's for trying other OS'es for now (or dust off my old 2012 Mac book PRO and boot that into OSX, again it would take some time). But I could perhaps find a few other machines to run it on, my machine is a bit dated (i9-7900X, 128GB Ram and M.2 SSD's). If you can reproduce it locally, what are you running on? (OS etc). Also, are you running all the test suites, just that project containing that test, just that suite (file) or can you reproduce it with just running that one test alone? Because if this is timing and mine just runs slower pr. thread (10 cores means lower clock speed after all + the IPC has increased a bit since then I guess). |
I am on Windows 10 x64 running with an I7-8850H, 2.60Ghz, 6 Cores, 12 Logical Processors, 32GB RAM, and M.2 SSD's. I am unable to reproduce the failure locally without the FYI - I did a new manual revert that is up to date with master here: https://github.com/NightOwl888/lucenenet/tree/fix/GH-935-documentswriter-concurrency |
Ok, I was actually in the thinking that this was because my CPU was a slow old one, but maybe it's the other way around. It passes just fine, takes a while with repeat:1000 obviously, but since it at max starts 6 threads I tried upping that seeing as it would never occupy all Cores(10)/Logical(20) on my system. Thinking that it could also be if the system was put under stress, even with a random between 32 to 64 threads, my system is just not stressed at all, but I still wonder if stressing things could make it easier to reproduce, so my plan is to do 2 things after work today. A) Move the code to my old MacBook (with windows) and run the tests there, that would run on a Core i7 i7 3820QM. Thanks for the updated source/branch. So status is that the test passes for me with:
(what you can reproduce it with) And it also passes with in addition to the above:
|
Interesting. I tried with I think I worked out what is happening, though. Issue 1The error indicates that Digging into the We cannot pass this line unless Therefore, to fail in this way, I checked and there are 4 calls to So, moving the Issue 2After that fix, we still got some failures. But it was also clear from the stack trace what was happening. If The documentation for ReentrantLock.tryLock() states:
So, it can return false, but I don't think it ever does in practice since a .NET has no such behavior, but I don't think it really matters which order the locks are taken in as long as This probably won't perform as well because the method is supposed to return immediately and may end up deadlocking, but so far it seems to perform better and never deadlock. I am running the tests now to see if the above changes work on all platforms, but locally this seems promising. |
…): The DocumentsWriterThreadPool.Reset() method must be called within the context of a lock because there is a possible race condition with other callers of DocumentsWriterThreadPool.Reset(). See apache#935.
…ntLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes apache#935. Closes apache#886.
I have just submitted #940 with a patch. Well, there were 3 different attempts. While identifying the problem wasn't too difficult, finding a solution for Issue 2 turned out to be quite challenging due to how performance sensitive of an area this is. If interested parties could run some benchmarks and/or review the PR to provide some feedback on how to improve this approach, suggestions would be welcomed. I have already attempted I also gave |
Not really as I actually tested on the various .net versions, other things came up as we had a break down in one of our systems due to a windows update, so I have sadly been busy elsewhere. But I am actually having another timing related bug in another codebase which I can easily reproduce at work with a single run of a test (always fails after ~max 2min, a run is 10min), but it's far harder to reproduce the same bug at home on the same PC as I am having trouble reproducing this issue (takes 2-3 runs). So I don't know. Running the Lucene tests at work has proven to be a challenge though, I don't know if it's because I use ReSharper that the Lucene solution seems to be very slugish for me to work with, or it's something else. But I always have challenges getting started working on anything. But great if you found a fix. |
This is super helpful. I'll work to reproduce the behavior you observed in Luncene.NET and then test to see PR #940 shows improved performance. I'm curious, for your StringFields, how many did the typical doc have, and what was the typical length of these fields? |
If it is difficult to reproduce, can try to remote and debug the bug using DNSpy on the user's computer. |
@superkelvint Which specific version of Java Lucene did you use in your comparison to Lucene.NET? |
@superkelvint Thank you for providing some code to show how you are using Lucene.NET. That helps a lot. Maybe we can work together to make sense out of what you are seeing. One challenge in creating any kind of benchmark is having enough data to index so that it takes some time for LuceneNET to complete. LuceneNET is very fast, so thousands of records of data are needed. I went on a hunt for an open-source data set that we could use. Ultimately I chose the Book Dataset from Kaggle.com, which is available under a CC0 public domain license. The dataset has approx. 271K records and each record contains several fields (Book Title, Author, URL to cover photo, etc.). To allow the program to load data in parallel, I created 21 copies of this data which is 5.4 million records total. Because sometimes it’s nice to be able to run the program more quickly when tweaking settings to see if they matter much, I also create 21 “small” data files, which contain about 50,000 records each for a total of 1.05 million records total. A single flag at the top of the program causes it to index big or small files. Either way, it’s indexing 21 files. Using this console app I tested the performance of Lucene.NET Beta 16, vs, the current master vs, PR#940 and I get more or less the same results from each. I tried different levels of parallelism to see how that affected performance and also tried different levels of I’m providing this full working .NET console app that uses .NET 8 including the data used and solution file and structure. So you can easily run this on your hardware and try different values for the number of threads used to read the data files, the Ideally run the console using the big data set, which is how I have the app configured by default. The app auto-deletes the LuceneIndex folder before each run. So you can run it multiple times in a row and you will be left with just the LuceneIndex from the last run in the LuceneIndex folder. I would love to hear back your Lucene.NET 4.8 results vs Java results and have you contribute back your java code port of this console app. I'd also like to understand exactly which version of Java Lucene you are benchmarking against. On my machine, Lucene.NET can index the 5.65 million book records in 47 seconds, which is a rate of approximately 119K records per second. That average time includes the final commit time. To my eyes that seems pretty darn fast. But if the Java version can really index these same documents much faster that would be great to know. |
) * Lucene.Net.Index.DocumentsWriterFlushControl: Reverted changes from 963e10c that made the documents writer non-concurrent (but threadsafe) * BUG: Lucene.Net.Index.DocumentsWriterFlushControl::AddFlushableState(): The DocumentsWriterThreadPool.Reset() method must be called within the context of a lock because there is a possible race condition with other callers of DocumentsWriterThreadPool.Reset(). See #935. * BUG: Lucene.Net.Support.Threading.ReentrantLock: In Java, the ReentrantLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes #935. Closes #886. * Lucene.Net.Index.TestRollingUpdates::TestUpdateSameDoc(): Added [Repeat(1000)] to try to reproduce on Azure DevOps (to be reverted). * Lucene.Net.Index.DocumentsWriterPerThreadPool: Removed MinContendedThreadState() method because it has no callers. This simplifies the design of ReentrantLock, since we don't need to artificially keep track of "queued threads". * Lucene.Net.Support: Added aggressive inlining for ReentrantLock and UninterruptableMonitor on cascaded methods. * run-tests-on-os.yml: Increase blame hang timeout so we can run a longer test (to be reverted) * Lucene.Net.Support.Threading.ReentrantLock: Use TryEnter() instead of Enter(). Enter() causes deadlocks in other tests, so need to localize this change to DocumentsWriterFlushControl. * Lucene.Net.Index.DocumentsWriterFlushControl::InternalTryCheckoutForFlush(): Use ReentrantLock.Lock() instead of ReentrantLock.TryLock() because Java relies on "barging" behavior instead of returning false when the current thread isn't next in the queue. We cannot do that, but we can wait for the lock to become available instead. * Lucene.Net.Support.Threading.ReeentrantLock(): Added an overload of TryLock() that accepts a timeout (TimeSpan) * Lucene.Net.Index.DocumentsWriterFlushControl: Use timeouts to allow some time for threads to reach the beginning of the wait queue. In Java, they are automatically put at the beginning of the queue, but since we cannot do that in .NET, we wait a little bit. * Lucene.Net.Index.DocumentsWriterFlushControl: Base the number of milliseconds to wait on whether the current process is 64 or 32 bit. * Lucene.Net.Index::DocumentsWriter: Added a constant TryLockTimeoutMilliseconds that is used by callers of ReentrantLock.TryLock() to set the default value. * Lucene.Net.Support: Added QueueExtensions class to polyfill the missing TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework * SWEEP: Lucene.Net.Index: Removed timeouts for ReentrantLock.TryLock(). * Lucene.Net.Support.Threading.ReentrantLock: Changed the implementation to use unfair locking similar to how it was done in Java. We track the queue and use ManualResetEventSlim to control entry into the lock for queued tasks. Ported some of the ReentrantLock tests from Apache Harmony. * Revert "Lucene.Net.Support: Added QueueExtensions class to polyfill the missing TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework" This reverts commit e5a65e9cd8bbf996fb599ff76e8d6b9f90babe4b. * Lucene.Net.csproj: Removed dependency on Microsoft.Extensions.ObjectPool * Lucene.Net.Support.Threading.ReentrantLock::Lock(): Use UninterruptableMonitor.TryEnter() instead of UninterruptableMonitor.Enter() so we can control what happens while the thread waits. We simply call Thread.Yield() to allow TryLock() to proceed before any waiting threads. Commented tests that depend on IsLocked because the property was removed. * Revert "run-tests-on-os.yml: Increase blame hang timeout so we can run a longer test (to be reverted)" This reverts commit b30e4abb576c8bfde3337a92de927a10747f88ae. * Revert "Lucene.Net.Index.TestRollingUpdates::TestUpdateSameDoc(): Added [Repeat(1000)] to try to reproduce on Azure DevOps (to be reverted)." This reverts commit d8fca410dafd1bf5529e8200034e1e2e5be83f07. * Lucene.Net.Support.Threading.ReentrantLockTest::TestUnlock_IllegalMonitorStateException() Removed unused exception variable and added a comment to indicate success so we don't have to suppress warnings * Lucene.Net.Support.Threading.UninterruptableMonitor: Elminated RetryEnter() recursive methods to avoid overflowing the stack and moved the logic into the catch blocks. Also added documentation.
Is there an existing issue for this?
Describe the bug
I wrote a multithreaded indexer in Java for Lucene 4.8 and one for Lucene.net 4.8. I tried to do pretty much a line-for-line port. However, there is a substantial difference in the performance characteristics of the two. In Java, I used a ThreadPoolExecutor with CallerRunsPolicy, and in C# I used Parallel.ForEach.
I can provide both Java and .NET files, but before doing so, I wanted to understand if this is expected behavior and therefore not a bug.
I am observing identical behavior in .NET 8 as well as .NET 6, so I don't think its related to #933
Expected Behavior
Multi-threaded indexing performance scales with the number of threads used.
Steps To Reproduce
No response
Exceptions (if any)
No response
Lucene.NET Version
4.8
.NET Version
8.0.204 and 6.0.29
Operating System
Windows 10
Anything else?
No response
The text was updated successfully, but these errors were encountered: