Benchmark custom #2345
base: master
Conversation
… better resource utilization
In general, I love this idea, to improve the concurrency of ReutersContentSource!

This source is very inefficient since it opens/closes a new file per document. This inefficiency is why we added the LineFileDocs representation instead, which uses a single file and one readLine() per document, and (in Lucene's nightly benchmarks) the chunked binary form of LineFileDocs to even further reduce single-thread contention in reading/creating documents to index.

In general when testing Lucene's indexing performance it is vital to make the source of the documents as absolutely trivial as possible, to create as pure an indexing performance test as we can.

I left some small code style feedback, and also I did not understand how this concurrency optimization is functionally correct, since you cannot ensure .getId() % N always distributes across all values 0 .. N?
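To illustrate the concern with a sketch (hypothetical class and method names, not the PR's code): JVM-assigned thread ids carry no guarantee of being contiguous, so `getId() % N` can leave some buckets in `0 .. N-1` unused. One way to guarantee dense coverage is to hand each worker thread a slot from an `AtomicInteger`, so that with N workers every slot `0 .. N-1` is used exactly once:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DenseThreadIndex {
  // Dense per-thread slot: 0, 1, 2, ... in first-use order, independent
  // of the JVM-assigned Thread.getId() values.
  private final AtomicInteger next = new AtomicInteger();
  private final ThreadLocal<Integer> slot =
      ThreadLocal.withInitial(next::getAndIncrement);

  // Starts n worker threads and returns the set of slots they observed.
  static Set<Integer> coveredSlots(int n) {
    DenseThreadIndex idx = new DenseThreadIndex();
    Set<Integer> seen = ConcurrentHashMap.newKeySet();
    Thread[] workers = new Thread[n];
    for (int i = 0; i < n; i++) {
      workers[i] = new Thread(() -> seen.add(idx.slot.get() % n));
      workers[i].start();
    }
    for (Thread w : workers) {
      try {
        w.join();
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    return seen;
  }
}
```

With 4 workers this covers every slot 0..3, which the raw `getId() % 4` scheme cannot promise.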
 *     name = f.toRealPath() + "_" + iteration;
 *   }
 */
if (!threadIndexCreated) {
if (threadIndexCreated == false) {
instead (to reduce the chance of accidental future refactoring bugs)? The ! form likely won't pass our code style checker (gradle precommit).
Sure, will do the required changes.
int inputFilesSize = inputFiles.size();

/*
 * synchronized (this) {
Just delete this old code? You are replacing it with a more concurrent version, yay!
Sure, will delete the commented-out code.
threadIndex[index]++;

// Sanity check: if # threads is greater than # input files, wrap index
if (index >= inputFilesSize) index %= inputFilesSize;
Can you move the index %= inputFilesSize to its own line, inside a { ... } body?
int index = (int) Thread.currentThread().getId() % threadIndex.length;
int fIndex = index + threadIndex[index] * threadIndex.length;
threadIndex[index]++;
I'm confused how this approach ensures that we will indeed index every document in the inputFiles? Thread.currentThread().getId() % threadIndex.length is not guaranteed to reach every possible int from 0 .. threadIndex.length?
Although getId() is controlled by the JVM, in our case all threadIndex entries are initialized at once, so there is a high chance of getting a contiguous sequence of thread ids, as we also observed. However, we understand your concern and tweaked our code so that it is guaranteed to reach every possible int from 0 .. threadIndex.length. We achieved this by setting a unique thread name and parsing it to calculate the index value.
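A minimal sketch of that naming scheme (the prefix and helper below are hypothetical, not the PR's exact code): give each worker thread a name built from a fixed prefix plus a dense counter, then parse the counter back out of Thread.currentThread().getName() to get a collision-free index.

```java
public class NamedThreadIndex {
  // Hypothetical naming convention applied when the worker threads are created:
  // "IndexThread-0", "IndexThread-1", ... using a dense counter.
  static final String PREFIX = "IndexThread-";

  // Recover the dense counter from the thread name; because the counters are
  // assigned 0..numThreads-1, the modulo covers every slot exactly once.
  static int indexFromName(String threadName, int numThreads) {
    int id = Integer.parseInt(threadName.substring(PREFIX.length()));
    return id % numThreads;
  }
}
```

Unlike getId(), which the JVM assigns with no contiguity guarantee, names assigned from a dense counter make the modulo mapping deterministic.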
@@ -146,4 +172,11 @@ public synchronized void resetInputs() throws IOException {
    nextFile = 0;
    iteration = 0;
  }

  private synchronized void createThreadIndex() {
    if (!threadIndexCreated) {
== false instead? Or maybe change to assert threadIndexCreated == false, since you also check this up above with a real if already?
Sure, will do the required changes.
Description
Lucene Benchmark Scaling Problem with Reuters Corpus
While indexing 1 million documents with reuters21578 (plain-text documents derived from the reuters21578 corpus), we observed that with a higher number of index threads the index throughput does not scale and instead degrades. The existing implementation's synchronized block allows only one thread to pick up a document/file from the list at any given time; this code is part of getNextDocData() in ReutersContentSource.java. With multiple index threads this becomes a thread-contention bottleneck and prevents the system's CPU resources from being used efficiently.
Solution
We developed a strategy to distribute the input files across the indexing threads so that the threads work independently and in parallel.
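The distribution strategy can be sketched as a strided partition (a simplified illustration under assumed names, not the PR's code): worker t of N owns files t, t+N, t+2N, ..., so every file belongs to exactly one thread and no synchronization is needed while picking the next file.

```java
import java.util.ArrayList;
import java.util.List;

public class StridedFilePartition {
  // File indices owned by worker `threadSlot` out of `numThreads`: a strided
  // partition of 0..numFiles-1, so the workers never contend for a file.
  static List<Integer> filesFor(int threadSlot, int numThreads, int numFiles) {
    List<Integer> mine = new ArrayList<>();
    for (int f = threadSlot; f < numFiles; f += numThreads) {
      mine.add(f);
    }
    return mine;
  }
}
```

For example, with 4 threads and 10 files, worker 1 owns files 1, 5, and 9, and the union over all four workers covers every file exactly once.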
Tests
We mainly modified the existing getNextDocData(), which does not alter functionality, hence we have not added any new test cases.
Checklist
Please review the following and check all that apply:
[ ] I have created a Jira issue and added the issue ID to my pull request title.
[ ] I have developed this patch against the master branch.
[ ] I have run ./gradlew check.
[ ] I have added tests for my changes.
[ ] I have added documentation for the Ref Guide (for Solr changes only).