Refactoring on index benchmark execution and ControlledExecutor #50
Conversation
Simplified some logic in ControlledExecutor such that it better adheres to the rpm/max execution count defined
I haven't tested these changes, but nothing jumps out as obviously wrong. Thanks for tackling this.
Thank you so much @chatman! This is one sizeable refactoring, so take your time! I don't have time yet to write any unit test cases 😓 but I did some simple testing on my end (mostly single-threaded though):
I have also started using that for our use case - having light indexing and heavy querying concurrently, with a fixed duration for both. Would be great if you could test on your use cases as well! 💪🏼
Sure, I'll update you by tomorrow on how the testing goes. Thanks!
@chatman thanks for approving it! I'm wondering if you have had a chance to run through the test cases on your end too? 😊
I'll test them today. Thank you for the patience :-)
```java
 * @return
 * @throws IOException
 */
List<String> readDocs(int count) throws IOException;
```
`nextBatch()` would be a better name?
```java
import java.io.IOException;
import java.util.List;

public interface DocReader extends AutoCloseable {
```
This is actually not a `DocReader`; it is agnostic of docs. It is a `BatchedLinesReader`, or we should get rid of this interface altogether and use `Iterator<List<String>>`.
Thank you for bringing that up!
I named it `DocReader` as it's specific to our usage (reading docs as lines of string), even though it looks pretty much like a `BatchedLinesReader` as you mention. The implementing class does expect to read in lines and perform any pre-processing (removing `_version_`, for example) such that the lines read are ready to be consumed as docs as-is.
We probably cannot just use `Iterator`, as it's a bit too generic; a code reader could find it hard to reason about what exactly it is, and this interface does more than just iteration (for example, reading a batch of input docs) and could have extra methods added for other doc-specific operations in the future :)
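For reference, here is the shape of the interface under discussion, pieced together from the diff context in this thread; the javadoc wording is mine, not the PR's:

```java
import java.io.IOException;
import java.util.List;

/**
 * Reads input docs as pre-processed lines of text (for example with
 * _version_ already stripped), ready to be consumed as docs as-is.
 */
public interface DocReader extends AutoCloseable {
    /**
     * @param count a sensible batch size; implementations may return more
     * @return the next batch of doc lines, or null when the source is
     *         exhausted or maxDocs has been reached
     * @throws IOException
     */
    List<String> readDocs(int count) throws IOException;
}
```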
```java
@Override
public void handle(Map<String, Object> record, String path) {
    idParsed = record.get(benchmark.idField) instanceof String ?
```
If `idField` is missing you may get an NPE.
AFAIK, the bean should have `idField` default to "id". This parser is a refactoring of existing code from https://github.com/fullstorydev/solr-bench/pull/50/files/07565fdeb86670e2e32057cb893d0605a836a14e#diff-c443b311a116a519dcdbf35939db1b752571af41e44a38f37368dc4770092ab1L318, so it should be rather safe 😊
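For what it's worth, `null instanceof String` evaluates to false in Java, so any NPE risk would sit in whatever the (elided) else branch of the ternary does with a missing value. A hypothetical defensive variant, with an invented fallback:

```java
// Hypothetical defensive handling; the diff excerpt above elides the
// actual else branch of the ternary, so this is an illustration only.
Object idValue = record.get(benchmark.idField);
if (idValue instanceof String) {
    idParsed = (String) idValue;
} else if (idValue != null) {
    idParsed = String.valueOf(idValue); // e.g. numeric ids
} else {
    throw new IllegalArgumentException("Doc is missing id field: " + benchmark.idField);
}
```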
```java
while (!exit && (inputDocs = docReader.readDocs(benchmark.batchSize)) != null) { // can read more than batch size, just use batch size as a sensible value
    for (String inputDoc : inputDocs) {
        rdr.streamRecords(new StringReader(inputDoc), idParser);
        Slice targetSlice = docCollection.getRouter().getTargetSlice(idParser.idParsed, null, null, null, docCollection);
```
handle missing id field?
```java
for (String inputDoc : inputDocs) {
    rdr.streamRecords(new StringReader(inputDoc), idParser);
    Slice targetSlice = docCollection.getRouter().getTargetSlice(idParser.idParsed, null, null, null, docCollection);
    List<String> shardDocs = shardVsDocs.get(targetSlice.getName());
```
`computeIfAbsent()`?
Good suggestion! 👍🏼
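For illustration, the suggested change might look roughly like this (assuming `shardVsDocs` is a `Map<String, List<String>>`):

```java
// computeIfAbsent() creates the per-shard list on first access,
// replacing the get() + null-check + put() pattern.
List<String> shardDocs =
    shardVsDocs.computeIfAbsent(targetSlice.getName(), shard -> new ArrayList<>());
shardDocs.add(inputDoc);
```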
```java
        }
    }
}
shardVsDocs.forEach((shard, docs) -> { // flush the remaining ones
```
So, according to this logic, if `shard1` has crossed the batch size, all shards are uploaded? Is this the intent?
Great questions!
This is to mimic the old behavior at https://github.com/fullstorydev/solr-bench/pull/50/files/07565fdeb86670e2e32057cb893d0605a836a14e#diff-c443b311a116a519dcdbf35939db1b752571af41e44a38f37368dc4770092ab1L358, which, after all lines are read from the source, flushes out whatever is left in the `shardVsDocs` map that was never pushed out because it hadn't reached the batch size. Take note that the logic here only sends out the rest if `exit` is not true (i.e. no one has called `close()` on this supplier yet). Therefore we can break it down into the scenarios below:
- If no one has explicitly called `close`, the reader just reads all the lines, and this code in `IndexBatchSupplier` flushes the remaining docs in `shardVsDocs`.
- If someone has called `close`, `IndexBatchSupplier` should not attempt to queue any more pending batches. Take note that no current code does this while the supplier is active; it is just put in place for completeness of the implementation (and probably some future usage can make use of it).

Take note that this supplier is unaware of the `duration`/`maxExecution` or `maxDocs` settings of the benchmark. Those two restrictions are handled elsewhere:
- `duration`/`maxExecution` are handled by `ControlledExecutor`; such an executor can decide to drop the remaining tasks (if the duration has elapsed) or continue with the rest of the submitted batch (if the submitted count reaches `maxExecution`).
- `maxDocs` is handled by the `DocReader`; once it reaches `maxDocs`, the `readDocs` method should return `null` (see the sketch after this list).
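To make that division of labor concrete, here is a minimal sketch of the `readDocs` contract described above; the field names (`docsRead`, `maxDocs`, `reader`) are assumptions for illustration, not the PR's actual code:

```java
// Hypothetical FileDocReader.readDocs(): enforces maxDocs by returning
// null once the cap is reached; duration/maxExecution enforcement stays
// with ControlledExecutor.
@Override
public List<String> readDocs(int count) throws IOException {
    if (docsRead >= maxDocs) {
        return null; // signals the supplier to stop producing batches
    }
    List<String> lines = new ArrayList<>(count);
    String line;
    while (lines.size() < count && docsRead < maxDocs
            && (line = reader.readLine()) != null) {
        lines.add(line);
        docsRead++;
    }
    return lines.isEmpty() ? null : lines;
}
```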
@patsonluk, I ran them on a few suites yesterday and didn't notice any problems.
I've created an issue here, we can revisit later.
… the exception bubble up and the main flow to handle it (interrupt corresponding tasks)
This reverts commit 0e93b21.
Description
A refactoring effort to address:
- `duration-sec` is defined for `IndexBenchmark` but it currently has no effect
- Indexing does not go through `ControlledExecutor`, which controls duration/rate for the querying test
- `ControlledExecutor` has a certain discrepancy between the rpm configured and the actual rpm achieved, especially in high-throughput scenarios

Solution
- Introduced `IndexBatchSupplier`, which "supplies" tasks (`Callable`s), where each callable is an index batch writing operation, and refactored the existing indexing logic to use such a supplier and pass it to `ControlledExecutor`. Such an `IndexBatchSupplier` ensures better separation of concerns between the 2 major tasks: reading the docs (`DocReader`/`FileDocReader`, which respect `maxDocs`) and writing the index batches
- Refactored `ControlledExecutor` such that the controls (duration/rpm) are more well defined and precisely respected.

This is a pretty big refactoring, but I'm hoping this could simplify/solidify the query/index design and code implementation. Could really use some code review and thoughts! @chatman Many thanks! 🙇🏼
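A rough sketch of the wiring this description implies; the constructor shapes and names other than `IndexBatchSupplier`/`ControlledExecutor`/`DocReader` are guesses for illustration, not the PR's actual signatures (error handling elided):

```java
// Hypothetical wiring: IndexBatchSupplier turns batched docs into
// index-writing Callables; ControlledExecutor enforces duration, rpm,
// and max execution count while draining the supplier.
try (DocReader docReader = new FileDocReader(datasetFile, benchmark.maxDocs)) {
    IndexBatchSupplier batchSupplier =
        new IndexBatchSupplier(docReader, docCollection, client);
    ControlledExecutor executor = new ControlledExecutor(
        benchmark.threads, benchmark.durationSecs, benchmark.rpm,
        benchmark.maxExecution, batchSupplier);
    executor.run(); // stops on duration elapsed, maxExecution reached, or supplier exhaustion
}
```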