
Refactoring on index benchmark execution and ControlledExecutor #50

Merged 15 commits from patsonluk/controlled-index-benchmark into master, Feb 27, 2023

Conversation

patsonluk (Contributor)

Description

A refactoring effort to address:

  1. duration-sec is defined for IndexBenchmark but currently has no effect
  2. The indexing test uses its own implementation for rate control and does not share the ControlledExecutor that controls duration/rate for the query test
  3. ControlledExecutor shows a certain discrepancy between the configured rpm and the rpm actually achieved, especially in high-throughput scenarios

Solution

  1. Introduced IndexBatchSupplier, which "supplies" tasks (Callable), where each callable is an index-batch writing operation; refactored the existing indexing logic to use such a supplier and pass it to ControlledExecutor (a rough sketch of this flow follows the list). Such an IndexBatchSupplier ensures better separation of concerns between the 2 major tasks:
    1. reading/parsing docs from the input doc JSON file (as DocReader/FileDocReader, which respect maxDocs)
    2. generating the indexing tasks (looking up the shard, forming batches, etc.)
  2. Simplified and rewrote some logic in ControlledExecutor so that the controls (duration/rpm) are better defined and precisely respected.
  3. Simplified RateLimiter to regulate the actual rpm closer to the provided target rpm.
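To make the flow concrete, here is a minimal sketch of the intended wiring. Only the names ControlledExecutor, RateLimiter, and the supplier/Callable pattern come from this PR; the class and signatures below are illustrative assumptions, not the actual implementation:

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Illustrative only: an executor that pulls Callables from a supplier and
// runs them until the duration elapses or the supplier runs out of tasks.
public class ControlledExecutorSketch {
    public static void run(Supplier<Callable<?>> taskSupplier, long durationMillis) throws Exception {
        long deadline = System.currentTimeMillis() + durationMillis;
        Callable<?> task;
        // a rate limiter would pace these submissions toward the target rpm
        while (System.currentTimeMillis() < deadline
                && (task = taskSupplier.get()) != null) {
            task.call(); // the real executor would submit to a worker pool
        }
    }
}
```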

This is a pretty big refactoring, but I'm hoping it can simplify/solidify the query/index design and code implementation. Could really use some code review and thoughts! @chatman Many thanks! 🙇🏼

@chatman (Collaborator) left a comment:

I haven't tested these changes, but nothing jumps out as obviously wrong. Thanks for tackling this.

Base automatically changed from patsonluk/benchmark-bean-refactoring to master on February 15, 2023, 16:58
@patsonluk (Contributor, Author) replied:

> I haven't tested these changes, but nothing jumps out as obviously wrong. Thanks for tackling this.

Thank you so much @chatman! This is one sizeable refactoring, so take your time!

I haven't had time yet to write any unit tests 😓 but I did some simple testing on my end (mostly single-threaded, though):

  1. Tested different combinations of the restriction factors (duration, maxDocs/runCount) and verified the expected behavior
  2. Verified the rpm rate after the changes
  3. Ensured index tasks still fully index all docs when no restrictions are given

I have also started using it for our use case: light indexing and heavy querying running concurrently, with a fixed duration for both.

Would be great if you could test it against your use cases as well! 💪🏼

@chatman (Collaborator) commented Feb 15, 2023:

Sure, I'll update you by tomorrow on how the testing goes. Thanks!

@patsonluk (Contributor, Author) commented:

@chatman thanks for approving it! I'm wondering if you've had a chance to run through the test cases on your end too? 😊

@chatman (Collaborator) commented Feb 18, 2023:

> I'm wondering if you've had a chance to run through the test cases on your end too?

I'll test them today. Thank you for the patience :-)

```java
 * @return
 * @throws IOException
 */
List<String> readDocs(int count) throws IOException;
```
A collaborator commented:

nextBatch() would be a better name?

```java
import java.io.IOException;
import java.util.List;

public interface DocReader extends AutoCloseable {
```
@noblepaul (Collaborator) commented Feb 20, 2023:

This is actually not a DocReader; it is agnostic of docs. It is a BatchedLinesReader.
Or we should get rid of this interface altogether and use Iterator<List<String>>

@patsonluk (Contributor, Author) replied:

Thank you for bringing that up!

I named it DocReader as it's specific to our usage (reading docs as lines of strings), even though it looks pretty much like a BatchedLinesReader, as you mention. The implementing class is expected to read in lines and perform any pre-processing (removing _version_, for example) so that the lines read are ready to be consumed as docs as-is.

We probably can't just use Iterator, as it's a bit too generic: it would be hard for a code reader to reason about what it is exactly, this interface does more than just iteration (for example, reading a batch of input docs), and it could have extra methods added for other doc-specific operations in the future :)
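For reference, a minimal sketch of the interface shape being discussed; the readDocs signature is taken from the diff above, and the comments summarize the behavior described in this thread:

```java
import java.io.IOException;
import java.util.List;

// Batch-oriented by design, rather than a plain Iterator<List<String>>,
// leaving room for doc-specific methods to be added later.
public interface DocReader extends AutoCloseable {
    // Returns a batch of pre-processed doc lines, ready to be consumed as
    // docs as-is, or null once the input (or maxDocs) is exhausted.
    List<String> readDocs(int count) throws IOException;
}
```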


```java
@Override
public void handle(Map<String, Object> record, String path) {
    idParsed = record.get(benchmark.idField) instanceof String ?
```
A collaborator commented:

If idField is missing, you may get an NPE.

@patsonluk (Contributor, Author) replied:

AFAIK, the bean should default idField to "id". This parser is a refactoring of existing code from https://github.com/fullstorydev/solr-bench/pull/50/files/07565fdeb86670e2e32057cb893d0605a836a14e#diff-c443b311a116a519dcdbf35939db1b752571af41e44a38f37368dc4770092ab1L318, so it should be rather safe 😊
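For completeness, a sketch of a defensive variant; note that instanceof is false for null, so the ternary in the diff cannot NPE by itself — the concern is a null idParsed reaching shard routing later. The class, field, and parameter names below are illustrative, not the actual code:

```java
import java.util.Map;

// Illustrative guard: fail fast on a missing id instead of letting a null
// idParsed flow into getTargetSlice() downstream.
class IdParserSketch {
    String idParsed;

    void handle(Map<String, Object> record, String idField) {
        Object id = record.get(idField); // null when the field is missing
        if (id == null) {
            throw new IllegalArgumentException("doc is missing id field: " + idField);
        }
        idParsed = (id instanceof String) ? (String) id : id.toString();
    }
}
```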

```java
while (!exit && (inputDocs = docReader.readDocs(benchmark.batchSize)) != null) { // can read more than batch size, just use batch size as a sensible value
    for (String inputDoc : inputDocs) {
        rdr.streamRecords(new StringReader(inputDoc), idParser);
        Slice targetSlice = docCollection.getRouter().getTargetSlice(idParser.idParsed, null, null, null, docCollection);
```
A collaborator commented:

Handle missing id field?

```java
for (String inputDoc : inputDocs) {
    rdr.streamRecords(new StringReader(inputDoc), idParser);
    Slice targetSlice = docCollection.getRouter().getTargetSlice(idParser.idParsed, null, null, null, docCollection);
    List<String> shardDocs = shardVsDocs.get(targetSlice.getName());
```
A collaborator commented:

computeIfAbsent()?

@patsonluk (Contributor, Author) replied:

Good suggestion! 👍🏼
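For illustration, the accepted suggestion would look roughly like this; shardVsDocs comes from the diff, while the surrounding class is hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// computeIfAbsent creates and registers the per-shard list on first access,
// replacing a get() that could return null.
class ShardBatcherSketch {
    private final Map<String, List<String>> shardVsDocs = new HashMap<>();

    void add(String shardName, String inputDoc) {
        shardVsDocs.computeIfAbsent(shardName, s -> new ArrayList<>()).add(inputDoc);
    }
}
```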

```java
        }
    }
}
shardVsDocs.forEach((shard, docs) -> { // flush the remaining ones
```
A collaborator commented:

So, according to this logic, if shard1 has crossed the batch size, all shards are uploaded? Is this the intent?

@patsonluk (Contributor, Author) commented Feb 21, 2023:

Great questions!

This is to mimic the old behavior at https://github.com/fullstorydev/solr-bench/pull/50/files/07565fdeb86670e2e32057cb893d0605a836a14e#diff-c443b311a116a519dcdbf35939db1b752571af41e44a38f37368dc4770092ab1L358

which, after all lines are read from the source, flushes out whatever is left in the shardVsDocs map that was never pushed out because it did not reach the batch size. Note that the logic here only sends out the rest if exit is not true (i.e. no one has called close() on this supplier yet). Therefore we can break it down into the scenarios below:

  1. If no one has explicitly called close, the reader just reads all the lines, and this code in IndexBatchSupplier flushes the remaining docs in shardVsDocs
  2. If someone has called close, IndexBatchSupplier should not attempt to queue any more pending batches. Note that no current code does this while the supplier is active; it is in place for completeness of the implementation (and some future usage could make use of it)

Note that this supplier is unaware of the duration/maxExecution or maxDocs settings for the benchmark. Those two restrictions are handled elsewhere (a condensed sketch of the whole loop follows this list):

  1. duration/maxExecution is handled by ControlledExecutor; such an executor can decide to drop the remaining tasks (if the duration has elapsed) or continue with the rest of the submitted batch (if the submitted count reaches maxExecution)
  2. maxDocs is handled by the DocReader; once it reaches maxDocs, the readDocs method should return null
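A condensed sketch of the scenario breakdown above; the exit flag, shardVsDocs, and the loop condition come from the diff, while the routing and batch-submission placeholders are illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Accumulate docs per shard while the reader has input and close() has not
// been called, then flush whatever never reached the batch size.
class FlushSketch {
    interface Reader { List<String> readDocs(int count) throws IOException; }

    private volatile boolean exit = false;
    private final Map<String, List<String>> shardVsDocs = new HashMap<>();

    void drain(Reader docReader, int batchSize) throws IOException {
        List<String> inputDocs;
        while (!exit && (inputDocs = docReader.readDocs(batchSize)) != null) {
            for (String doc : inputDocs) {
                String shard = routeToShard(doc);
                shardVsDocs.computeIfAbsent(shard, s -> new ArrayList<>()).add(doc);
            }
        }
        if (!exit) { // no one called close(): flush the remaining ones
            shardVsDocs.forEach(this::submitBatch);
            shardVsDocs.clear();
        }
    }

    String routeToShard(String doc) {
        return "shard1"; // placeholder; the real code asks the collection's DocRouter
    }

    void submitBatch(String shard, List<String> docs) {
        // placeholder; the real code queues a Callable that writes this batch
    }

    void close() {
        exit = true; // stop queuing any more pending batches
    }
}
```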

@chatman (Collaborator) commented Feb 21, 2023:

@patsonluk, I ran them on a few suites yesterday and didn't notice any problems.

@chatman (Collaborator) commented Feb 21, 2023:

> I haven't had time yet to write any unit tests

I've created an issue here (#53); we can revisit it later.

@patsonluk merged commit df09bf2 into master on Feb 27, 2023.
@patsonluk deleted the patsonluk/controlled-index-benchmark branch on February 27, 2023, 18:25.