
(7/5) [nexus-db-queries] Benchmark for VMM reservation #7498

Open · wants to merge 125 commits into main

Conversation

@smklein (Collaborator) commented Feb 6, 2025

Following up on the affinity work, I wanted to validate that the additional logic for affinity groups does not make the performance of the instance reservation query any worse than it was before.

@hawkw (Member) left a comment:

This is a cursory first review; I want to look closer at the actual benchmarks, but I figured I'd go ahead and leave some quick notes on the surrounding code.

Comment on lines +12 to +13
//! It may be worth refactoring some of these functions to a test utility
//! crate to avoid the duplication.
Member:
Alternatively, we could perhaps just publicly export them under a "test utils" feature, or something? That might be a simpler way to de-duplicate this without having to go make a whole new crate for them?
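
A minimal sketch of what that could look like, assuming a hypothetical `testing` feature (the feature name and module path here are illustrative, not taken from the actual crate):

```rust
// In the crate's lib.rs (sketch): compile the helpers into downstream
// builds only when a hypothetical `testing` feature is enabled. The
// feature would be declared in Cargo.toml as:
//
//     [features]
//     testing = []
//
// Unit tests within the crate itself still see the module via `cfg(test)`.
#[cfg(any(test, feature = "testing"))]
pub mod test_utils;
```

Dependents would then opt in with `features = ["testing"]` in their dev-dependency on the crate.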

Collaborator Author:
I did this refactor in 36ddafc

Comment on lines 36 to 47
ContentionQuery {
sql: "SELECT table_name, index_name, num_contention_events::TEXT FROM crdb_internal.cluster_contended_indexes",
description: "Indexes which are experiencing contention",
},
ContentionQuery {
sql: "SELECT table_name,num_contention_events::TEXT FROM crdb_internal.cluster_contended_tables",
description: "Tables which are experiencing contention",
},
ContentionQuery {
sql: "WITH c AS (SELECT DISTINCT ON (table_id, index_id) table_id, index_id, num_contention_events AS events, cumulative_contention_time AS time FROM crdb_internal.cluster_contention_events) SELECT i.descriptor_name as table_name, i.index_name, c.events::TEXT, c.time::TEXT FROM crdb_internal.table_indexes AS i JOIN c ON i.descriptor_id = c.table_id AND i.index_id = c.index_id ORDER BY c.time DESC LIMIT 10;",
description: "Top ten longest contention events, grouped by table + index",
},
Member:
sorry to be annoying, but...i don't suppose we could wrap some of these queries?

Collaborator Author:
Updated!
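
For illustration, one way the long strings could be wrapped without changing the SQL itself is `concat!`, which joins the literals at compile time (a sketch, not necessarily what the actual commit did):

```rust
ContentionQuery {
    // `concat!` joins the string literals at compile time, so the
    // resulting SQL is byte-for-byte identical to the one-line version.
    sql: concat!(
        "WITH c AS (SELECT DISTINCT ON (table_id, index_id) ",
        "table_id, index_id, num_contention_events AS events, ",
        "cumulative_contention_time AS time ",
        "FROM crdb_internal.cluster_contention_events) ",
        "SELECT i.descriptor_name as table_name, i.index_name, ",
        "c.events::TEXT, c.time::TEXT ",
        "FROM crdb_internal.table_indexes AS i ",
        "JOIN c ON i.descriptor_id = c.table_id AND i.index_id = c.index_id ",
        "ORDER BY c.time DESC LIMIT 10;",
    ),
    description: "Top ten longest contention events, grouped by table + index",
},
```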

Comment on lines +103 to +104
// You can also set the "SHOW_CONTENTION" environment variable to display
// additional data from CockroachDB tables about contention statistics.
Member:
lovely!

print!("|");
total_len += width + 3;
}
println!("");
Member:
huh, i thought clippy disliked printlns with empty strings in them, but perhaps i made that up...
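
For the record, clippy does have such a lint, `clippy::println_empty_string`, which flags the empty-string form and suggests the bare macro call:

```rust
fn main() {
    // Flagged by `clippy::println_empty_string`:
    println!("");
    // Clippy's suggested replacement; prints the same lone newline:
    println!();
}
```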

@gjcolombo (Contributor) left a comment:

This looks good to me, though I have a few small comments. Thanks for putting this together!

One other general remark: it'd be good to have some checked-in instructions for running the tests (assuming I haven't overlooked them). Do I just run cargo bench? Is there some initial setup required to establish a baseline that Criterion can detect regressions against?

// For example: if the total number of vmms has no impact on the next provisioning
// request, we should see similar durations for "100 vmms reserved" vs "1 vmm
// reserved". However, if more vmms actually make reservation times slower, we'll see
// the "100 vmm" case take longer than the "1 vmm" case. The same goes for tasks:
Contributor:
nit: partial comment? (The line ends with a colon.)

Collaborator Author:
whoops, fixed in b8c5024

Comment on lines +357 to +359
// due to contention. We normally bubble this out to users,
// rather than stalling the request, but in this particular
// case, we choose to retry immediately.
Contributor:
Is it possible/worthwhile to report the number of retries as a lower-is-better benchmarked metric, or does criterion just deal with durations? Extra retries will hurt the average duration, too, but I can see it being useful to have an explicit metric for this, especially if "real" retries produce a user error.

Contributor:
It looks like the answer may be "no" per this bit of the docs ("Note that as of version 0.3.0, only timing measurements are supported, and only a single measurement can be used for one benchmark. These restrictions may be lifted in future versions."). Alas. I'm leaving the comment here, though, in case there's some other clever way to display this that I haven't thought of yet.

Collaborator Author:
Yeah, I think it's okay right now that we're tracking time -- if it helps, it's a decent proxy for retries, even though, as you say, we don't actually acknowledge the number of user-visible retries here.

I'm mostly doing this so I can force the system into contention, which is where benchmarking is most insightful anyway.
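
If a retry count ever does become worth surfacing, one low-tech option outside of Criterion's measurements (purely a hypothetical sketch; none of these names exist in the PR) is an atomic counter printed alongside the timing output:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical global counter, bumped each time a reservation attempt
// hits contention and is retried instead of returning an error.
static RETRIES: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the real reservation attempt: fails twice to simulate
// contention, then succeeds.
fn try_reserve(attempt: usize) -> Result<(), ()> {
    if attempt < 2 { Err(()) } else { Ok(()) }
}

fn reserve_with_retries() {
    let mut attempt = 0;
    while try_reserve(attempt).is_err() {
        // Retry immediately on contention, but remember that we did.
        RETRIES.fetch_add(1, Ordering::Relaxed);
        attempt += 1;
    }
}

fn main() {
    reserve_with_retries();
    // Report the count next to Criterion's duration output.
    println!("contention retries: {}", RETRIES.load(Ordering::Relaxed));
}
```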

Comment on lines +81 to +82
// Number of vmms to provision from the task-under-test
vmms: usize,
Contributor:
Just to make sure I'm following everything: this also determines the number of instances the test creates, and each task will end up allocating one VMM for each of these instances (i.e., the VMMs/instances are not partitioned by task). Is that correct? (I'm 99% sure it is from the phrase "task-under-test," just making sure I haven't missed something.)

Collaborator Author:
That is correct: each "task" operates on a totally distinct set of instances/vmms from the other tasks.

},
],
},
// TODO create a test for "policy = Fail" groups.
Contributor:
nit: you might've been planning to file one already, but this seems like it's probably worth an issue

Collaborator Author:
Filed #7628

@smklein (Collaborator Author) commented Feb 25, 2025

> One other general remark: it'd be good to have some checked-in instructions for running the tests (assuming I haven't overlooked them). Do I just run cargo bench? Is there some initial setup required to establish a baseline that Criterion can detect regressions against?

I tried to comment this in:

/////////////////////////////////////////////////////////////////
//
// BENCHMARKS
//
// You can run these with the following command:
//
// ```bash
// cargo bench -p nexus-db-queries
// ```
//
// You can also set the "SHOW_CONTENTION" environment variable to display
// additional data from CockroachDB tables about contention statistics.

But I can make a README too!

Criterion has command-line args for saving and comparing against a named baseline, but you can also create a baseline by just running it once, then running it again later: by default, the baseline is whatever you ran last time.
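
For concreteness, the named-baseline flags look like this (the `before` name is just an example, and the SHOW_CONTENTION usage is an assumption based on the comment quoted above):

```bash
# Save the current results as a named baseline...
cargo bench -p nexus-db-queries -- --save-baseline before

# ...then, after making changes, compare against that baseline:
cargo bench -p nexus-db-queries -- --baseline before

# Optionally, also dump CockroachDB contention statistics
# (assuming any value enables it):
SHOW_CONTENTION=1 cargo bench -p nexus-db-queries
```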

Base automatically changed from sled-resource-vmm to main February 25, 2025 19:35