-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-5441 Adds the time_series_lastpoint_high_cardinality
workload.
#1218
base: master
Are you sure you want to change the base?
Conversation
time_series_lastpoint_high_cardinality
workload.time_series_lastpoint_high_cardinality
workload.
Thanks @yun-soo! Would you be able to fill out the New Workload Request form? Also just want to confirm that for this PR only the TimeSeriesLastpointHighCardinality workload file needs to be reviewed, and the changes to the TimeSeriesLastpoint workload/phase file will be dealt with in #1219? Do you have an idea of how noisy this workload is? We'd generally recommend running x5 patches to get a sense of this. You can use the performance analyzer to automate this (see the third method here). |
Thanks a lot for the review, @alicedoherty!
Filed the request. I noticed that build variant list is not consistent with the up-to-date one.
Yes, only TimeSeriesLastpointHighCardinality needs to be reviewed and they will in #1219
Just scheduled one here. Will update this as soon as I will get results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yun-soo! Sorry for the delay, was focusing on skunkworks. Thanks for calling out that the variants in the form are outdated, I've opened PERF-5488.
Do you have any idea why the Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate is so noisy? Since this seems like one of the key metrics you care about this would make triaging changepoints pretty hard for this measurement.
Also as an FYI in the perf analyzer if you select to show the "CoV (%)" it will show the calculated coefficient of variation (the main measure of noise our team uses) for each measurement in your patches to save you some time when doing a noise analysis.
$gte: v7.0 | ||
mongodb_setup: | ||
$eq: | ||
- standalone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've gotten rid of our regular standalone variants. Having this line will just mean it will only be running on the ARM and Intel 3-Node ReplSet 2023-11 variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies but I don't understand what you're suggesting here. Could you clarify on it so that I know what should be changed? I could not exactly correlate these names with the updated build variants.
These lines are exactly same as in the companion workload TimeSeriesLastpoint.yml
. And I was able to see that this workload is enabled on perf-3-node-replSet-intel.intel.aws.2023-11 and perf-3-node-replSet.arm.aws.2023-11 and perf-3-node-replSet-all-feature-flags.arm.2023-11. I think these would be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry the way I worded that was confusing 😅 We've removed our regular standalone variants (SERVER-89499). Having $eq: standalone, standalone-XXX
won't cause any issues, it just means it won't match an existing variant, and therefore won't run on any standalone variants. I just wanted to call this out so you weren't surprised that the test isn't running on standalone variants despite having this line in your code.
For clarity it's probably best to remove the standalone
lines here.
(JFYI WRT the correlation between these lines and our variants - our variants are defined in perf.yml and perf_branching.yml. Each variant has a mongodb_setup: XXX
line. In your Genny file if you write AutoRun: When: mongodb_setup: $eq: replica
this means it will run on any variant that has mongodb_setup: replica
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah~ makes sense now!! Thanks for detailed explanation. Really appreciate it! Removed standalone(-*)
for clarity just like you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review! Addressed comments or asked follow up questions.
Sorry for the delay, was focusing on skunkworks. Thanks for calling out that the variants in the form are outdated, I've opened PERF-5488.
No worries and I was also focusing on skunkworks as many others were. Thanks for filing the ticket.
Do you have any idea why the Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate is so noisy? Since this seems like one of the key metrics you care about this would make triaging changepoints pretty hard for this measurement.
I don't have a good idea on why yet. I'll spend some time to figure out why but it'll take some time. I'll get back to this PR when I have a good idea. Do you happen to know what could cause such noisiness in other workloads?
Also as an FYI in the perf analyzer if you select to show the "CoV (%)" it will show the calculated coefficient of variation (the main measure of noise our team uses) for each measurement in your patches to save you some time when doing a noise analysis.
Thanks for letting me know. What's the threshold which the team decides a metric is too noisy on? I noticed that CoVs were there, but I didn't know how to interpret those values.
$gte: v7.0 | ||
mongodb_setup: | ||
$eq: | ||
- standalone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies but I don't understand what you're suggesting here. Could you clarify on it so that I know what should be changed? I could not exactly correlate these names with the updated build variants.
These lines are exactly same as in the companion workload TimeSeriesLastpoint.yml
. And I was able to see that this workload is enabled on perf-3-node-replSet-intel.intel.aws.2023-11 and perf-3-node-replSet.arm.aws.2023-11 and perf-3-node-replSet-all-feature-flags.arm.2023-11. I think these would be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you happen to know what could cause such noisiness in other workloads?
Looking at the existing time_series_lastpoint test it seems much of the underlying test design/queries are noisy. If you're getting useful signal from that existing test I think it would be okay to merge this test with similar levels, however, I definitely think it's worth looking into the noise of Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate as a CV of 73% (!) and 12% seems like an indication that something is not testing what you'd expect. Since the really bad noise is limited to this specific phase and (potentially) ARM it might be a good idea to do some profiling, t2 analysis, etc. on that specific part of the test (looking at one of the really good vs bad runs). There are some general resources documented here, but feel free to ask in #ask-devprod-performance if you need more guidance.
What's the threshold which the team decides a metric is too noisy on?
We've struggled to define a specific threshold. Ideally we would like it <1% but at the moment with our current testing infrastructure this seems a little unrealistic. Personally, for tests like this I'd be concerned about metrics with CV >5%.
If this is a time-sensitive feature you want to test - it could also be an option to tell the build barons to ignore the really noisy metric (since it's not the whole task) and open a follow-on ticket to investigate this further.
$gte: v7.0 | ||
mongodb_setup: | ||
$eq: | ||
- standalone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry the way I worded that was confusing 😅 We've removed our regular standalone variants (SERVER-89499). Having $eq: standalone, standalone-XXX
won't cause any issues, it just means it won't match an existing variant, and therefore won't run on any standalone variants. I just wanted to call this out so you weren't surprised that the test isn't running on standalone variants despite having this line in your code.
For clarity it's probably best to remove the standalone
lines here.
(JFYI WRT the correlation between these lines and our variants - our variants are defined in perf.yml and perf_branching.yml. Each variant has a mongodb_setup: XXX
line. In your Genny file if you write AutoRun: When: mongodb_setup: $eq: replica
this means it will run on any variant that has mongodb_setup: replica
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely think it's worth looking into the noise of Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate as a CV of 73% (!) and 12% seems like an indication that something is not testing what you'd expect. Since the really bad noise is limited to this specific phase and (potentially) ARM it might be a good idea to do some profiling, t2 analysis, etc. on that specific part of the test (looking at one of the really good vs bad runs). There are some general resources documented here, but feel free to ask in #ask-devprod-performance if you need more guidance.
Thanks for detailed guidance and really appreciate it! I'll definitely look into those two metrics. I wanted to measure improved performance for the higher cardinality workload thanks to a new optimization because TimeSeriesLastpoint.yml
had a bug (so a noise) and also didn't give my team a good signal about the new optimization but don't want to get too much noise from this new workload either.
If this is a time-sensitive feature you want to test - it could also be an option to tell the build barons to ignore the really noisy metric (since it's not the whole task) and open a follow-on ticket to investigate this further.
Actually, that's not the case. I confirmed that the feature actually improves the performance in other workloads and this workload is for future performance regression testing, which is not urgent. I wanted to give a closure to the feature. I'm in the middle of something for the other project and so will get back to you with my analysis and any changes necessary to this workload.
$gte: v7.0 | ||
mongodb_setup: | ||
$eq: | ||
- standalone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah~ makes sense now!! Thanks for detailed explanation. Really appreciate it! Removed standalone(-*)
for clarity just like you suggested.
|
||
- LoadConfig: | ||
Path: *phasePath | ||
Key: QuiesceActor | ||
Parameters: | ||
MaxPhases: 23 | ||
MaxPhases: *MaxPhases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It might be clearer if the active phases for the quiesce actor are defined here rather than inheriting the defaults.
That way all the phase info is in this file, and it lines up with what is done for TimeSeriesLastpointHighCardinality.yml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in #1219.
|
||
################################################################################################## | ||
# Following 4 actors including the hidden QuiesceActor in phase 20 run a lastpoint query with a | ||
# sort and group on meta subfield ascending and time descending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this say meta subfield descending?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! done.
|
||
- LoadConfig: | ||
Path: *phasePath | ||
Key: QuiesceActor | ||
Parameters: | ||
MaxPhases: 23 | ||
MaxPhases: *MaxPhases | ||
|
||
AutoRun: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are deprecating support for standalone variants.
Can you remove the AutoRun conditions for the standalone setups?
This would also then match when the HighCardinality workload gets scheduled.
Also should the branch conditional be "$gte: v7.0" like it is for the HighCardinality one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jawwadasghar, addressed your comments
|
||
################################################################################################## | ||
# Following 4 actors including the hidden QuiesceActor in phase 20 run a lastpoint query with a | ||
# sort and group on meta subfield ascending and time descending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! done.
|
||
- LoadConfig: | ||
Path: *phasePath | ||
Key: QuiesceActor | ||
Parameters: | ||
MaxPhases: 23 | ||
MaxPhases: *MaxPhases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in #1219.
|
||
- LoadConfig: | ||
Path: *phasePath | ||
Key: QuiesceActor | ||
Parameters: | ||
MaxPhases: 23 | ||
MaxPhases: *MaxPhases | ||
|
||
AutoRun: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Jira Ticket: PERF-5441
Whats Changed
Adds the
time_series_lastpoint_high_cardinality
workload.Patch Testing Results
https://spruce.mongodb.com/version/663f01f365c3b80007df9990/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC