feat(validator): add support to validate essential metrics produced by Kepler #1834

vprashar2929 · 2024-11-04T11:20:57Z

This commit introduces functionality to validate essential metrics produced by Kepler.
The following comparisons are included:

- Node Exporter Comparison
  - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`
- Kepler Process Comparison
  - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to
    `kepler_process_<package|core|dram|platform|other|uncore>{dev}`
- Kepler Node Comparison
  - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against
    `kepler_node_<package|core|dram|platform|other|uncore>{dev}`

Additionally, the following changes are made to existing functionality:

- Adds a new `metric_validations.yaml` file, which includes PromQL queries for the comparisons along with threshold values
- Updates the existing `stressor.sh` script to support a few more parameters, making it more flexible:
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor has finished
  - repeats: number of times to repeat the stressor, since for the regression test
    we don't want to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file, which includes the configuration for the regression test
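
To illustrate how the three new stressor parameters relate to each other, here is a minimal sketch in Python rather than shell. It is not the code in `stressor.sh`: the function name `run_stressor`, its arguments, and the choice to apply warmup and cooldown once around all repeats are assumptions made for illustration only.

```python
import time
from typing import Callable, List


def run_stressor(
    load: Callable[[], None],
    warmup: float = 0.0,
    cooldown: float = 0.0,
    repeats: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run `load` `repeats` times, idling before and after (hypothetical helper).

    Returns the ordered list of phases that were executed.
    """
    phases: List[str] = []

    # Warmup: wait before starting the stressor so the baseline can settle.
    sleep(warmup)
    phases.append("warmup")

    # Repeats: a regression run would typically use repeats=1.
    for i in range(repeats):
        load()
        phases.append(f"load-{i + 1}")

    # Cooldown: wait after the stressor has finished.
    sleep(cooldown)
    phases.append("cooldown")
    return phases
```

With `repeats=1` the load runs exactly once between the two idle periods, which matches the regression-test use case described above.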

github-actions · 2024-11-04T11:22:14Z

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request enhances the validator module by introducing a new validate_metrics command to the CLI, which validates Kepler metrics and generates reports. Key modifications include:

- Added `validate_metrics` command with `--duration` and `--report-dir` options
- Modified `PrometheusJob` named tuple to include `dev` and `latest` fields
- Updated `load` function to initialize new fields when loading configuration from a file
- Introduced `max_mae` parameter to `validation_from_yaml` function for Maximum Absolute Error computation
- Added new functions: `validate_metrics`, `ScriptResult`, and `write_md_report`

Impact: These changes expand the validator's capabilities for handling and validating Kepler metrics, but do not affect exported function signatures or global data structures. The external interface and behavior are modified, requiring configuration file updates.

Observations/Suggestions:

- The changes seem well-structured and organized, with clear intentions to enhance the validator's features.
- It would be beneficial to include tests for the new `validate_metrics` command and its associated functions to ensure their correctness and robustness.
- Consider adding documentation or comments to explain the purpose and usage of the new `max_mae` parameter and the `ScriptResult` and `write_md_report` functions.
- Review the configuration file updates required for these changes to ensure a smooth transition for users.

sthaha · 2024-11-05T22:56:15Z

e2e/tools/validator/metric_validations.yaml

+
+validations:
+  # absolute power comparison
+  - name: Total - absolute


Can we also validate these invariants within the same version (dev)?

- `kepler_node_<pkg|core|uncore|dram|other>{dev}` = sum of `process_<pkg|core|uncore|dram|other>{dev}`
- `kepler_node_<pkg|core|..>` = `node_exporter_rapl_<pkg|core...>`
- `sum(kepler_process_bpf_cpu)` = `node_exporter_cpu_time`

vprashar2929 (reply, Nov 9, 2024):

For `kepler_node_<pkg|core|dram...>{dev}` = sum of `process_<pkg|core|dram...>{dev}`, do you mean the
MAE of `sum(rate(kepler_node_<pkg|core|dram>{dev}[20s]))` and `sum(rate(process_<pkg|core|dram>{dev}[20s]))`?
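
For reference, the MAE check being discussed can be sketched as follows. This assumes MAE means mean absolute error over two equally sampled series, for example the node-level rate and the summed process-level rate. The function names are hypothetical, and `max_mae` only borrows the parameter name mentioned in the summary above; this is not the validator's actual implementation.

```python
from typing import Sequence


def mean_absolute_error(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of |expected[i] - actual[i]| over two aligned series."""
    if len(expected) != len(actual) or len(expected) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)


def within_threshold(
    expected: Sequence[float], actual: Sequence[float], max_mae: float
) -> bool:
    """True when the MAE between the two series does not exceed max_mae."""
    return mean_absolute_error(expected, actual) <= max_mae


# Illustrative values only: node-level power vs. summed process-level power.
node_rate = [10.0, 12.0, 11.0]
process_sum_rate = [9.5, 12.5, 11.0]
print(mean_absolute_error(node_rate, process_sum_rate))  # (0.5 + 0.5 + 0.0) / 3
```

If the invariant holds, the MAE between the two series should stay close to zero, so a small `max_mae` would be enough to flag a regression.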

vprashar2929 · 2024-12-16T12:53:09Z

e2e/tools/validator/validator-regression.yaml

+    metal: metal  # Job name for metal metrics, default is metal
+
+  url: http://localhost:9090 # Prometheus server URL
+  rate_interval: 60s  # Rate interval for Promql, default is 20s, typically 4 x $scrape_interval


Explicitly using a rate interval of 60s because:

- Prometheus scrape interval = 3s
- Data points for a 12s interval (i.e. 4 × scrape interval) = 12/3 = 4 data points
- Data points for a 60s interval = 60/3 = 20 data points

With 20 data points, we get a smoother and more reliable rate estimate. When comparing two `sum(rate(...))` series, a stable rate reduces the variability in the MAE calculation, leading to more accurate assessments.
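
The arithmetic above can be reproduced with a small helper. This is only a sketch: `samples_in_window` is a hypothetical name, and the real number of samples Prometheus sees in a range window can differ by one depending on how scrapes align with the window boundaries.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scraped samples that fall inside a rate window."""
    if window_s <= 0 or scrape_interval_s <= 0:
        raise ValueError("window and scrape interval must be positive")
    return window_s // scrape_interval_s


# 3s scrape interval, as configured in this PR.
print(samples_in_window(12, 3))  # 4 * scrape_interval window
print(samples_in_window(60, 3))  # the 60s rate_interval used here
```

By the same calculation, a 60s window at the previous 5s scrape interval would hold about 12 samples instead of 20.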

vimalk78 · 2024-12-16T13:52:56Z

manifests/compose/monitoring/prometheus/prometheus.yml

@@ -1,5 +1,5 @@
 global:
-  scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
+  scrape_interval: 3s # Set the scrape interval to every 5 seconds. Default is every 1 minute.


Check why the scrape interval was changed, and update the comment accordingly.

vprashar2929 (reply, Dec 16, 2024):

Scraping every 3 seconds rather than every 5 seconds collects significantly more data points over a typical time window.

vprashar2929 · 2024-12-17T04:38:34Z

Here is a sample CI run, for reference, showing what this would look like once it is merged: https://github.com/sustainable-computing-io/kepler-metal-ci/actions/runs/12366281744/job/34512777104

My idea is to use the Equinix runners on demand on PRs. Reviewers or authors can add a comment on the PR, something like `/test-regression`, which will trigger a workflow like this one to test whether the metrics produced by the PR's Kepler code base deviate from those produced by latest.

vprashar2929 marked this pull request as draft November 4, 2024 11:21

vprashar2929 force-pushed the add-kep-reg branch 4 times, most recently from 827aefd to 564fa4c Compare November 5, 2024 17:49

vprashar2929 requested a review from sthaha November 5, 2024 17:53

sthaha reviewed Nov 5, 2024

e2e/tools/validator/metric_validations.yaml (outdated)

sthaha reviewed Nov 5, 2024

e2e/tools/validator/src/validator/config/__init__.py (outdated)

sthaha reviewed Nov 5, 2024

e2e/tools/validator/src/validator/cli/__init__.py

vprashar2929 force-pushed the add-kep-reg branch 6 times, most recently from 33d2963 to de1649f Compare November 9, 2024 14:18

vprashar2929 changed the title ~~feat(validator): add support to validate kepler metrics~~ feat(validator): add support to validate essential metrics produced by Kepler Nov 9, 2024

vprashar2929 force-pushed the add-kep-reg branch from de1649f to 5fa8028 Compare November 9, 2024 17:28

vprashar2929 marked this pull request as ready for review November 10, 2024 14:53

vprashar2929 mentioned this pull request Dec 4, 2024

chore(ci): migrate mock-acpi workflow to GH runner #1882

Merged

vprashar2929 force-pushed the add-kep-reg branch from 5fa8028 to 3c994bb Compare December 16, 2024 12:46

vprashar2929 commented Dec 16, 2024

vprashar2929 requested review from vimalk78, KaiyiLiu1234 and rootfs December 16, 2024 12:54

vprashar2929 force-pushed the add-kep-reg branch 2 times, most recently from 28889fe to fe32d16 Compare December 16, 2024 13:06

vimalk78 reviewed Dec 16, 2024

vprashar2929 force-pushed the add-kep-reg branch from fe32d16 to 990c844 Compare December 16, 2024 18:44

vprashar2929 force-pushed the add-kep-reg branch from 990c844 to b06242b Compare December 19, 2024 17:21
