Fixes null pointer issues with Interval Results map #1255
Merged
Description
We have identified a critical issue: the timestamps recorded for CPU, memory, and GPU metrics can be badly out of sync. The problem arises when CPU and memory metrics have only a few data points while GPU metrics are recorded far more frequently. In a recent workload, for example, only two data points were available for CPU and memory over a given time range, whereas GPU had 30 entries, and 28 of those GPU records had no corresponding CPU or memory data within a reasonable time proximity.
When these records are mapped, a GPU record with no matching CPU and memory timestamp creates a new map entry that has no CPU or memory metrics. The metrics map for a given pod can therefore contain entries holding only GPU values. Pod-level calculations that rely on a complete set of metrics (CPU, memory, and GPU) then encounter null values, causing errors or crashes.
As a temporary measure, this PR implements a quick fix that drops GPU records whose timestamps do not match a CPU and memory timestamp within a ±5 minute buffer. Only GPU records that are in sync with CPU and memory timestamps are considered for further processing.
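The failure mode described above can be illustrated with a small, language-agnostic sketch. It is written in Python for brevity and is not the project's code: the map layout, the `merge_naively` helper, and the sample values are all hypothetical, chosen only to show how an unmatched GPU timestamp yields an entry without CPU or memory values.

```python
from datetime import datetime

# Hypothetical per-pod interval results, keyed by interval end time.
cpu_mem = {
    datetime(2024, 1, 1, 10, 0): {"cpu": 0.5, "memory": 256.0},
}
gpu = {
    datetime(2024, 1, 1, 10, 0): {"gpu": 0.8},
    datetime(2024, 1, 1, 10, 30): {"gpu": 0.9},  # no CPU/memory sample nearby
}


def merge_naively(cpu_mem, gpu):
    """Merge GPU values into the CPU/memory map by exact timestamp."""
    merged = {ts: dict(values) for ts, values in cpu_mem.items()}
    for ts, values in gpu.items():
        # An unmatched GPU timestamp creates a fresh entry with GPU values only.
        merged.setdefault(ts, {}).update(values)
    return merged


merged = merge_naively(cpu_mem, gpu)

# Entries with no CPU value are the ones that would later surface as nulls.
incomplete = [ts for ts, metrics in merged.items() if metrics.get("cpu") is None]
```

Any pod-level calculation that reads `cpu` or `memory` from an entry listed in `incomplete` would get a missing value, which is the null case this PR guards against.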
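A minimal sketch of this kind of filter is shown below, in Python for brevity. It is not the code in this PR: the function name `drop_unmatched_gpu_records`, the map layout, and the sample data are hypothetical, and treating the 5 minute boundary as inclusive is an assumption the description does not confirm.

```python
from datetime import datetime, timedelta

# Assumed tolerance, taken from the "±5 minutes" buffer in the description.
TOLERANCE = timedelta(minutes=5)


def drop_unmatched_gpu_records(gpu_records, cpu_mem_timestamps, tolerance=TOLERANCE):
    """Keep only GPU records within `tolerance` of some CPU/memory timestamp.

    The boundary is treated as inclusive here; that choice is an assumption.
    """
    kept = {}
    for ts, values in gpu_records.items():
        if any(abs(ts - ref) <= tolerance for ref in cpu_mem_timestamps):
            kept[ts] = values
    return kept


# Hypothetical sample data: two CPU/memory samples, four GPU samples.
cpu_mem_timestamps = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)]
gpu_records = {
    datetime(2024, 1, 1, 10, 3): {"gpu": 0.7},   # 3 min after a CPU/memory sample
    datetime(2024, 1, 1, 10, 20): {"gpu": 0.8},  # 20 min from the nearest sample
    datetime(2024, 1, 1, 10, 56): {"gpu": 0.9},  # 4 min before a CPU/memory sample
    datetime(2024, 1, 1, 11, 6): {"gpu": 0.6},   # 6 min after, outside the buffer
}

synced = drop_unmatched_gpu_records(gpu_records, cpu_mem_timestamps)
```

With this sample data, only the 10:03 and 10:56 GPU records remain in `synced`. The trade-off is that unmatched GPU data is discarded rather than aligned, which fits the description of this as a temporary fix.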
Fixes #1254
Type of change
How has this been tested?
Tested with the metrics profile from the latest hackathon branch, plus a manual test of recommendation generation for a Job-based workload.
Test Configuration
Checklist 🎯
Additional information