Fixes null pointer issues with Interval Results map #1255

bharathappali
Member

Description

We have identified a critical issue: the timestamps recorded for CPU, memory, and GPU metrics can be significantly out of sync. It shows up when only a few data points are available for CPU and memory while GPU metrics are recorded far more frequently. For instance, in a recent workload only two CPU and memory data points were available over a given time range, whereas GPU had 30 entries, and 28 of those GPU records had no CPU or memory data within a reasonable time proximity.

When these records are mapped, a GPU record with no matching CPU and memory timestamp causes a new entry to be created that holds no CPU or memory metrics. The metrics map for a given pod can therefore end up with entries that contain only GPU values. Pod-level calculations that rely on a complete set of metrics (CPU, memory, and GPU) then encounter null values, causing errors or crashes.
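The failure mode described above can be reproduced in a few lines. This is a minimal, language-agnostic sketch written in Python, not the Kruize code: the map layout (timestamp to metric name to value), the function names `merge_metrics` and `gpu_only_intervals`, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime


def merge_metrics(cpu_mem, gpu):
    """Merge GPU values into a hypothetical interval-results map.

    cpu_mem: timestamp -> {"cpu": ..., "memory": ...}
    gpu:     timestamp -> GPU value
    """
    results = {}
    for ts, metrics in cpu_mem.items():
        results.setdefault(ts, {}).update(metrics)
    for ts, value in gpu.items():
        # With no matching CPU/memory timestamp, setdefault creates a
        # brand-new entry that will contain only the GPU value.
        results.setdefault(ts, {})["gpu"] = value
    return results


def gpu_only_intervals(results):
    """Return the timestamps whose entry lacks CPU or memory metrics."""
    return sorted(
        ts for ts, metrics in results.items()
        if "cpu" not in metrics or "memory" not in metrics
    )


# Made-up sample data: one CPU/memory point, two GPU points.
cpu_mem = {datetime(2024, 8, 9, 10, 0): {"cpu": 0.5, "memory": 256.0}}
gpu = {
    datetime(2024, 8, 9, 10, 0): 40.0,
    datetime(2024, 8, 9, 10, 30): 42.0,
}

merged = merge_metrics(cpu_mem, gpu)
incomplete = gpu_only_intervals(merged)
```

Any pod-level calculation that reads the CPU or memory value from one of the `incomplete` intervals gets a missing value, which corresponds to the null pointer condition this PR addresses.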

As a temporary fix, we drop GPU records whose timestamps do not match a CPU and memory timestamp (within a ±5 minute buffer). Only GPU records that are in sync with CPU and memory timestamps are considered for further processing.
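One way to picture the ±5 minute filter is the sketch below. It is an illustration in Python rather than the code in this PR: the function name `drop_unmatched_gpu_records`, the data layout, and the choice to treat a gap of exactly 5 minutes as a match are assumptions. Whether the actual fix also realigns a kept GPU record to the CPU and memory timestamp is not stated here, so the sketch leaves timestamps unchanged.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Assumed tolerance, taken from the "±5 minutes buffer" in the description.
BUFFER = timedelta(minutes=5)


def drop_unmatched_gpu_records(cpu_mem_timestamps, gpu_records, buffer=BUFFER):
    """Keep only GPU records within `buffer` of some CPU/memory timestamp.

    cpu_mem_timestamps: iterable of datetimes that have CPU and memory data
    gpu_records:        timestamp -> GPU value
    """
    reference = sorted(cpu_mem_timestamps)
    kept = {}
    for ts, value in gpu_records.items():
        # Only the reference timestamps just before and just after `ts`
        # can be the nearest ones, so check those two candidates.
        i = bisect_left(reference, ts)
        neighbours = reference[max(i - 1, 0): i + 1]
        if any(abs(ts - ref) <= buffer for ref in neighbours):
            kept[ts] = value
    return kept


# Made-up sample data: two CPU/memory points, five GPU points.
cpu_mem_timestamps = [datetime(2024, 8, 9, 10, 0), datetime(2024, 8, 9, 10, 15)]
gpu_records = {
    datetime(2024, 8, 9, 10, 0): 40.0,   # exact match, kept
    datetime(2024, 8, 9, 10, 4): 41.0,   # 4 minutes after 10:00, kept
    datetime(2024, 8, 9, 10, 7): 42.0,   # 7 and 8 minutes away, dropped
    datetime(2024, 8, 9, 10, 12): 43.0,  # 3 minutes before 10:15, kept
    datetime(2024, 8, 9, 10, 30): 44.0,  # 15 minutes away, dropped
}

kept = drop_unmatched_gpu_records(cpu_mem_timestamps, gpu_records)
```

Because unmatched records are dropped before mapping, no GPU-only entry is created, at the cost of discarding GPU data that has no nearby CPU and memory sample.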

Fixes #1254

Type of change

  • Bug fix
  • New feature
  • Docs update
  • Breaking change (What changes might users need to make in their application due to this PR?)
  • Requires DB changes

How has this been tested?

Tested with the latest hackathon branch metrics profile and a manual test of recommendation generation for a Job-based workload.

  • New Test X
  • Functional testsuite

Test Configuration

  • Kubernetes clusters tested on:

Checklist 🎯

  • Followed coding guidelines
  • Comments added
  • Dependent changes merged
  • Documentation updated
  • Tests added or updated

Additional information

Include any additional information such as links, test results, screenshots here

@bharathappali bharathappali self-assigned this Aug 9, 2024
@khansaad khansaad added bug Something isn't working recommendation labels Aug 9, 2024
@khansaad khansaad added this to the Kruize 0.0.24_rm Release milestone Aug 9, 2024
Contributor

@msvinaykumar msvinaykumar left a comment

LGTM

@chandrams chandrams merged commit 23c3ccb into kruize:202407-hackathon Aug 9, 2024
2 of 3 checks passed
Labels
bug Something isn't working recommendation
Projects
Status: Done
5 participants