Intermittently no recommendations are observed for a job even though data is available #1339

Open
kusumachalasani opened this issue Oct 10, 2024 · 0 comments

Describe the bug
Intermittently, no recommendations are returned for a "job", even though data is available and recommendations for the same experiment were generated earlier.

How to reproduce it
On NERC, create an experiment using the following JSON:

[{
  "version": "v2.0",
  "experiment_name": "monitor_gpu_ttm",
  "cluster_name": "default",
  "performance_profile": "resource-optimization-local-monitoring",
  "mode": "monitor",
  "target_cluster": "local",
  "datasource": "prometheus-1",
  "kubernetes_objects": [
    {
      "type": "statefulset",
      "name": "training-ttm",
      "namespace": "unpartitioned-namespace",
      "containers": [
        {
          "container_image_name": "kruizehub/ttm:v1",
          "container_name": "training-container"
        }
      ]
    }
  ],
  "trial_settings": {
    "measurement_duration": "15min"
  },
  "recommendation_settings": {
    "threshold": "0.1"
  }
}]

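Then generate recommendations for the experiment. Intermittently, the response contains no recommendation data:

curl -X POST 'http://kruize-openshift-tuning.apps.nerc-ocp-test-2.nerc.mghpcc.org/generateRecommendations?experiment_name=monitor_gpu_ttm'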
[
  {
    "cluster_name": "default",
    "experiment_type": "container",
    "kubernetes_objects": [
      {
        "type": "job",
        "name": "training-ttm",
        "namespace": "unpartitioned-namespace",
        "containers": [
          {
            "container_image_name": "kruizehub/ttm:v1",
            "container_name": "training-container",
            "recommendations": {
              "version": "1.0",
              "notifications": {
                "120001": {
                  "type": "info",
                  "message": "There is not enough data available to generate a recommendation.",
                  "code": 120001
                }
              },
              "data": {}
            }
          }
        ]
      }
    ],
    "version": "v2.0",
    "experiment_name": "monitor_gpu_ttm"
  }
]
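
To tell the two outcomes apart without reading the JSON by eye, a small check along the following lines may help. It is only a sketch: the function names `container_status` and `response_status` are hypothetical and not part of Kruize, and it assumes the response keeps the shape shown in this issue, with notification code 120001 and an empty "data" object marking the failing case.

```python
import json

# Notification code seen in the empty response in this issue.
NOT_ENOUGH_DATA = "120001"


def container_status(container):
    """Classify one container entry as 'available', 'no_data', or 'unknown'."""
    recs = container.get("recommendations", {})
    codes = set(recs.get("notifications", {}))
    if recs.get("data"):
        # A non-empty "data" object means at least one recommendation exists.
        return "available"
    if NOT_ENOUGH_DATA in codes:
        return "no_data"
    return "unknown"


def response_status(response):
    """Map (object name, container name) to a status for a parsed response."""
    result = {}
    for experiment in response:
        for obj in experiment.get("kubernetes_objects", []):
            for container in obj.get("containers", []):
                key = (obj.get("name"), container.get("container_name"))
                result[key] = container_status(container)
    return result


# The empty response from this issue, trimmed to the fields the check reads.
sample = json.loads("""
[{
  "experiment_name": "monitor_gpu_ttm",
  "kubernetes_objects": [{
    "type": "job",
    "name": "training-ttm",
    "containers": [{
      "container_name": "training-container",
      "recommendations": {
        "notifications": {"120001": {"type": "info", "code": 120001}},
        "data": {}
      }
    }]
  }]
}]
""")

print(response_status(sample))
# {('training-ttm', 'training-container'): 'no_data'}
```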

Expected behavior
A few minutes earlier, the same request returned recommendations for the same job:

curl -X POST 'http://kruize-openshift-tuning.apps.nerc-ocp-test-2.nerc.mghpcc.org/generateRecommendations?experiment_name=monitor_gpu_ttm'
[
  {
    "cluster_name": "default",
    "experiment_type": "container",
    "kubernetes_objects": [
      {
        "type": "job",
        "name": "training-ttm",
        "namespace": "unpartitioned-namespace",
        "containers": [
          {
            "container_image_name": "kruizehub/ttm:v1",
            "container_name": "training-container",
            "recommendations": {
              "version": "1.0",
              "notifications": {
                "111000": {
                  "type": "info",
                  "message": "Recommendations Are Available",
                  "code": 111000
                }
              },
              "data": {
                "2024-10-01T20:11:00.000Z": {
                  "notifications": {
                    "224001": {
                      "type": "error",
                      "message": "Amount field is missing in the Memory Section",
                      "code": 224001
                    },
                    "524002": {
                      "type": "critical",
                      "message": "Memory Limit Not Set",
                      "code": 524002
                    },
                    "524001": {
                      "type": "critical",
                      "message": "Memory Request Not Set",
                      "code": 524001
                    },
                    "223001": {
                      "type": "error",
                      "message": "Amount field is missing in the CPU Section",
                      "code": 223001
                    },
                    "111101": {
                      "type": "info",
                      "message": "Short Term Recommendations Available",
                      "code": 111101
                    },
                    "523001": {
                      "type": "critical",
                      "message": "CPU Request Not Set",
                      "code": 523001
                    },
                    "423001": {
                      "type": "warning",
                      "message": "CPU Limit Not Set",
                      "code": 423001
                    }
                  },
                  "monitoring_end_time": "2024-10-01T20:11:00.000Z",
                  "current": {},
                  "recommendation_terms": {
                    "short_term": {
                      "duration_in_hours": 24.0,
                      "notifications": {
                        "112101": {
                          "type": "info",
                          "message": "Cost Recommendations Available",
                          "code": 112101
                        },
                        "112102": {
                          "type": "info",
                          "message": "Performance Recommendations Available",
                          "code": 112102
                        }
                      },
                      "monitoring_start_time": "2024-09-30T20:11:00.000Z",
                      "recommendation_engines": {
                        "cost": {
                          "pods_count": 1,
                          "confidence_level": 0.0,
                          "config": {
                            "requests": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            },
                            "limits": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "nvidia.com/mig-7g.40gb": {
                                "amount": 1.0,
                                "format": "cores"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            }
                          },
                          "variation": {
                            "requests": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            },
                            "limits": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            }
                          },
                          "notifications": {}
                        },
                        "performance": {
                          "pods_count": 1,
                          "confidence_level": 0.0,
                          "config": {
                            "requests": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            },
                            "limits": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "nvidia.com/mig-7g.40gb": {
                                "amount": 1.0,
                                "format": "cores"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            }
                          },
                          "variation": {
                            "requests": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            },
                            "limits": {
                              "memory": {
                                "amount": 7.6173901824E9,
                                "format": "bytes"
                              },
                              "cpu": {
                                "amount": 10.487843658239633,
                                "format": "cores"
                              }
                            }
                          },
                          "notifications": {}
                        }
                      },
                      "plots": {
                        "datapoints": 4,
                        "plots_data": {
                          "2024-10-01T08:11:00.000Z": {},
                          "2024-10-01T20:11:00.000Z": {
                            "cpuUsage": {
                              "min": 0.0,
                              "q1": 0.0,
                              "median": 10.487843658239633,
                              "q3": 10.487843658239633,
                              "max": 10.487843658239633,
                              "format": "cores"
                            },
                            "memoryUsage": {
                              "min": 1.30064384E9,
                              "q1": 1.49932032E9,
                              "median": 6.347825152E9,
                              "q3": 6.347825152E9,
                              "max": 6.347825152E9,
                              "format": "bytes"
                            }
                          },
                          "2024-10-01T14:11:00.000Z": {},
                          "2024-10-01T02:11:00.000Z": {}
                        }
                      }
                    },
                    "medium_term": {
                      "duration_in_hours": 168.0,
                      "notifications": {
                        "120001": {
                          "type": "info",
                          "message": "There is not enough data available to generate a recommendation.",
                          "code": 120001
                        }
                      }
                    },
                    "long_term": {
                      "duration_in_hours": 360.0,
                      "notifications": {
                        "120001": {
                          "type": "info",
                          "message": "There is not enough data available to generate a recommendation.",
                          "code": 120001
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        ]
      }
    ],
    "version": "v2.0",
    "experiment_name": "monitor_gpu_ttm"
  }
]
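
For comparing responses across runs, it can be easier to reduce a response like the one above to which terms actually carry recommendation engines. The sketch below does that under the assumption that the structure matches this issue's output; `summarize_terms` and `requested_amount` are hypothetical helper names, and the sample is a heavily trimmed copy of the response above.

```python
def summarize_terms(container):
    """Return {monitoring_end_time: {term: [engine names]}} for one container."""
    summary = {}
    data = container.get("recommendations", {}).get("data", {})
    for end_time, entry in data.items():
        summary[end_time] = {
            # Terms without recommendations have no "recommendation_engines" key.
            term: sorted(body.get("recommendation_engines", {}))
            for term, body in entry.get("recommendation_terms", {}).items()
        }
    return summary


def requested_amount(container, end_time, term, engine, resource):
    """Return (amount, format) for one requests entry, or None if absent."""
    try:
        item = (container["recommendations"]["data"][end_time]
                ["recommendation_terms"][term]["recommendation_engines"]
                [engine]["config"]["requests"][resource])
        return item["amount"], item["format"]
    except KeyError:
        return None


# Trimmed copy of the container entry from the successful response.
cpu_request = {"cpu": {"amount": 10.487843658239633, "format": "cores"}}
sample_container = {
    "container_name": "training-container",
    "recommendations": {
        "data": {
            "2024-10-01T20:11:00.000Z": {
                "recommendation_terms": {
                    "short_term": {
                        "recommendation_engines": {
                            "cost": {"config": {"requests": cpu_request}},
                            "performance": {"config": {"requests": cpu_request}},
                        }
                    },
                    "medium_term": {"notifications": {"120001": {"type": "info"}}},
                    "long_term": {"notifications": {"120001": {"type": "info"}}},
                }
            }
        }
    },
}

print(summarize_terms(sample_container))
print(requested_amount(sample_container, "2024-10-01T20:11:00.000Z",
                       "short_term", "cost", "cpu"))
```

For the sample, only short_term lists the cost and performance engines; medium_term and long_term come back with empty engine lists, matching the 120001 notifications in the response.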

Relevant logs
No errors were observed.

Environment:

  • OpenShift
  • 4.16
  • RHEL
@kusumachalasani kusumachalasani added the bug Something isn't working label Oct 10, 2024
@dinogun dinogun added this to the Kruize 0.2 Release milestone Oct 10, 2024
@dinogun dinogun moved this to Todo in Monitoring Oct 10, 2024
@bharathappali bharathappali self-assigned this Oct 10, 2024
@rbadagandi1 rbadagandi1 removed this from the Kruize 0.3 Release milestone Nov 29, 2024