[SARC-330] Implement alerts: proportion of GPU jobs with GPU-specific Prometheus stats on a given node lower than a threshold X #135

Merged
merged 7 commits into from
Nov 19, 2024

Conversation

notoraptor
Contributor

@nurbal! Ready for review!

…GPU-specific Prometheus stats on a given node lower than a threshold X
min_jobs_per_group: Optional[Union[int, Dict[str, int]]] = None,
nb_stddev=2,
with_gres_gpu=False,
Collaborator

This new parameter is not tested in test_check_prometheus_scraping_stats, so only the CPU-job case is tested.

@@ -81,24 +91,41 @@ def check_prometheus_stats_occurrences(
clip_time = True
df = load_job_series(start=start, end=end, clip_time=clip_time)

# Parse minimum_runtime, and select only jobs where
# elapsed time >= minimum runtime and allocated.gres_gpu == 0
Collaborator

@nurbal nurbal Nov 18, 2024

... besides, it looks like GPU jobs were simply ignored before ^^
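To make the selection discussed above concrete, here is a minimal sketch of how a `with_gres_gpu` flag could switch the filter between CPU-only and GPU jobs. It is not the code from this PR: it uses plain dicts instead of the DataFrame returned by `load_job_series`, the helper name `select_jobs` and the 5-minute default are hypothetical, and the assumption that `with_gres_gpu=True` means `allocated.gres_gpu > 0` is inferred from the diff comment only.

```python
from datetime import timedelta


def select_jobs(jobs, minimum_runtime=timedelta(minutes=5), with_gres_gpu=False):
    """Keep jobs that ran for at least `minimum_runtime`.

    Assumption: with_gres_gpu=False keeps jobs with allocated.gres_gpu == 0
    (the previous behaviour), with_gres_gpu=True keeps jobs with
    allocated.gres_gpu > 0. `select_jobs` is a hypothetical helper.
    """
    min_seconds = minimum_runtime.total_seconds()
    selected = []
    for job in jobs:
        # Drop jobs that are too short to have been scraped reliably.
        if job["elapsed_time"] < min_seconds:
            continue
        has_gpu = job["allocated.gres_gpu"] > 0
        # Keep the job only if its GPU status matches the requested mode.
        if has_gpu == with_gres_gpu:
            selected.append(job)
    return selected


# Made-up jobs, for illustration only.
jobs = [
    {"job_id": 1, "elapsed_time": 600, "allocated.gres_gpu": 0},
    {"job_id": 2, "elapsed_time": 600, "allocated.gres_gpu": 2},
    {"job_id": 3, "elapsed_time": 60, "allocated.gres_gpu": 1},
]

cpu_job_ids = [job["job_id"] for job in select_jobs(jobs)]
gpu_job_ids = [job["job_id"] for job in select_jobs(jobs, with_gres_gpu=True)]
```

With the default flag only the CPU job is kept, which matches the observation that GPU jobs were previously left out of the check.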

@@ -175,3 +201,43 @@ def check_prometheus_stats_occurrences(
logger.warning(
f"[{cluster_name}] no Prometheus data available: no job found"
)


def check_prometheus_stats_for_gpu_jobs(
Collaborator

... my mistake, I wasn't fully awake. test_check_prometheus_stats_for_gpu_jobs does test the GPU case, through the call to check_prometheus_stats_for_gpu_jobs. Not really a unit test, but that's fine with me :-)
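For readers following along, the alert named in the PR title (proportion of GPU jobs with GPU-specific Prometheus stats on a node lower than a threshold X) could be sketched roughly as below. This is an illustration under stated assumptions, not the body of `check_prometheus_stats_for_gpu_jobs`: the helper `nodes_below_threshold`, the stat names in `GPU_STATS`, the job dict layout, and the 0.5 default threshold are all hypothetical.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

# Hypothetical names for GPU-specific Prometheus stats stored on a job.
GPU_STATS = ("gpu_utilization", "gpu_memory")


def nodes_below_threshold(jobs, threshold=0.5):
    """Return {node: ratio} for nodes where the proportion of GPU jobs
    with at least one GPU-specific stat is strictly below `threshold`.
    """
    counts = defaultdict(lambda: [0, 0])  # node -> [jobs with stats, total jobs]
    for job in jobs:
        has_stats = any(job.get(name) is not None for name in GPU_STATS)
        for node in job["nodes"]:
            counts[node][1] += 1
            if has_stats:
                counts[node][0] += 1

    flagged = {}
    for node, (with_stats, total) in sorted(counts.items()):
        ratio = with_stats / total
        if ratio < threshold:
            logger.warning(
                "node %s: only %d/%d GPU jobs have GPU-specific Prometheus stats",
                node, with_stats, total,
            )
            flagged[node] = ratio
    return flagged


# Made-up GPU jobs, for illustration only.
gpu_jobs = [
    {"nodes": ["cn-a"], "gpu_utilization": 0.8, "gpu_memory": 0.4},
    {"nodes": ["cn-a"], "gpu_utilization": None, "gpu_memory": None},
    {"nodes": ["cn-b"], "gpu_utilization": None, "gpu_memory": None},
    {"nodes": ["cn-b"]},
]

flagged = nodes_below_threshold(gpu_jobs, threshold=0.5)
```

Here `cn-a` sits exactly at the threshold and is not flagged, while `cn-b` has no job with GPU stats and triggers a warning.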

@nurbal nurbal merged commit 6cd17ec into master Nov 19, 2024
7 checks passed
@notoraptor notoraptor deleted the sarc-330 branch November 19, 2024 16:05
2 participants