Feature: Adaptive interval for job failure checks #58

luator · 2024-01-11T11:10:10Z

The following discussion from !71 should be addressed:

discussion 1: (+1 comment)
I would start with a minimum interval, then double it on every subsequent call, up to a maximum interval. E.g. with a 5 s minimum and a 60 s maximum, it would go like:
- 5s
- 10s
- 20s
- 40s
- 60s
- 60s
- ...
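
A minimal sketch of the doubling schedule described above, just to make the sequence concrete. The function name `backoff_intervals` and its parameters are hypothetical and not taken from the existing code base:

```python
from itertools import islice
from typing import Iterator


def backoff_intervals(min_interval: float, max_interval: float) -> Iterator[float]:
    """Yield wait times that double on every call, capped at max_interval."""
    interval = min_interval
    while True:
        yield min(interval, max_interval)
        # Double for the next check, but never grow past the cap.
        interval = min(interval * 2, max_interval)


# First seven wait times for a 5 s minimum and a 60 s maximum.
print(list(islice(backoff_intervals(5, 60), 7)))  # [5, 10, 20, 40, 60, 60, 60]
```

A generator keeps the schedule separate from the code that actually sleeps or polls, so the same schedule could be reused by different cluster backends.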
discussion 2: Also use it for Condor

Yes, it could make sense to share the throttling implementation with slurm_cluster_system and then just use a much higher polling rate (like once a second). It's just reading a file, but the MPI cluster also does not like it if you hammer its filesystem too much.
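
One way such a shared throttle could look, as a rough sketch only. The class `Throttle`, its `ready()` method, and the injectable `clock` argument are assumptions made for illustration; none of them exist in the project as far as this thread shows:

```python
import time
from typing import Callable, Optional


class Throttle:
    """Allow an action at most once per `min_interval_s` seconds (hypothetical helper)."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock  # injectable so the helper can be tested without sleeping
        self._last_run: Optional[float] = None

    def ready(self) -> bool:
        """Return True (and record the time) if enough time has passed since the last run."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self.min_interval_s:
            self._last_run = now
            return True
        return False


# Each backend would pick its own interval, e.g. one second for a cheap file read.
file_check_throttle = Throttle(min_interval_s=1.0)
if file_check_throttle.ready():
    pass  # the actual failure check would go here
```

Using a monotonic clock avoids surprises when the wall-clock time is adjusted while jobs are running.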

luator · 2024-01-11T11:21:19Z

The purpose of this is mostly to quickly detect if something is fundamentally wrong that makes all jobs fail, right? That is, is it enough to do this only once at the beginning? I was wondering if it would make sense to reset the interval when new jobs are submitted, but depending on the number and duration of jobs this might again lead to over-polling the system.

I'd probably start with a slightly higher value (let's say 10s) but increase a bit more slowly, as in my experience so far it sometimes takes a bit until Slurm actually starts the job, so lots of checking in the very beginning might not be that useful.
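
To illustrate the two ideas above (a higher start value with slower growth, and an optional reset when new jobs are submitted), here is a small stateful sketch. The class name `AdaptiveInterval`, the method names, and especially the growth factor of 1.5 are placeholders chosen for the example, not agreed values:

```python
class AdaptiveInterval:
    """Wait time that grows by a constant factor up to a cap (hypothetical helper)."""

    def __init__(self, start_s: float = 10.0, max_s: float = 60.0, growth: float = 1.5) -> None:
        if start_s <= 0 or max_s < start_s or growth < 1:
            raise ValueError("need 0 < start_s <= max_s and growth >= 1")
        self.start_s = start_s
        self.max_s = max_s
        self.growth = growth
        self._current = start_s

    def next_wait(self) -> float:
        """Return the current wait time, then grow it for the following call."""
        wait = self._current
        self._current = min(self._current * self.growth, self.max_s)
        return wait

    def reset(self) -> None:
        """Go back to the start value, e.g. after a new batch of jobs is submitted."""
        self._current = self.start_s


interval = AdaptiveInterval()
print([interval.next_wait() for _ in range(6)])  # [10.0, 15.0, 22.5, 33.75, 50.625, 60.0]
```

Whether `reset()` should be called on every submission is exactly the open question: with many short jobs it would keep the interval near the start value and bring back the over-polling.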

By Felix Widmaier on 2024-01-11T12:21:19 (imported from GitLab)

luator added the question Further information is requested label Apr 16, 2024

mseitzer removed the question Further information is requested label Apr 16, 2024
