Feature: Adaptive interval for job failure checks #58

Open
2 tasks
luator opened this issue Jan 11, 2024 · 1 comment

luator commented Jan 11, 2024

The following discussion from !71 should be addressed:

  • discussion 1: (+1 comment)

    I would start with a min interval, then double this interval on every subsequent call, up to a max interval. E.g. with a 5 s min and a 60 s max, it would go like:

    • 5s
    • 10s
    • 20s
    • 40s
    • 60s
    • 60s
    • ...
  • discussion 2: Also use it for Condor

    Yes, it could make sense to share the throttling implementation with slurm_cluster_system and then just use a much higher threshold (like once a second). It's just reading a file, but the MPI cluster also does not like it if you hammer its filesystem too much.
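The doubling schedule from discussion 1 can be sketched as a small generator. This is only an illustration, not code from the project: the name `backoff_intervals` and its parameters are hypothetical, and the defaults simply mirror the 5 s / 60 s example above.

```python
from itertools import islice
from typing import Iterator


def backoff_intervals(
    min_interval: float = 5.0,
    max_interval: float = 60.0,
    factor: float = 2.0,
) -> Iterator[float]:
    """Yield wait times that grow by `factor` from min_interval, capped at max_interval.

    Hypothetical helper for illustration; not part of the project's API.
    """
    if min_interval <= 0 or max_interval < min_interval or factor < 1:
        raise ValueError("need 0 < min_interval <= max_interval and factor >= 1")
    interval = min_interval
    while True:
        yield interval
        # Grow the interval, but never beyond the configured maximum.
        interval = min(interval * factor, max_interval)


first_six = list(islice(backoff_intervals(), 6))
print(first_six)  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

Keeping `factor` as a parameter would let the same helper serve a slower-growing schedule, or a fixed short interval for the Condor case by setting `min_interval == max_interval`.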

luator commented Jan 11, 2024

The purpose of this is mostly to quickly detect if something is fundamentally wrong that makes all jobs fail, right? That is, it is enough to do this only once in the beginning? I was wondering if it would make sense to reset when new jobs are submitted, but depending on the number and duration of jobs this might again lead to over-polling the system.

I'd probably start with a slightly higher value (let's say 10s) but increase a bit more slowly, as in my experience so far it sometimes takes a bit until Slurm actually starts the job, so lots of checking in the very beginning might not be that useful.
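A rough sketch of what such a throttle could look like, with a 10 s start and slower growth. Everything here is an assumption for illustration: the class name `AdaptiveThrottle`, the growth factor of 1.5, the 60 s cap, and the choice to wait one full interval before the first check are not taken from the project. The clock is injectable so the behaviour can be tried without real waiting.

```python
import time
from typing import Callable


class AdaptiveThrottle:
    """Hypothetical helper: permits a check only after a growing interval has elapsed."""

    def __init__(
        self,
        initial: float = 10.0,   # assumed start value ("let's say 10s")
        maximum: float = 60.0,   # assumed cap
        factor: float = 1.5,     # assumed slower-than-doubling growth
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._initial = initial
        self._maximum = maximum
        self._factor = factor
        self._clock = clock
        self.reset()

    def reset(self) -> None:
        """Go back to the initial interval (e.g. if one decides to reset on new submissions)."""
        self._interval = self._initial
        self._next_check = self._clock() + self._interval

    def ready(self) -> bool:
        """Return True if a check is due, and schedule the next one further out."""
        now = self._clock()
        if now < self._next_check:
            return False
        self._interval = min(self._interval * self._factor, self._maximum)
        self._next_check = now + self._interval
        return True


# Simulate two minutes with a fake clock instead of sleeping.
fake_clock = {"now": 0.0}
throttle = AdaptiveThrottle(clock=lambda: fake_clock["now"])
allowed_at = []
for second in range(121):
    fake_clock["now"] = float(second)
    if throttle.ready():
        allowed_at.append(second)
print(allowed_at)  # [10, 25, 48, 82]
```

Whether `reset()` should be called when new jobs are submitted is exactly the open question above; the sketch only makes it possible, it does not decide it.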

By Felix Widmaier on 2024-01-11T12:21:19 (imported from GitLab)

@luator added the "question" label (Further information is requested) on Apr 16, 2024
@mseitzer removed the "question" label (Further information is requested) on Apr 16, 2024