Blackhole Detection: stop accepting jobs if they are consumed at a rate higher than the configured limit and declare the Glidein a blackhole #331
Labels
FEATURE
For FEATURES
Comments
mambelli
changed the title
Blackhole Detection Logs: Publishing Expression/ClassAd Attributes to the StartdLogs
Blackhole Detection: stop accepting jobs if they are consumed at a rate higher than the configured limit
Jan 10, 2024
From the discussion at the monthly HTCondor-Fermilab meetings:
mambelli
changed the title
Blackhole Detection: stop accepting jobs if they are consumed at a rate higher than the configured limit
Blackhole Detection: stop accepting jobs if they are consumed at a rate higher than the configured limit and declare the Glidein a blackhole
Jan 18, 2024
Material about the topic:
Plan of work:
Sometimes, even though the Glidein tests succeed, a Glidein quickly fails every job that runs on it and keeps asking for new ones.
This behavior is nicknamed a blackhole.
By setting a maximum consumption rate, and by declaring unfit (and unable to accept or match new jobs) any Glidein that consumes jobs faster than the configured rate, we can protect the system from undetected failures.
This is the mechanism to be implemented by this ticket.
A new HTCSS feature called STARTD_LATCH_EXPRS (https://opensciencegrid.atlassian.net/browse/HTCONDOR-171) should be useful to implement an irreversible condition triggering the blackhole status.
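To illustrate the intended behavior, here is a minimal Python sketch of a rate limit with an irreversible (latched) blackhole flag. It is not HTCondor or GlideinWMS code: the class name, the `max_jobs` and `window_seconds` parameters, and the sliding-window approach are all assumptions made for illustration. The actual mechanism would likely be expressed through ClassAd expressions in the startd configuration.

```python
from collections import deque


class BlackholeDetector:
    """Latch a 'blackhole' flag when jobs start faster than a configured rate."""

    def __init__(self, max_jobs: int, window_seconds: float):
        self.max_jobs = max_jobs              # hypothetical limit: job starts allowed per window
        self.window_seconds = window_seconds  # hypothetical sliding-window length
        self.starts = deque()                 # job start times inside the current window
        self.is_blackhole = False             # latched: never reset once True

    def record_job_start(self, now: float) -> bool:
        """Record a job start at time `now`; return True if new jobs may still be accepted."""
        if self.is_blackhole:
            return False
        self.starts.append(now)
        # Drop job starts that have fallen out of the sliding window.
        while self.starts and now - self.starts[0] > self.window_seconds:
            self.starts.popleft()
        # More starts than allowed within the window: declare a blackhole for good.
        if len(self.starts) > self.max_jobs:
            self.is_blackhole = True
        return not self.is_blackhole
```

Once the flag is set it stays set, even if the job start rate later drops. This mirrors the irreversible condition that STARTD_LATCH_EXPRS is expected to help implement.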
This feature request is being ported from Redmine ticket R#23253 to 3.10.x
Blackhole Detection Logs: Publishing Expression/ClassAd Attributes to the StartdLogs
Description:
The result of the expression that evaluates whether a node is a blackhole (R#19214) is published in the machine ClassAd. We would like to see it in the StartdLog as well. The logs about blackhole detection are covered in the condor logs and the Glidein logs (client directory in the Factory).
We discussed with the HTCondor team how to publish an expression/ClassAd attribute to the StartdLog and/or an external file. Among the ideas considered, it emerged that we cannot use a hook triggered each time an attribute changes value, as that would be impractical.
Using a startd_cron job that periodically checks and publishes the value (a script that accesses the machine ClassAd and writes out the value of interest) could work, although it would not be easy to write a message to the StartdLog that way.
TODO: Implement the periodic check and the writing to the startd logs with startd_cron.
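As a rough sketch of what such a periodic script could do, the Python below parses machine ClassAd text in the "Attribute = value" long format, picks out the attributes of interest, and appends a timestamped line to an external file. Everything here is an assumption for illustration: the attribute names `EXAMPLE_BLACKHOLE` and `EXAMPLE_JOB_STARTS`, the `BlackholeCheck` tag, and the log line layout are hypothetical. How the script obtains the ClassAd text is left out.

```python
import time


def parse_classad_text(text: str) -> dict:
    """Parse 'Attr = value' lines into a dict of raw (unevaluated) strings."""
    ad = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values such as 'A == B' stay intact.
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def format_log_line(ad: dict, attrs: list, now: float) -> str:
    """Build one log line with the requested attributes (UNDEFINED if missing)."""
    stamp = time.strftime("%m/%d/%y %H:%M:%S", time.localtime(now))
    fields = " ".join(f"{a}={ad.get(a, 'UNDEFINED')}" for a in attrs)
    return f"{stamp} BlackholeCheck: {fields}"  # hypothetical tag and layout


def append_log(path: str, line: str) -> None:
    """Append the line to an external file, since writing to the StartdLog is not easy."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

A startd_cron job could call these functions on each run, for example `append_log(path, format_log_line(parse_classad_text(text), ["EXAMPLE_BLACKHOLE"], time.time()))`, where `path` and `text` are supplied by the surrounding script.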
NOTE to keep in mind from TJ: Check if there is something in the glidein mechanism that would give us a cron-like place to put a hook that would be better than using STARTD_CRON.
This feature request is related to Redmine Feature R#19214 - Add a configurable limit to the rate of jobs running and fail the glidein if the rate is passed. The notes/description of R#19214, carried over from Redmine, is attached: glideinwms-19214.pdf