Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle HTCondor "Unable to locate local daemon" Error #12172

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

hassan11196
Copy link
Member

Fixes #9703

Status

in development

Description

This pull request adds Exception handling to catch this Unable to locate local daemon error thrown when a schedd instance is created i.e. schedd = htcondor.Schedd() in the SimpleCondorPlugin.py

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

No

@hassan11196 hassan11196 self-assigned this Nov 19, 2024
@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 1 warnings
    • 45 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/89/artifact/artifacts/PullRequestReport.html

@hassan11196
Copy link
Member Author

Hi @amaltaro,

I have wrapped the bossAir submit call in JobSubmitterPoller and the track call in StatusPoller.
however, as we are re-raising the Exception wouldn't the issue remain?

Should I modify the Exception type to something specific and handle it somewhere upstream?
Let me know what you think.

Thanks.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hassan11196 thank you for proposing this fix.

This is not a complete review, as I need to look into the exception propagation with more attention, but I think we should move all lines schedd = htcondor.Schedd() in this module under the try/except clause as well (according to the tracebacks reported in the original issue).

With that, I believe some of these try/except that you provided are no longer relevant.

@hassan11196
Copy link
Member Author

@hassan11196 thank you for proposing this fix.

This is not a complete review, as I need to look into the exception propagation with more attention, but I think we should move all lines schedd = htcondor.Schedd() in this module under the try/except clause as well (according to the tracebacks reported in the original issue).

With that, I believe some of these try/except that you provided are no longer relevant.

I looked at all of schedd = htcondor.Schedd() instances and thier exceptions were handled upstream in several places, but I agree that since its the source we should wrap it, maybe refactor it into a separate method with exception handling.

Let me know what you think. Thanks for looking into it.

@mapellidario mapellidario self-requested a review December 9, 2024 15:12
Copy link
Member

@mapellidario mapellidario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hassan11196 , changes look good and reflect the discussion you had with alan in the original issue. I have a small question though :)

Comment on lines 764 to 765
myThread.transaction.rollback()
raise WMException(msg) from ex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure i understand the need for this rollback, since it seems that if you raise a wmexception that the rollback already happens here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mapellidario, I agree it was redundant, I have received the rollback from the submitJobs method.

@hassan11196
Copy link
Member Author

Thank you for the review @mapellidario, can you provide a suggestion on how to handle the exception thrown at
https://github.com/hassan11196/WMCore/blob/handle-condor-local-daemon/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L764

That is re-raised in the algorithm method?
https://github.com/hassan11196/WMCore/blob/024268088c77e29805feb2fd44c8ca529f1ee7bc/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L842

If this is not handled, the component will crash

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 1 warnings
    • 45 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/178/artifact/artifacts/PullRequestReport.html

@mapellidario
Copy link
Member

I thought about this for a while. The only way out that I see is adding a new exception, let's say WMSoftException, that you throw it when bossair fail (L765), that you catch it one level up (L840), that you rollback the transaction without raising again.

you can start defining the new exception in the jobsubmitterpoller module, than in case we need it in more places we can create a new file, even a simple would do for the time being:

class WMSoftException(Exception):
    pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JobStatusLite crashing while tracking jobs - unable to locate daemon
4 participants