Make SiteListPoller more verbose #12253

Merged
merged 1 commit into from
Feb 7, 2025


Contributor

@amaltaro amaltaro commented Feb 6, 2025

Fixes #12039

Status

In development

Description

The component no longer crashes, but it does not seem to be doing what it is supposed to do. This PR adds extra log records so that we can see what is going on inside the component.

UPDATE: listing active workflows in the agent with self.listActiveWflows.execute() only works once the workflow has been acquired from the local workqueue (LQ) into WMBS. This means that workflows with a single workqueue element (WQE) could potentially skip the site list update process. The solution is to also look into local workqueue elements in Available status.
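
A minimal sketch of that idea, not the code in this PR. The helper name `find_workflows_to_update` and the input shapes (a list of workflow names taken from WMBS, and local workqueue elements as dicts with hypothetical `RequestName` and `Status` keys) are assumptions made for illustration only.

```python
def find_workflows_to_update(wmbs_workflows, local_elements):
    """Return sorted names of workflows that are either active in WMBS or
    only present in the local workqueue with elements in Available status.

    Hypothetical helper: the input shapes are assumptions, not WMCore APIs.
    """
    # Workflows already acquired from the local workqueue into WMBS
    active = set(wmbs_workflows)
    # Workflows not yet in WMBS, but with Available local workqueue elements
    not_yet_active = {
        elem["RequestName"]
        for elem in local_elements
        if elem.get("Status") == "Available"
    }
    # Join both lists, dropping duplicates
    return sorted(active | not_yet_active)


wmbs = ["wf_A", "wf_B"]
elements = [
    {"RequestName": "wf_B", "Status": "Running"},
    {"RequestName": "wf_C", "Status": "Available"},
    {"RequestName": "wf_D", "Status": "Done"},
]
print(find_workflows_to_update(wmbs, elements))  # ['wf_A', 'wf_B', 'wf_C']
```

In this toy input, `wf_C` is the case that was being missed: it has no WMBS record yet, but it does have an Available element in the local workqueue.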

Summary of changes:

  • identify both active and not-yet-active workflows
  • remove unused code for the on-disk pickle file (given that we now update the spec in the local workqueue database)
  • make the worker thread more verbose, so that we can see what is going on

Is it backward compatible (if not, which system does it affect)?

YES

Related PRs

Complement to #12123, #12232 and #12245

External dependencies / deployment changes

None

@dmwm-bot

dmwm-bot commented Feb 6, 2025

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 2 warnings
    • 4 comments to review
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/337/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Feb 6, 2025

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 4 comments to review
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/339/artifact/artifacts/PullRequestReport.html

Find not-yet-active workflows; remove unused spec logic

Join lists of workflows; sanitize correctly
@amaltaro
Contributor Author

amaltaro commented Feb 6, 2025

I have updated the initial PR description. There are 2 modes of active work to be considered:

  1. active work in the agent (which we were basing on WMBS data)
  2. active work in the agent, but not yet inserted into WMBS (hence, only active at the local workqueue level)

I did not foresee case 2) above, so the current implementation was missing it.

I have modified the code so that:

  • it identifies both active and not-yet-active workflows
  • unused code for the on-disk pickle file is removed (given that we now update the spec in the local workqueue database)
  • the worker thread is more verbose, so that we can see what is going on

With this, I tested that:

  • local workqueue elements are properly updated (only those in the workqueue database and in Available status). This includes both SiteWhitelist and SiteBlacklist
  • if nothing changes in the spec, the component does NOT try to update the elements/spec again (as there are no changes!)
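
The no-op behaviour in the second bullet can be sketched as follows. This is an illustration under assumptions, not the component's actual code: `site_list_changes` and `log_changes` are hypothetical names, the site lists are compared as sets (my assumption, so that ordering alone does not trigger an update), and the sample values mirror the example log in this comment.

```python
import logging

logger = logging.getLogger("SiteListPollerSketch")


def site_list_changes(current, new):
    """Return {field: (old, new)} for the site lists that actually differ.

    Hypothetical helper; lists are compared as sets (an assumption), so a
    different ordering alone does not count as a change.
    """
    changes = {}
    for field in ("siteWhitelist", "siteBlacklist"):
        old_sites = current.get(field, [])
        new_sites = new.get(field, [])
        if set(old_sites) != set(new_sites):
            changes[field] = (old_sites, new_sites)
    return changes


def log_changes(workflow, changes):
    """Log one record per changed field; return True if an update is needed."""
    if not changes:
        logger.info("No site list changes for %s, skipping update", workflow)
        return False
    logger.info("Updating %s:", workflow)
    for field, (old_sites, new_sites) in changes.items():
        logger.info("  %s %s => %s", field, old_sites, new_sites)
    return True


current = {"siteWhitelist": ["T2_CH_CERN", "T1_US_FNAL"], "siteBlacklist": []}
new = {"siteWhitelist": ["T2_CH_CERN"], "siteBlacklist": ["T1_US_FNAL"]}
changes = site_list_changes(current, new)
needs_update = log_changes("example_workflow", changes)
```

Returning an empty dict when nothing differs keeps the "skip the update" decision in one place, and the same dict drives the `old => new` log records.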

Example of log:

2025-02-06 20:03:28,442:139624890496768:INFO:SiteListPoller:There is a total of 2 common active workflows in the agent and wmstats
2025-02-06 20:03:28,467:139624890496768:INFO:SiteListPoller:Updating amaltaro_TaskChain_ProdMinBias_Nvidia_Agent239_Val_250206_141329_341:
2025-02-06 20:03:28,468:139624890496768:INFO:SiteListPoller:  siteWhitelist ['T2_CH_CERN', 'T1_US_FNAL'] => ['T2_CH_CERN']
2025-02-06 20:03:28,468:139624890496768:INFO:SiteListPoller:  siteBlacklist [] => ['T1_US_FNAL']
2025-02-06 20:03:28,577:139624890496768:INFO:SiteListPoller:Successfully updated elements for workflow 'amaltaro_TaskChain_ProdMinBias_Nvidia_Agent239_Val_250206_141329_341', under WQ states: ['Available'] and spec at: http://127.0.0.1:5984/workqueue/amaltaro_TaskChain_ProdMinBias_Nvidia_Agent239_Val_250206_141329_341/spec

@dmwm-bot

dmwm-bot commented Feb 6, 2025

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 4 comments to review
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/340/artifact/artifacts/PullRequestReport.html

Member

@mapellidario mapellidario left a comment

thanks alan, looks good to me

@amaltaro amaltaro merged commit ea27dbc into dmwm:master Feb 7, 2025
3 checks passed
Successfully merging this pull request may close these issues.

Update workflow spec files in WMAgents upon site list changes