Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support change to SiteWhitelist/SiteBlacklist in ReqMgr2 for active workflows #12037

Closed
amaltaro opened this issue Jul 11, 2024 · 7 comments · Fixed by #12077, #12120, #12108 or #12111
Closed

Support change to SiteWhitelist/SiteBlacklist in ReqMgr2 for active workflows #12037

amaltaro opened this issue Jul 11, 2024 · 7 comments · Fixed by #12077, #12120, #12108 or #12111

Comments

@amaltaro
Copy link
Contributor

amaltaro commented Jul 11, 2024

Impact of the new feature
ReqMgr2

Is your feature request related to a problem? Please describe.
This is a sub-task of this meta-issue: #8323
Towards providing a feature that allows workflow site lists to be changed while workflows are active in the system.

Describe the solution you'd like
With this ticket, we are supposed to implement changes to ReqMgr2 such that SiteWhitelist and SiteBlacklist are allowed to be updated in workflows that are in one of the given statuses:

staging, acquired, running-open

Note that this change needs to be persisted in two places:

  • workflow high level description (json)
  • workflow spec file (the WMWorkload object file)

If possible, we should also validate whether the sitelists is actually changing or not. If none of them change, then nothing should be persisted in the database and no extra actions need to be triggered.

The reason we are not going to accept these changes for workflows in assigned and staged is such that we can guarantee no data race between services dealing with the same workflow.

An initial point for this development will have to be made here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/DataStructs/RequestStatus.py#L118

Describe alternatives you've considered
Unless explicitly requested, I don't think we need to add these capabilities to the ReqMgr2 UI.

Additional context
None

@amaltaro
Copy link
Contributor Author

@todor-ivanov it looks like a bug sneaked in with this development. Can you please look ASAP into:
#12107

I don't think we need to reopen this ticket, but please make the relevant reference in the upcoming PR.

@amaltaro
Copy link
Contributor Author

And here we go with a new issue for ACDC workflows being assigned through the WEB UI:

[26/Sep/2024:18:18:26]  Updating request "amaltaro_ACDC_ReReco_MINIAOD_HG2409_Val_240926_181808_1164" with these user-provided args: {'RequestStatus': 'assigned', 'RequestPriority': 313000, 'Team': 'testbed-vocms0193', 'SiteWhitelist': ['
T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': '', 'AcquisitionEra': 'Integ_Test', 'ProcessingString': 'ReReco_MINIAOD_HG2409_Val_Alanv11', 'ProcessingVersion': 11, 'Dashboard': 'integration', 'MergedLFNBase': '/store/backfill/1', 'TrustSit
elists': 'False', 'UnmergedLFNBase': '/store/unmerged', 'MinMergeSize': 2147483648, 'MaxMergeSize': 4294967296, 'MaxMergeEvents': 100000000, 'BlockCloseMaxWaitTime': 66400, 'BlockCloseMaxFiles': 500, 'BlockCloseMaxEvents': 25000000, 'Bloc
kCloseMaxSize': 5000000000000, 'SoftTimeout': 129600, 'GracePeriod': 300, 'TrustPUSitelists': 'False', 'CustodialSites': '', 'NonCustodialSites': '', 'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED'}, 'SubscriptionPriority': 'Low'}
[26/Sep/2024:18:18:26]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 189, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 101, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 104, in validate_request_update_args
    workload.validateArgumentForAssignment(reqArgsDiff)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 1945, in validateArgumentForAssignment
    validateArgumentsUpdate(schema, argumentDefinition)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 293, in validateArgumentsUpdate
    _validateArgumentOptions(arguments, argumentDefinition, "assign_optional")
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 153, in _validateArgumentOptions
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: Argument 'Team' is mandatory and must be provided by the user. Its definition is: {'default': '', 'type': <function safeStr at 0x7f41e1df20d0>, 'assign_optional': False, 'validate': <function StdBase.getWorkloadAssignArgs.<locals>.<lambda> at 0x7f41dc037af0>, 'optional': True, 'null': False, 'attr': 'team'}

From what I can tell, the problem is with the function _validate_request_allowed_args in:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/Utils/Validation.py#L92C19-L92C49

which is comparing values provided by the user vs values already in the workload. Given that this is an ACDC workflow, it comes with a bunch of those attributes inherited from the parent workflow.

In other words, while this logic of checking what the user provided against the actual values work for standard workflows, it won't work for Resubmission ones. I am actually suspicious that workflow assignment might even be missing the call to some setters, given that it gets assigned with information already in the workload (e.g. ProcessingString, AcquisitionEra, etc).

I am reopening this ticket.

@amaltaro amaltaro reopened this Sep 26, 2024
@amaltaro amaltaro moved this from Done to In Progress in WMCore quarterly developments Sep 26, 2024
@todor-ivanov
Copy link
Contributor

hi @amaltaro , well the origin of the problem is that with ACDCs we call the validation method for a workflow in status Assigned and we end up here:

elif reqArgsDiff["RequestStatus"] == 'assigned':
workload.validateArgumentForAssignment(reqArgsDiff)

I am currently debugging this exact workflow from testbed, trying to fully simulate the conditions. I'll post my findings here

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Sep 27, 2024

and here is the diff between the dictionaries upon applying _validate_request_allowed_args:

In [17]: test.assertDictEqual(reqArgs, reqArgsDiff)
...
AssertionError: {'Req[48 chars]00, 'Team': 'testbed-vocms0193', 'SiteWhitelis[754 chars]Low'} != {'Req[48 chars]00, 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CER[123 chars]: ''}
- {'AcquisitionEra': 'Integ_Test',
-  'BlockCloseMaxEvents': 25000000,
-  'BlockCloseMaxFiles': 500,
-  'BlockCloseMaxSize': 5000000000000,
-  'BlockCloseMaxWaitTime': 66400,
-  'CustodialSites': '',
? ^

+ {'CustodialSites': '',
? ^

-  'Dashboard': 'integration',
-  'GracePeriod': 300,
-  'MaxMergeEvents': 100000000,
-  'MaxMergeSize': 4294967296,
-  'MergedLFNBase': '/store/backfill/1',
-  'MinMergeSize': 2147483648,
   'NonCustodialSites': '',
-  'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED'},
-  'ProcessingString': 'ReReco_MINIAOD_HG2409_Val_Alanv11',
-  'ProcessingVersion': 11,
   'RequestPriority': 313000,
   'RequestStatus': 'assigned',
   'SiteBlacklist': '',
   'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'],
-  'SoftTimeout': 129600,
-  'SubscriptionPriority': 'Low',
-  'Team': 'testbed-vocms0193',
   'TrustPUSitelists': 'False',
-  'TrustSitelists': 'False',
?                           ^

+  'TrustSitelists': 'False'}
?                           ^

-  'UnmergedLFNBase': '/store/unmerged'}

Everything with - is cut off.... So no way for such a request to be validated later from validateArgumentForAssignment:

workload.validateArgumentForAssignment(reqArgsDiff)

And the reason those being cut off during this action is because none of the request arguments passed here exist in the map of allowed actions per status Assigned, here:

ALLOWED_ACTIONS_FOR_STATUS = {
"new": ["RequestPriority"],
"assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist",
"AcquisitionEra", "ProcessingString", "ProcessingVersion",
"Dashboard", "MergedLFNBase", "TrustSitelists",
"UnmergedLFNBase", "MinMergeSize", "MaxMergeSize",
"MaxMergeEvents", "BlockCloseMaxWaitTime",
"BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize",
"SoftTimeout", "GracePeriod",
"TrustPUSitelists", "CustodialSites",
"NonCustodialSites", "Override",
"SubscriptionPriority"],
"assigned": ["RequestPriority"],
"staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
"staged": ["RequestPriority"],
"acquired": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
"running-open": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
"running-closed": ["RequestPriority"],
"failed": [],
"force-complete": [],
"completed": [],
"closed-out": [],
"announced": [],
"aborted": [],
"aborted-completed": [],
"rejected": [],
"normal-archived": [],
"aborted-archived": [],
"rejected-archived": [],
}

The way I can see out of this is - when it comes to assigned status we actually validate and return everything as it is defined in reqArgs rather than reqArgsDiff

@amaltaro,
I'll create a fix and will ask for a review.

@todor-ivanov
Copy link
Contributor

let me stand myself corrected.....
Here:

And the reason those being cut off during this action is because none of the request arguments passed here exist in the map of allowed actions per status Assigned, here

Is actually not the true reason for cutting off these parameters, because we check the allowedParameters from the source status not from the destination, meaning we take the set of allowed parameters for the current status of the workflow, which is:

In [65]: request = reqApi.reqmgr_db_service.getRequestByNames(reqName)[reqName]

In [66]: request['RequestStatus']
Out[66]: 'assignment-approved'

and as one can see all of those are allowed for assignment-approved , the actual reason for cutting them is because they have not changed with the current parameter update call through the REST. and only those which did change were preserved .... (it is for a reason that the result returned is called reqArgsDiff at the end)

This whole thing goes in support to the solution I suggested in my previous comment.

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Sep 27, 2024

and to be even more strict in the investigation, some of those request parameters differ between the current request and the new set of arguments sent through the REST interface not by value... but by type:

In [78]: for param in reqArgsDiff:
    ...:     pprint(f'request[{param}]:{request[param]}: {type(request[param])}')
    ...:     pprint(f'reqArgs[{param}]:{reqArgs[param]}: {type(reqArgs[param])}')
    ...:     print('')
    ...: 
"request[RequestStatus]:assignment-approved: <class 'str'>"
"reqArgs[RequestStatus]:assigned: <class 'str'>"

"request[RequestPriority]:3: <class 'int'>"
"reqArgs[RequestPriority]:313000: <class 'int'>"

"request[SiteWhitelist]:['T2_CERN_CH', 'T1_US_FNAL']: <class 'list'>"
"reqArgs[SiteWhitelist]:['T1_US_FNAL', 'T2_CH_CERN']: <class 'list'>"

"request[SiteBlacklist]:['T1_US_FNAL']: <class 'list'>"
"reqArgs[SiteBlacklist]:: <class 'str'>"

"request[TrustSitelists]:False: <class 'bool'>"
"reqArgs[TrustSitelists]:False: <class 'str'>"

"request[TrustPUSitelists]:False: <class 'bool'>"
"reqArgs[TrustPUSitelists]:False: <class 'str'>"

"request[CustodialSites]:[]: <class 'list'>"
"reqArgs[CustodialSites]:: <class 'str'>"

"request[NonCustodialSites]:[]: <class 'list'>"
"reqArgs[NonCustodialSites]:: <class 'str'>"

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Sep 27, 2024

and here is the suggested solution: #12120
@amaltaro please take a look and maybe even apply the patch in testbed and check if it fixes the issue with this exact workflow that you used in the validation above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment