Separate state transition validation from nonState transition validations #12120

Conversation

todor-ivanov
Contributor

@todor-ivanov todor-ivanov commented Sep 27, 2024

Fixes #12037

Status

READY

Description

SUPERSEDES: #12146 && #12148

With the current change we allow SiteLists related actions for statuses: staging, acquired, running-open in ReqMgr2

Before making a call to central services to change any of the request parameters, an additional step is executed to check which parameter modifications are allowed for the given status, and whether the new values provided with the REST call actually differ from the workflow parameters already defined in central CouchDB.
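
As a rough illustration of the check described above, the sketch below filters an update down to the arguments that are both allowed for the current status and different from the stored values. The map `ALLOWED_ARGS_BY_STATUS`, the function `filter_changed_args` and the choice to raise on disallowed arguments are assumptions made for this example, not the code in this PR.

```python
# Hypothetical per-status map of modifiable parameters; the real map in
# ReqMgr2 may contain different entries.
ALLOWED_ARGS_BY_STATUS = {
    "staging": {"RequestPriority", "SiteWhitelist", "SiteBlacklist"},
    "acquired": {"RequestPriority", "SiteWhitelist", "SiteBlacklist"},
    "running-open": {"RequestPriority", "SiteWhitelist", "SiteBlacklist"},
}


def filter_changed_args(status, new_args, stored_args):
    """Return only the allowed arguments whose values differ from the stored ones."""
    allowed = ALLOWED_ARGS_BY_STATUS.get(status, set())
    rejected = sorted(set(new_args) - allowed)
    if rejected:
        raise ValueError(f"Arguments not allowed in status '{status}': {rejected}")
    return {name: value for name, value in new_args.items()
            if stored_args.get(name) != value}


stored = {"RequestPriority": 100000, "SiteWhitelist": ["T1_US_FNAL"]}
update = {"RequestPriority": 200000, "SiteWhitelist": ["T1_US_FNAL"]}
changed = filter_changed_args("acquired", update, stored)
print(changed)  # {'RequestPriority': 200000}
```

Only the priority survives the filter here, because the site whitelist in the update is identical to the stored one.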

With the current change we no longer ignore all the rest of the request arguments provided with the REST call in the cases of a change of the Request priority. See: #8457 (comment)

The current change is also meant to address the issue with vanishing parameters during assignment of ACDC workflows, as explained here: #12037 (comment). The issue would have manifested itself for regular workflows as well, had they experienced any parameter change in the state transition from assignment-approved to assigned. Currently this transition is handled by Unified (and I believe no parameter change was happening during this automated step), and the only manual intervention we perform at this state transition is for ACDC workflows, which is why we noticed the misbehavior with an ACDC workflow.

With the current PR we suggest a separate logic in the validation function between state and non-state transition workflow updates.
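
A minimal sketch of that split, assuming the presence of `RequestStatus` in the arguments is what distinguishes the two cases. The handler names `_handleNoStatusUpdate` and `_handleAssignmentStateTransition` appear later in this thread; `choose_handler` and the `"other-state-transition"` placeholder are hypothetical.

```python
def choose_handler(current_status, request_args):
    """Return the name of the handler an update would be routed to."""
    new_status = request_args.get("RequestStatus")
    if new_status is None:
        # No status in the arguments: a plain parameter update.
        return "_handleNoStatusUpdate"
    if (current_status, new_status) == ("assignment-approved", "assigned"):
        return "_handleAssignmentStateTransition"
    # Placeholder for the remaining transitions, which are not detailed here.
    return "other-state-transition"
```

For example, `choose_handler("acquired", {"RequestPriority": 200000})` returns `"_handleNoStatusUpdate"`.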

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

This PR provides a cherry-pick of 3 pull requests that have been recently reverted. Changes have been originally provided by:
#12077
#12108
#12111

External dependencies / deployment changes

No

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 7 warnings
    • 46 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15251/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@todor-ivanov as we have already found 2 issues with this development, I would like to ask you to run a representative validation for this fix. By representative, I can think of at least the following use cases:

  • create and assign standard workflow
  • create and assign ACDC workflow
  • change priority for active workflow
  • change site lists for active workflow
  • change priority and site lists for active workflow

Once we cover all these use cases, then you could look into fixing the unit tests as well.

@cmsdmwmbot

Can one of the admins verify this patch?

@amaltaro
Contributor

amaltaro commented Oct 3, 2024

@todor-ivanov I am trying to organize WM central services upgrade and I need to know how this validation is progressing and when you think you can finish it? If you think it can still take a few days, then we might actually revert the 2 changes that went in and give you the time you need to validate this.

@todor-ivanov
Contributor Author

@amaltaro I am pretty sure I will not be able to finish this before mid next week.

@todor-ivanov
Contributor Author

todor-ivanov commented Oct 18, 2024

hi @amaltaro:
I have patched my central services at cmsweb-test1.cern.ch with both of these patches: #12148 && #12120

Coming back to the tests requested here: #12120 (review)

  • create and assign standard workflow - DONE
  • create and assign ACDC workflow - Waiting for a complete Workflow
  • change priority for active workflow - DONE
  • change site lists for active workflow - DONE
  • change priority and site lists for active workflow - DONE

Here are two of the workflows I used for these tests:

@todor-ivanov
Contributor Author

todor-ivanov commented Oct 18, 2024

hi again @amaltaro, and here are the results for an artificially created ACDC:

This is the workflow: https://cmsweb-test1.cern.ch/reqmgr2/fetch?rid=tivanov_ACDC_TaskChain_LumiMask_multiRun_SiteListsTest_v6_241018_134936_9617

  • The workflow is properly transferred from new to assignment-approved
  • If I try to change any parameter without a change of the status from assignment-approved then I get this error:
[18/Oct/2024:13:56:41]  SERVER REST ERROR WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue 5e296467497b0a2f6e4069a199ef7e28 (Invalid spec parameter value: There were unhandled arguments left for no-status update: ['TrustSitelists', 'TrustPUSitelists', 'CustodialSites', 'NonCustodialSites'])
[18/Oct/2024:13:56:41]    Traceback (most recent call last):
[18/Oct/2024:13:56:41]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 749, in default
[18/Oct/2024:13:56:41]        return self._call(RESTArgs(list(args), kwargs))
[18/Oct/2024:13:56:41]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 832, in _call
[18/Oct/2024:13:56:41]        obj = apiobj['call'](*safe.args, **safe.kwargs)
[18/Oct/2024:13:56:41]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 559, in put
[18/Oct/2024:13:56:41]        result = self._updateRequest(workload, request_args)
[18/Oct/2024:13:56:41]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 537, in _updateRequest
[18/Oct/2024:13:56:41]        report = self._handleNoStatusUpdate(workload, request_args, dn)
[18/Oct/2024:13:56:41]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 446, in _handleNoStatusUpdate
[18/Oct/2024:13:56:41]        raise InvalidSpecParameterValue(msg)
[18/Oct/2024:13:56:41]    WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue: InvalidSpecParameterValue 5e296467497b0a2f6e4069a199ef7e28 [HTTP 400, APP 1102, MSG "Invalid spec parameter value: There were unhandled arguments left for no-status update: ['TrustSitelists', 'TrustPUSitelists', 'CustodialSites', 'NonCustodialSites']", INFO None, ERR None]

Which is kind of expected, since the assignment-approved map of allowed arguments contains a big set that we indeed do not support as NON-STATUS_UPDATE arguments, but do support in combination with a status update.

  • If I move the workflow from assignment-approved to assigned together with a change of the site whitelist, the priority, or anything else, it all goes smoothly.

@todor-ivanov
Contributor Author

@amaltaro

I did yet another test. With and without the patch from this PR applied:

  • Without this patch the workflow update is completely broken as you report it.
  • With this patch the workflow update proceeds only under the combination of change of parameters together with the state transition from assignment-approved to assigned

Which means we end up here:

report = self._handleAssignmentStateTransition(workload, request_args, dn)

instead of here:

report = self._handleNoStatusUpdate(workload, request_args, dn)

Which means we may need to either repeat the actions from _handleAssignmentStateTransition in _handleNoStatusUpdate, or stop treating the state transition and non-state transition arguments differently and unify those _handle* auxiliary methods, as I already suggested here: #12099 (comment)
Quote:

Of course, if you ask me - I am completely up for moving the whole logic to be implemented here in a more generic way .... for all status updates, then get rid of a big chunk of code covering custom cases ... and only make the proper calls to this generic method here from upstream modules (e.g. Request in the current case)

What do you think?
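
One way to picture the "generic" option is a table that maps each argument to the setter that applies it, so a single method can serve every kind of update. This is only a sketch under assumptions: `FakeWorkload`, `SETTERS` and `apply_update` are hypothetical, and although `setPriority` and `setSiteWhitelist` appear in the diff under review, `setSiteBlacklist` is assumed here by analogy.

```python
class FakeWorkload:
    """Minimal stand-in for a workload object, used only for this sketch."""

    def __init__(self):
        self.values = {}

    def setPriority(self, priority):
        self.values["RequestPriority"] = priority

    def setSiteWhitelist(self, sites):
        self.values["SiteWhitelist"] = sites

    def setSiteBlacklist(self, sites):
        self.values["SiteBlacklist"] = sites


# Hypothetical table: argument name -> name of the setter that applies it.
SETTERS = {
    "RequestPriority": "setPriority",
    "SiteWhitelist": "setSiteWhitelist",
    "SiteBlacklist": "setSiteBlacklist",
}


def apply_update(workload, request_args):
    """Apply every argument through its setter, or fail before changing anything."""
    unknown = sorted(name for name in request_args if name not in SETTERS)
    if unknown:
        raise ValueError(f"No handler registered for: {unknown}")
    for name, value in request_args.items():
        getattr(workload, SETTERS[name])(value)
    return workload
```

Supporting a new argument would then mean adding one row to the table rather than another special case in a `_handle*` method.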

@todor-ivanov
Contributor Author

BTW, up until now we did not feel the difference between changing workflow parameters with and without a state update from assignment-approved (which I explain in my previous comment) only because we were completely ignoring anything but RequestPriority in _handleNoStatusUpdate calls. So we were not failing the call (as we correctly do now) when we did not properly handle the arguments provided; we were just ignoring anything else the user sent us. So what I'd say here is that the correct behavior is now exposed. The question is what we do to fix it. The two possible paths I can see are listed in my comment above. We need to choose one of them.

@amaltaro any ideas?

@todor-ivanov
Contributor Author

todor-ivanov commented Oct 21, 2024

Ok @amaltaro, I tested a really nasty workaround, following the path of calling the _handleAssignmentStateTransition actions from the _handleNoStatusUpdate call. This works. I'll provide the workaround in my latest commit: 62a6d43, but I must stress - I really do not like this approach.
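
The shape of that workaround can be sketched roughly as follows. This is a simplified illustration, not the code in the commit: `SIMPLE_ARGS`, `handle_no_status_update` and the `assign` callback (standing in for the assignment logic) are hypothetical names.

```python
SIMPLE_ARGS = ("RequestPriority", "SiteWhitelist", "SiteBlacklist")


def handle_no_status_update(current_status, request_args, assign):
    """Apply simple arguments; hand leftovers to `assign` when assignment-approved."""
    leftovers = [name for name in request_args if name not in SIMPLE_ARGS]
    if not leftovers:
        return "updated"
    if current_status == "assignment-approved":
        # Workaround: reuse the assignment logic without a status change.
        return assign(request_args)
    raise ValueError(
        f"There were unhandled arguments left for no-status update: {leftovers}")
```

In any status other than assignment-approved, leftover arguments still fail the call, which matches the error shown earlier in this thread.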

Here are the logs from:

  • Transforming some arguments of an ACDC workflow in assignment-approved status, without actually moving it to assigned
[21/Oct/2024:12:35:26] reqmgr2-bcdccd8c6-hsmlj 188.185.122.76:54180 "GET /reqmgr2/data/request?status=assigned&detail=True HTTP/1.1" 200 OK [data: 1652 in 9632 out 35882 us ] [auth: ok ***]
[21/Oct/2024:12:35:29] reqmgr2-bcdccd8c6-hsmlj 127.0.0.1 "GET /reqmgr2/data/info HTTP/1.1" 200 OK [data: 296 in 668 out 28878 us ] [auth: OK "" "" ] [ref: "" "Go-http-client/1.1" ]
[21/Oct/2024:12:35:46]  Updating request "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621" with these user-provided args: {'RequestPriority': 200000, 'Team': 'testbed-vocms0290', 'SiteWhitelist': '', 'SiteBlacklist': ['T1_DE_KIT', 'T1_ES_PIC', 'T1_FR_CCIN2P3', 'T1_IT_CNAF', 'T1_UK_RAL', 'T2_BE_IIHE', 'T2_BE_UCL'], 'AcquisitionEra': {'myTask1': 'RunIISummer20UL16wmLHEGENAPV', 'myTask2': 'RunIISummer20UL16SIMAPV', 'myTask3': 'RunIISummer20UL16DIGIPremixAPV', 'myTask4': 'RunIISummer20UL16HLTAPV', 'myTask5': 'RunIISummer20UL16RECOAPV', 'myTask6': 'RunIISummer20UL16MiniAODAPV'}, 'ProcessingString': {'myTask1': 'myTask1_TaskChain_Prod_SiteListsTest_v6', 'myTask2': 'myTask2_TaskChain_Prod_SiteListsTest_v6', 'myTask3': 'myTask3_TaskChain_Prod_SiteListsTest_v6', 'myTask4': 'myTask4_TaskChain_Prod_SiteListsTest_v6', 'myTask5': 'myTask5_TaskChain_Prod_SiteListsTest_v6', 'myTask6': 'myTask6_TaskChain_Prod_SiteListsTest_v6'}, 'ProcessingVersion': {'myTask1': 11, 'myTask2': 12, 'myTask3': 13, 'myTask4': 14, 'myTask5': 15, 'myTask6': 16}, 'Dashboard': 'production', 'MergedLFNBase': '/store/backfill/1', 'TrustSitelists': 'False', 'UnmergedLFNBase': '/store/unmerged', 'MinMergeSize': 2147483648, 'MaxMergeSize': 4294967296, 'MaxMergeEvents': 100000000, 'BlockCloseMaxWaitTime': 66400, 'BlockCloseMaxFiles': 500, 'BlockCloseMaxEvents': 25000000, 'BlockCloseMaxSize': 5000000000000, 'SoftTimeout': 129600, 'GracePeriod': 300, 'TrustPUSitelists': 'True', 'CustodialSites': '', 'NonCustodialSites': '', 'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/PRODUCTION'}, 'SubscriptionPriority': 'Low'}
[21/Oct/2024:12:35:46]  Updated priority of "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621" to: 200000
[21/Oct/2024:12:35:46]  Unhandled argument for no-status update: Team
[21/Oct/2024:12:35:46]  Updated SiteWhitelist of "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621", with:  
[21/Oct/2024:12:35:46]  Updated SiteBlacklist of "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621", with:  ['T1_DE_KIT', 'T1_ES_PIC', 'T1_FR_CCIN2P3', 'T1_IT_CNAF', 'T1_UK_RAL', 'T2_BE_IIHE', 'T2_BE_UCL']
[21/Oct/2024:12:35:46]  Unhandled argument for no-status update: TrustPUSitelists
[21/Oct/2024:12:35:46]  CurrentRequest status: assignment-approved
[21/Oct/2024:12:35:46]  Handling assignment-approved arguments differently!
[21/Oct/2024:12:35:46]  Assign request tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621, input args: {'RequestPriority': 200000, 'Team': 'testbed-vocms0290', 'SiteWhitelist': '', 'SiteBlacklist': ['T1_DE_KIT', 'T1_ES_PIC', 'T1_FR_CCIN2P3', 'T1_IT_CNAF', 'T1_UK_RAL', 'T2_BE_IIHE', 'T2_BE_UCL'], 'TrustPUSitelists': 'True'} ...
[21/Oct/2024:12:35:46] reqmgr2-bcdccd8c6-hsmlj 188.185.16.223:35786 "PUT /reqmgr2/data/request/tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621 HTTP/1.1" 200 OK [data: 3598 in 110 out 455516 us ] [auth: ok ***]
  • Transitioning the same workflow from assignment-approved to assigned with again few arguments changed:
[21/Oct/2024:12:40:36] reqmgr2-bcdccd8c6-hsmlj 188.184.96.94:20730 "GET /reqmgr2/data/wmagentconfig/vocms0290.cern.ch HTTP/1.1" 200 OK [data: 1567 in 306 out 16027 us ] [auth: ok "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst1/CN=718748/CN=Robot: cms t1" "" ] [ref: "https://cmsweb-test1.cern.ch" "WMCore.Services.Requests/v002" ]
[21/Oct/2024:12:40:55]  Updating request "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621" with these user-provided args: {'RequestStatus': 'assigned', 'RequestPriority': 200000, 'Team': 'testbed-vocms0290', 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': '', 'AcquisitionEra': {'myTask1': 'RunIISummer20UL16wmLHEGENAPV', 'myTask2': 'RunIISummer20UL16SIMAPV', 'myTask3': 'RunIISummer20UL16DIGIPremixAPV', 'myTask4': 'RunIISummer20UL16HLTAPV', 'myTask5': 'RunIISummer20UL16RECOAPV', 'myTask6': 'RunIISummer20UL16MiniAODAPV'}, 'ProcessingString': {'myTask1': 'myTask1_TaskChain_Prod_SiteListsTest_v6', 'myTask2': 'myTask2_TaskChain_Prod_SiteListsTest_v6', 'myTask3': 'myTask3_TaskChain_Prod_SiteListsTest_v6', 'myTask4': 'myTask4_TaskChain_Prod_SiteListsTest_v6', 'myTask5': 'myTask5_TaskChain_Prod_SiteListsTest_v6', 'myTask6': 'myTask6_TaskChain_Prod_SiteListsTest_v6'}, 'ProcessingVersion': {'myTask1': 11, 'myTask2': 12, 'myTask3': 13, 'myTask4': 14, 'myTask5': 15, 'myTask6': 16}, 'Dashboard': 'production', 'MergedLFNBase': '/store/backfill/1', 'TrustSitelists': 'False', 'UnmergedLFNBase': '/store/unmerged', 'MinMergeSize': 2147483648, 'MaxMergeSize': 4294967296, 'MaxMergeEvents': 100000000, 'BlockCloseMaxWaitTime': 66400, 'BlockCloseMaxFiles': 500, 'BlockCloseMaxEvents': 25000000, 'BlockCloseMaxSize': 5000000000000, 'SoftTimeout': 129600, 'GracePeriod': 300, 'TrustPUSitelists': 'True', 'CustodialSites': '', 'NonCustodialSites': '', 'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/PRODUCTION'}, 'SubscriptionPriority': 'Low'}
[21/Oct/2024:12:40:55]  Assign request tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621, input args: {'RequestStatus': 'assigned', 'RequestPriority': 200000, 'Team': 'testbed-vocms0290', 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': [], 'AcquisitionEra': {'myTask1': 'RunIISummer20UL16wmLHEGENAPV', 'myTask2': 'RunIISummer20UL16SIMAPV', 'myTask3': 'RunIISummer20UL16DIGIPremixAPV', 'myTask4': 'RunIISummer20UL16HLTAPV', 'myTask5': 'RunIISummer20UL16RECOAPV', 'myTask6': 'RunIISummer20UL16MiniAODAPV'}, 'ProcessingString': {'myTask1': 'myTask1_TaskChain_Prod_SiteListsTest_v6', 'myTask2': 'myTask2_TaskChain_Prod_SiteListsTest_v6', 'myTask3': 'myTask3_TaskChain_Prod_SiteListsTest_v6', 'myTask4': 'myTask4_TaskChain_Prod_SiteListsTest_v6', 'myTask5': 'myTask5_TaskChain_Prod_SiteListsTest_v6', 'myTask6': 'myTask6_TaskChain_Prod_SiteListsTest_v6'}, 'ProcessingVersion': {'myTask1': 11, 'myTask2': 12, 'myTask3': 13, 'myTask4': 14, 'myTask5': 15, 'myTask6': 16}, 'Dashboard': 'production', 'MergedLFNBase': '/store/backfill/1', 'TrustSitelists': False, 'UnmergedLFNBase': '/store/unmerged', 'MinMergeSize': 2147483648, 'MaxMergeSize': 4294967296, 'MaxMergeEvents': 100000000, 'BlockCloseMaxWaitTime': 66400, 'BlockCloseMaxFiles': 500, 'BlockCloseMaxEvents': 25000000, 'BlockCloseMaxSize': 5000000000000, 'SoftTimeout': 129600, 'GracePeriod': 300, 'TrustPUSitelists': True, 'CustodialSites': [], 'NonCustodialSites': [], 'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/PRODUCTION'}, 'SubscriptionPriority': 'Low', 'HardTimeout': 129900} ...
[21/Oct/2024:12:40:56] reqmgr2-bcdccd8c6-hsmlj 188.185.16.223:59098 "PUT /reqmgr2/data/request/tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621 HTTP/1.1" 200 OK [data: 3561 in 110 out 236476 us ] [auth: ok

There is a side effect, though, besides the fact that we get into a spiral of calls within calls of methods which could be aligned sequentially, skipping only the irrelevant ones... but anyway. The side effect I am speaking of is actually something I think I've seen in the past: once you update any of the request parameters in the web interface while the ACDC is in assignment-approved, but you do not update the status to assigned, the web interface loses the default values for Team and SiteWhitelist. You need to beware not to hit the Submit button without checking those, because otherwise you'll end up with the following error:

[21/Oct/2024:12:40:30]  Updating request "tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621" with these user-provided args: {'RequestStatus': 'assigned', 'RequestPriority': 200000, 'Team': '', 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': '', 'AcquisitionEra': {'myTask1': 'RunIISummer20UL16wmLHEGENAPV', 'myTask2': 'RunIISummer20UL16SIMAPV', 'myTask3': 'RunIISummer20UL16DIGIPremixAPV', 'myTask4': 'RunIISummer20UL16HLTAPV', 'myTask5': 'RunIISummer20UL16RECOAPV', 'myTask6': 'RunIISummer20UL16MiniAODAPV'}, 'ProcessingString': {'myTask1': 'myTask1_TaskChain_Prod_SiteListsTest_v6', 'myTask2': 'myTask2_TaskChain_Prod_SiteListsTest_v6', 'myTask3': 'myTask3_TaskChain_Prod_SiteListsTest_v6', 'myTask4': 'myTask4_TaskChain_Prod_SiteListsTest_v6', 'myTask5': 'myTask5_TaskChain_Prod_SiteListsTest_v6', 'myTask6': 'myTask6_TaskChain_Prod_SiteListsTest_v6'}, 'ProcessingVersion': {'myTask1': 11, 'myTask2': 12, 'myTask3': 13, 'myTask4': 14, 'myTask5': 15, 'myTask6': 16}, 'Dashboard': 'production', 'MergedLFNBase': '/store/backfill/1', 'TrustSitelists': 'False', 'UnmergedLFNBase': '/store/unmerged', 'MinMergeSize': 2147483648, 'MaxMergeSize': 4294967296, 'MaxMergeEvents': 100000000, 'BlockCloseMaxWaitTime': 66400, 'BlockCloseMaxFiles': 500, 'BlockCloseMaxEvents': 25000000, 'BlockCloseMaxSize': 5000000000000, 'SoftTimeout': 129600, 'GracePeriod': 300, 'TrustPUSitelists': 'True', 'CustodialSites': '', 'NonCustodialSites': '', 'Override': {'eos-lfn-prefix': 'root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/PRODUCTION'}, 'SubscriptionPriority': 'Low'}
[21/Oct/2024:12:40:30]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 189, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 101, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 102, in validate_request_update_args
    workload.validateArgumentForAssignment(request_args)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 1945, in validateArgumentForAssignment
    validateArgumentsUpdate(schema, argumentDefinition)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 293, in validateArgumentsUpdate
    _validateArgumentOptions(arguments, argumentDefinition, "assign_optional")
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 160, in _validateArgumentOptions
    arguments[arg] = _validateArgument(arg, arguments[arg], argDef)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 101, in _validateArgument
    _validateArgFunction(argument, value, argumentDefinition["validate"])
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 133, in _validateArgFunction
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: Argument 'Team' with value '', doesn't pass the validate function.
It's definition is:
                              "validate": lambda x: len(x) > 0},

	ClassName : None
	ModuleName : WMCore.WMSpec.WMWorkloadTools
	MethodName : _validateArgFunction
	ClassInstance : None
	FileName : /usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py
	LineNumber : 133
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>

[21/Oct/2024:12:40:30]  SERVER REST ERROR WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue b605a54fcc92e4fbf18b20eabe763378 (Invalid spec parameter value: Argument 'Team' with value '', doesn't pass the validate function.
It's definition is:
                              "validate": lambda x: len(x) > 0},
)
[21/Oct/2024:12:40:30]    Traceback (most recent call last):
[21/Oct/2024:12:40:30]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 749, in default
[21/Oct/2024:12:40:30]        return self._call(RESTArgs(list(args), kwargs))
[21/Oct/2024:12:40:30]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 828, in _call
[21/Oct/2024:12:40:30]        v(apiobj, request.method, api, param, safe)
[21/Oct/2024:12:40:30]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 224, in validate
[21/Oct/2024:12:40:30]        raise InvalidSpecParameterValue(msg) from None
[21/Oct/2024:12:40:30]    WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue: InvalidSpecParameterValue b605a54fcc92e4fbf18b20eabe763378 [HTTP 400, APP 1102, MSG 'Invalid spec parameter value: Argument \'Team\' with value \'\', doesn\'t pass the validate function.\nIt\'s definition is:\n                              "validate": lambda x: len(x) > 0},\n', INFO None, ERR None]
[21/Oct/2024:12:40:30] reqmgr2-bcdccd8c6-hsmlj 188.185.16.223:33620 "PUT /reqmgr2/data/request/tivanov_ACDC_TaskChain_Prod_SiteListsTest_v6_241021_114436_6621 HTTP/1.1" 400 Bad Request [data: 3544 in 895 out 53804 us ] [auth: ok  ***]
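
The failure above comes down to the `validate` function attached to the `Team` argument rejecting an empty string. A minimal reproduction, assuming a definition that contains only the lambda printed in the error (`ARGUMENT_DEFINITION` and `check_argument` are hypothetical helpers, not WMCore functions):

```python
# The "validate" entry mirrors the lambda printed in the error above; the
# rest of the real definition is omitted here.
ARGUMENT_DEFINITION = {"Team": {"validate": lambda x: len(x) > 0}}


def check_argument(name, value, definition=ARGUMENT_DEFINITION):
    """Raise ValueError when the value fails the argument's validate function."""
    validator = definition.get(name, {}).get("validate")
    if validator is not None and not validator(value):
        raise ValueError(
            f"Argument '{name}' with value {value!r} doesn't pass the validate function.")
    return value
```

Calling `check_argument("Team", "")` raises, while `check_argument("Team", "testbed-vocms0290")` returns the value unchanged.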

@todor-ivanov todor-ivanov force-pushed the feature_SeteWhitelist_SupportChangeInReqmgr2_fix-12307 branch from 0111d4d to 62a6d43 Compare October 21, 2024 13:18
@amaltaro
Contributor

@todor-ivanov given that this PR was started while we had the 3 pull requests merged into master/head, wouldn't it make this development cleaner if we applied the changes in this PR on top of #12148?

Or having a second look into the commits in this one, it looks like it has all the commits from #12148. If that is the case, should we close out #12148 to avoid any possible confusion?

In addition, please make sure the initial PR description is up-to-date.

@todor-ivanov
Contributor Author

todor-ivanov commented Oct 23, 2024

hi @amaltaro that was my idea as well. I am pretty much in favor of closing #12148. While testing the fix for the broken assignment-approved transition, provided here with my last two commits, I already had to rebase on top of #12148, so the current PR has all that was there as well.

I'll merge the two PR descriptions as well.

@amaltaro
Contributor

And before I forget to say it once again, when validating this feature a couple of weeks ago, I noticed that we could update the site lists in ReqMgr2 Web UI as well.

It looks like the original idea for get_modifiable_properties was to render fields in the Web UI, instead of making that an authoritative list of parameters that are allowed to change in each status.

If we want to repurpose that, I think it is fine. But I am not very comfortable with allowing users to change site lists in the Web UI. It's very much error-prone, and given the cost of this operation across the system, we need to keep it as low as we can.

That said, @todor-ivanov can you please check with the P&R team on the actual need to have this feature in the Web UI as well? If they are happy having this feature only through the REST API - programmatically - that would be the best IMO.

@todor-ivanov todor-ivanov requested a review from amaltaro October 23, 2024 12:34
@todor-ivanov
Contributor Author

Hi @amaltaro, if I am to revert this at this stage, it would require a significant refactoring of the whole idea behind the change. I am not even sure it would be possible, because this is how it is validated - this is my understanding of the code even at the current stage. We were simply ignoring anything sent with the user's request and just setting up the priority field (all the rest we were setting to zero). I am not sure we are actually changing any behavior. To me it seems we are all good. And I actually do find it quite useful to have it exposed in the Web UI. In the end, adding a feature that is available through one interface and not through the other would be yet one more "if" that we would have to remember forever.

@amaltaro
Contributor

The way the REST and Web UI are constructed is different, so there is no conditional statement involved in this story.

I also agree that having it in the Web UI is useful. But my concern is on typos and mistakes by using the Web UI, which can perhaps lead to error. Right now, Web UI is only used for ACDC assignment, AFAIK.
I am curious to know the position of the P&R team on this. Can you please communicate it with Ahmed/Hassan?

@todor-ivanov
Contributor Author

todor-ivanov commented Oct 23, 2024

typos and mistakes by using the Web UI,

the SiteLists are drop down menus.

@todor-ivanov
Contributor Author

And we got the P&R reply - they'd prefer to have the Web UI interface as well

@amaltaro
Contributor

Thank you Todor. That would have been my answer as well, from the user perspective.

@khurtado
Contributor

test this please

@anpicci
Contributor

anpicci commented Oct 28, 2024

@amaltaro @todor-ivanov I propose to keep going with the proposed solution, while clarifying to operators that they will be accountable for any disruptions induced by improper use of this functionality.

I have only one question:

  • This solution prevents anyone from picking a site that isn't part of our resource site pool, am I correct? For example, I can select T1_FNAL_US, but I cannot select T2_FNAL_US

Btw, @todor-ivanov there are some failing checks for Jenkins

@todor-ivanov
Contributor Author

hi @anpicci :

This solution prevents anyone from picking a site that isn't part of our resource site pool, am I correct? For example, I can select T1_FNAL_US, but I cannot select T2_FNAL_US

Yes. The site names are predefined and already populated in the list of possibilities:
siteLists_assignment-approved_WEBUI

Btw, @todor-ivanov there are some failing checks for Jenkins

I cannot make sense of these so far. @khurtado how should I read them? I see you have cancelled the tests, are those reliable at this stage?

Contributor

@amaltaro amaltaro left a comment

Todor, please find some comments and requests along the code.

@@ -22,11 +23,48 @@
from WMCore.WMSpec.WMWorkloadTools import loadSpecClassByType, setArgumentsWithDefault
from WMCore.Cache.GenericDataCache import GenericDataCache, MemoryCacheStruct


def workqueue_stat_validation(request_args):
    stat_keys = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files']
Contributor

It looks like you forgot to update this code with the new constant variable.

Contributor Author

what do you mean?

Contributor

You are importing ALLOWED_STAT_KEYS, so you can replace this line by that.

Contributor Author

correct
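
The replacement requested in this thread could look roughly like the sketch below. The diff shows only the first line of the function body, so the key comparison is an assumption, and the constant is defined inline here only to keep the example self-contained (in the PR it is imported).

```python
# In the PR this constant is imported; it is defined inline here only to keep
# the sketch self-contained.
ALLOWED_STAT_KEYS = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files']


def workqueue_stat_validation(request_args):
    # Assumption: the check compares the provided keys against the allowed set.
    return set(request_args.keys()) == set(ALLOWED_STAT_KEYS)
```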

        workload.setPriority(reqArgs['RequestPriority'])
        cherrypy.log('Updated priority of "{}" to: {}'.format(workload.name(), reqArgs['RequestPriority']))
    elif reqArg == "SiteWhitelist":
        workload.setSiteWhitelist(reqArgs["SiteWhitelist"])
Contributor

I lost track of where I provided this feedback, but we need to validate the site list provided according to the StdBase base class. Similarly to the other site list.
Check out the validation function here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1148
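
For illustration only, a site-list check of the kind requested here might look like the sketch below. It does not reproduce the StdBase validation linked above: `KNOWN_SITES` and `normalize_site_list` are hypothetical, and treating an empty string as an empty list follows what the ReqMgr2 logs in this thread show for `SiteBlacklist`.

```python
# Hypothetical site pool; in production the list of known sites would come
# from the service configuration.
KNOWN_SITES = {"T1_US_FNAL", "T2_CH_CERN", "T1_DE_KIT"}


def normalize_site_list(sites, known_sites=KNOWN_SITES):
    """Return the site list as a list, rejecting names outside the known pool."""
    if isinstance(sites, str):
        sites = [sites] if sites else []  # '' means "no sites"
    unknown = sorted(site for site in sites if site not in known_sites)
    if unknown:
        raise ValueError(f"Unknown site names: {unknown}")
    return list(sites)
```

Running such a check before `workload.setSiteWhitelist(...)` would stop a misspelled site name from reaching the workload.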

        cherrypy.log('Updated SiteBlacklist of "{}", with: {}'.format(workload.name(), reqArgs['SiteBlacklist']))
    else:
        reqArgsNothandled.append(reqArg)
        cherrypy.log("Unhandled argument for no-status update: %s" % reqArg)
Contributor

This can be somewhat verbose. I would suggest moving this line (well, with the relevant update) under the if reqArgsNothandled: conditional block.

Contributor Author

no need to .. in the next PR this whole line completely vanishes....

if reqArgsNothandled:
    if reqStatus == 'assignment-approved':
        cherrypy.log(f"Handling assignment-approved arguments differently!")
        self._handleAssignmentStateTransition(workload, request_args, dn)
Contributor

The call to _handleAssignmentStateTransition should only be performed when we are making a workflow assignment, which is defined from a state transition from assignment-approved to assigned, which we do not have in here. What am I missing here?

Contributor Author

And this is exactly the line which solves the bug because of which this PR was stopped. Also in relation to my previous comment - as long as we do not want to refactor the whole logic of how we make those request parameter changes, we will still have to treat the assignment-approved status separately, even when it comes to non-state transition changes.
The alternative would be to re-implement all the calls from the method _handleAssignmentStateTransition (basically copy-paste them) here as well.

reqArgsNothandled.append(reqArg)
cherrypy.log("Unhandled argument for no-status update: %s" % reqArg)

reqStatus = self.reqmgr_db_service.getRequestByNames(workload.name())[workload.name()]['RequestStatus']
Contributor

If this line of code is really needed, then we should only execute it on demand.

Contributor Author

It is needed. As long as we do not approach the whole action in a generic way and unite all request parameter changes in a common mechanism, we will have to know the status of the request even during non-status update actions.
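For reference, the "on demand" variant suggested above could look roughly like this. It is only a sketch: `resolve_status_on_demand` and `fetch_status` are hypothetical names, with `fetch_status` standing in for the database lookup of RequestStatus.

```python
def resolve_status_on_demand(unhandled_args, fetch_status):
    """Return the request status only when it is actually needed.

    fetch_status is a hypothetical zero-argument callable wrapping the
    database lookup; it is not invoked when there is nothing left to handle.
    """
    if not unhandled_args:
        return None
    return fetch_status()
```

The trade-off is that the status is then unavailable for logging in the common case where every argument was already handled.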

else:
    msg = "There are invalid arguments for no-status update: %s" % request_args
    raise InvalidSpecParameterValue(msg)
reqArgs = deepcopy(request_args)
Contributor

We have already discussed this line in your other PR, but can you please remind me why we need to have a full copy of the object at this place?
From what I can tell, we only pop RequestName in the method validate_request_update_args(), which is called here:

def validate(self, apiobj, method, api, param, safe):

So this is a tardy use of deepcopy, no?

Contributor Author

Because the whole chain of calls for validating and applying a change to the request goes through destructive methods, and this one is no different. Some of the logic (part of which you mentioned yourself during one of those meetings) relies on those destructive calls during validation. Hence we prefer to act here only on a completely separate copy of the request arguments (which is small in size) and not alter the existing logic for validating them. In short: to stay on the safe side.
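A small standalone illustration of the concern, assuming a validator that mutates its input. `destructive_validate` is a hypothetical stand-in and not the actual validate_request_update_args; it pops RequestName and sorts a nested list in place, so only a deep copy keeps the caller's arguments intact.

```python
from copy import deepcopy


def destructive_validate(args):
    """Hypothetical validator that mutates its input while checking it."""
    args.pop("RequestName", None)           # removes a top-level key
    args.get("SiteWhitelist", []).sort()    # mutates a nested list in place
    return args


original = {"RequestName": "wf_1", "SiteWhitelist": ["T2_B", "T1_A"]}

# Validate a deep copy so that neither the key removal nor the in-place
# sort leaks back into the caller's dictionary.
validated = destructive_validate(deepcopy(original))
```

A shallow copy would protect the top-level key but not the nested list, which is the case the deep copy guards against.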


from Utils.PythonVersion import PY3
from Utils.Utilities import encodeUnicodeToBytesConditional
from WMCore.Lexicon import procdataset
from WMCore.REST.Auth import authz_match
from WMCore.ReqMgr.DataStructs.Request import initialize_request_args, initialize_clone
from WMCore.ReqMgr.DataStructs.RequestError import InvalidStateTransition, InvalidSpecParameterValue
from WMCore.ReqMgr.DataStructs.RequestStatus import check_allowed_transition, STATES_ALLOW_ONLY_STATE_TRANSITION
from WMCore.ReqMgr.DataStructs.RequestStatus import check_allowed_transition, get_modifiable_properties, STATES_ALLOW_ONLY_STATE_TRANSITION, ALLOWED_STAT_KEYS
Contributor

Please break it in multiple lines, e.g.:

from WMCore.ReqMgr.DataStructs.RequestStatus import (check_allowed_transition, get_modifiable_properties,
                                STATES_ALLOW_ONLY_STATE_TRANSITION, ALLOWED_STAT_KEYS)

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 15 warnings
    • 171 comments to review
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/80/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

This whole PR is transient, providing a bugfix related to assignment-approved non-state transition actions which was found during the validation of the previous WMCore tag. The next PR #12099 is going to remake a big part of this logic, and we should concentrate on discussing the solution there. So I'd suggest we merge this one (it has already been properly tested and proved it fixes the bug reported) and I rebase #12099 on top of this one, such that we can continue our work on this feature.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 15 warnings
    • 171 comments to review
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/81/artifact/artifacts/PullRequestReport.html

Add proper checks for allowed request properties changes && Stop reducing request_args only to RequestPriority for noStatusTransition actions

Extend allowed arguments including stat_keys and RequestStatus && Call the relevant modifiers from _handleNoStatusUpdate

Fix missing RequestName from request_args on mutipple calls of validate_request_update_args

Add proper log messages for Sitelists changes

Remove redundant validation calls && Update reqdb couch with single action && Move reduceReport to Utils.

Typo

Remove forgotten commented lines of code

Review comments

Source files pylint fixes

Review comments

Exclude RequestStatus from the returned values of get_modifiable_properties

Remove commented lines

Separate state transition validation from nonState transition valdations

Handle assignment-approved arguments differently inside _handleNoSatusUpdate calls

Review comments
Unit tests

Unit tests pylint fixes

Unit tests - remove tests for reduceReport

Unit tests

Unit tests
@todor-ivanov todor-ivanov force-pushed the feature_SeteWhitelist_SupportChangeInReqmgr2_fix-12307 branch from f9df765 to 49b07c5 Compare November 11, 2024 14:17
@todor-ivanov
Contributor Author

squashed and rebased as well

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 15 warnings
    • 171 comments to review
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/82/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

@todor-ivanov unless you have a strong reason not to, I think it would be better to bring in #12099 on top of these changes - or at least, the relevant ReqMgr2 changes.

This will make traceability and review of these changes (which are implemented and then partially/totally deleted in the upcoming PR) much easier. The other PR is supposed to be fixing "Update workqueue elements upon workflow site list change", so it would be better to have only that relevant code in there as well.

@todor-ivanov
Contributor Author

todor-ivanov commented Nov 15, 2024

hi @amaltaro did you mean:

@amaltaro
Contributor

I was suggesting something similar to the first approach, but it does not have to be a full merge of the two PRs, because that would mean mixing:

  • support for live updates of the parameters in ReqMgr2
  • and support for propagating live updates to the workqueue elements

It would be great if we could cover the ReqMgr2-related changes in this PR only (of course, other than what is required for the WorkQueue update).
In other words, if this PR adds some piece of code that the next PR actually deletes, then please merge those together such that we don't even add anything into master.

@todor-ivanov
Contributor Author

@amaltaro I have rebased #12099 on top of this PR, but what will be moved from this PR into the other one is not fully visible until we merge the current one into master, simply because the current changes are not yet in master. If you want to see the exact diff between the two before the current one is merged, you may look at this temporary PR which I created in my repo, just for comparison: https://github.com/todor-ivanov/WMCore/pull/1/files

Contributor

@amaltaro amaltaro left a comment

@todor-ivanov other than a comment along the code, I have 2 more general comments:

  1. I think I requested it in my previous review. We need to validate the site lists provided by the user, see StdBase for the actual validation function.
  2. this PR introduces 2 extra calls to ReqMgr2, retrieving the workflow data for some (pre-)processing, as in:

if possible, we should fetch the request content from ReqMgr2 only once.
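One way to fetch the request content only once per REST call is a small per-call cache shared by the validation and update steps. This is a sketch under assumptions: `RequestLookup` and its `fetch` callable are hypothetical and do not correspond to an existing WMCore class.

```python
class RequestLookup:
    """Per-call cache around a request fetcher (hypothetical helper).

    fetch is any callable taking a request name and returning its document.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}

    def get(self, name):
        # Hit the backend only on the first lookup of a given request name.
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]
```

Creating one such object per incoming call and passing it to both modules would avoid the repeated lookups, at the cost of threading an extra argument through the call chain. The cache should not outlive the call, otherwise it could serve a stale status.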

@@ -22,11 +23,48 @@
from WMCore.WMSpec.WMWorkloadTools import loadSpecClassByType, setArgumentsWithDefault
from WMCore.Cache.GenericDataCache import GenericDataCache, MemoryCacheStruct


def workqueue_stat_validation(request_args):
    stat_keys = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files']
Contributor

You are importing ALLOWED_STAT_KEYS, so you can replace this line by that.
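The suggested replacement could look like the sketch below. Two assumptions are made: that ALLOWED_STAT_KEYS holds the same four keys as the local list, and that the function returns whether the provided keys match them exactly (the rest of the function body is not visible in this diff). The constant is redefined locally only to keep the example self-contained.

```python
# Local stand-in for the constant imported from RequestStatus (assumed value).
ALLOWED_STAT_KEYS = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files']


def workqueue_stat_validation(request_args):
    """Return True only if the arguments are exactly the allowed stat keys."""
    # Assumed behaviour: an exact match of the key sets, with no extras.
    return set(request_args.keys()) == set(ALLOWED_STAT_KEYS)
```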

@todor-ivanov
Contributor Author

hi @amaltaro

about:

  1. I think I requested it in my previous review. We need to validate the site lists provided by the user, see StdBase for the actual validation function.

I think the contents of the request are already validated, but even if not it would be much better to have this development added while working with the next PR

  2. this PR introduces 2 extra calls to ReqMgr2, retrieving the workflow data for some (pre-)processing, as in:
    
    in the Validation.py module: https://github.com/dmwm/WMCore/pull/12120/files#diff-87fd02ae1a3e3ffa5aec6f34177fcf3471187007cfb2d9a668bf4ef8f726b12aR107
    in the Request.py module: https://github.com/dmwm/WMCore/pull/12120/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794R444

I can relate to this comment. I was already thinking of uniting those two calls in a single one, but again it will happen much easier when we move all workload content modification operations to WorkloadHelper.

So let's merge this one and move this discussion there. The current code is tested and is not going to break RequestManager if it goes to the master branch.

@amaltaro amaltaro merged commit 867b44e into dmwm:master Nov 22, 2024
1 of 3 checks passed
@amaltaro
Contributor

For the site validation, we need to validate as early as possible in the chain. Having said that, the proper place would be either the Request.py or the Validation.py module, instead of WMWorkload.
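An early check of that kind might be sketched as below. Everything in it is an assumption for illustration: `validate_site_lists` is a hypothetical helper, the site-name pattern is a guess at the naming shape, and the real validation function in StdBase may apply different rules.

```python
import re

# Assumed site-name shape (tier, country code, label); illustrative only.
SITE_NAME = re.compile(r"^T[0-3]_[A-Z]{2}_[A-Za-z0-9_]+$")


def validate_site_lists(request_args):
    """Raise ValueError on malformed or conflicting site lists (hypothetical)."""
    lists = {}
    for key in ("SiteWhitelist", "SiteBlacklist"):
        sites = request_args.get(key, [])
        if not isinstance(sites, list):
            raise ValueError("%s must be a list, got %s" % (key, type(sites).__name__))
        bad = [s for s in sites if not isinstance(s, str) or not SITE_NAME.match(s)]
        if bad:
            raise ValueError("Invalid site names in %s: %s" % (key, bad))
        lists[key] = set(sites)

    # A site present in both lists makes the request ambiguous.
    overlap = lists["SiteWhitelist"] & lists["SiteBlacklist"]
    if overlap:
        raise ValueError("Sites present in both lists: %s" % sorted(overlap))
    return True
```

Calling such a check at the REST validation layer would reject a bad site list before any workload object is loaded or modified.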

Development

Successfully merging this pull request may close these issues.

Support change to SiteWhitelist/SiteBlacklist in ReqMgr2 for active workflows
6 participants