-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate state transition validation from nonState transition valdations #12120
Separate state transition validation from nonState transition valdations #12120
Conversation
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@todor-ivanov as we have already found 2 issues with this development, I would like to ask you to run a representative validation for this fix. By representative, I can think of at least the following use cases:
- create and assign standard workflow
- create and assign ACDC workflow
- change priority for active workflow
- change site lists for active workflow
- change priority and site lists for active workflow
Once we cover all these use cases, then you could look into fixing the unit tests as well.
Can one of the admins verify this patch? |
@todor-ivanov I am trying to organize WM central services upgrade and I need to know how this validation is progressing and when you think you can finish it? If you think it can still take a few days, then we might actually revert the 2 changes that went in and give you the time you need to validate this. |
@amaltaro I am pretty sure I will not be able to finish this before mid next week. |
hi @amaltaro: Coming back to the tests requested here: #12120 (review)
Here are two of the workflows I used for these tests: |
hi gain @amaltaro , and here are the results for an artificially created ACDC: This is the workflow: https://cmsweb-test1.cern.ch/reqmgr2/fetch?rid=tivanov_ACDC_TaskChain_LumiMask_multiRun_SiteListsTest_v6_241018_134936_9617
Which is kind of expected, since in the
|
I did yet another test. With and without the patch from this PR applied:
Which means we end up here:
instead of here:
So which means we may need to either repeat the actions from
What do you think? |
BTW, up until now we did not feel the difference for workflow parameter change with and without state update from @amaltaro any ideas? |
Ok @amaltaro , I tested a really nasty workaround, following the path for calling Here are the logs from:
There is a side effect, though, Besides the fact that we get into to spiral of calls within calls of methods which could be aligned sequentially and only skip the irrelevant ones.... but anyway. The side effect I am speaking is actually something I think I've seen in the past, which is - Once you update any of the request parameters in the web interface while the ACDC is in
|
0111d4d
to
62a6d43
Compare
@todor-ivanov given that this PR was started while we had the 3 pull requests merged into master/head, wouldn't it make this development more clean if we apply the changes in this PR on top of #12148? Or having a second look into the commits in this one, it looks like it has all the commits from #12148. If that is the case, should we close out #12148 to avoid any possible confusion? In addition, please make sure the initial PR description is up-to-date. |
hi @amaltaro that was my idea as well. I am pretty much in favor of closing #12148. While testing the fix about the broken I'll merge the two PR descriptions as well. |
And before I forget to say it once again, when validating this feature a couple of weeks ago, I noticed that we could update the site lists in ReqMgr2 Web UI as well. It looks like the original idea for If we want to repurpose that, I think it is fine. But I am not very comfortable with allowing users to change site lists in the Web UI. It's very much error-prone, and given the cost of this operation across the system, we need to keep it as low as we can. Said that, @todor-ivanov can you please check with the P&R team on the actual needs to have this feature in the Web UI as well? If they are happy having this feature only through REST API - programmatically - that would be the best IMO. |
Hi @amaltaro, if I am to revert this at this stage, it would require a significant refactoring of the whole idea behind the whole change. I am not even sure it would be possible, because this is how it is validated - this is my understanding of the code even at the current stage. WE were simply ignoring anything sent with the user's request and just setting up the priority field (all the rest we were setting to zero). I am not sure we are actually changing any behavior. To it seems we are all good. . And I actually do find it quite useful to have it exposed to the WEB UI. At the end adding a feature to be available through one interface and not through the another... would be yet one more |
The way the REST and Web UI are constructed is different, so there is no conditional statement involved in this story. I also agree that having it in the Web UI is useful. But my concern is on typos and mistakes by using the Web UI, which can perhaps lead to error. Right now, Web UI is only used for ACDC assignment, AFAIK. |
the SiteLists are drop down menus. |
And we got the P&R reply -The'd prefer to have the WEB UI interface as well |
Thank you Todor. That would have been my answer as well, from the user perspective. |
test this please |
@amaltaro @todor-ivanov I propose to keep going with the proposed solution, with claryfing to operators they will be accountable for any disruptions induced with improper use of this functionality. I have only question:
Btw, @todor-ivanov there are some failing checks for Jenkins |
hi @anpicci :
Yes. The site names are predefined and already populated in the list of possibilities:
I cannot make sense of these so far. @khurtado how should I read them? I see you have cancelled the tests, are those reliable at this stage? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, please find some comments and requests along the code.
@@ -22,11 +23,48 @@ | |||
from WMCore.WMSpec.WMWorkloadTools import loadSpecClassByType, setArgumentsWithDefault | |||
from WMCore.Cache.GenericDataCache import GenericDataCache, MemoryCacheStruct | |||
|
|||
|
|||
def workqueue_stat_validation(request_args): | |||
stat_keys = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you forgot to update this code with the new constant variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are importing ALLOWED_STAT_KEYS, so you can replace this line by that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct
workload.setPriority(reqArgs['RequestPriority']) | ||
cherrypy.log('Updated priority of "{}" to: {}'.format(workload.name(), reqArgs['RequestPriority'])) | ||
elif reqArg == "SiteWhitelist": | ||
workload.setSiteWhitelist(reqArgs["SiteWhitelist"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lost track of where I provided this feedback, but we need to validate the site list provided according to the StdBase base class. Similarly to the other site list.
Check out the validation function here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1148
cherrypy.log('Updated SiteBlacklist of "{}", with: {}'.format(workload.name(), reqArgs['SiteBlacklist'])) | ||
else: | ||
reqArgsNothandled.append(reqArg) | ||
cherrypy.log("Unhandled argument for no-status update: %s" % reqArg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be somehow verbose. I would suggest moving this line (well, with the relevant update) under the if reqArgsNothandled:
conditional block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to .. in the next PR this whole line completely vanishes....
if reqArgsNothandled: | ||
if reqStatus == 'assignment-approved': | ||
cherrypy.log(f"Handling assignment-approved arguments differently!") | ||
self._handleAssignmentStateTransition(workload, request_args, dn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to _handleAssignmentStateTransition
should only be performed when we are making a workflow assignment, which is defined from a state transition from assignment-approved
to assigned
, which we do not have in here. What am I missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this exactly is the line which solves the bug because of which this PR was stopped. And in relation also to my previous comment - as long as we do not want to refactor the whole logic how we make those request parameters changes we will still have to treat assignment-approved
status separately even when it comes to non-state
transition changes.
the alternative would be to re-implement all the calls from the method _handleAssignmentStateTransition
(basically copy paste them) here as well.
reqArgsNothandled.append(reqArg) | ||
cherrypy.log("Unhandled argument for no-status update: %s" % reqArg) | ||
|
||
reqStatus = self.reqmgr_db_service.getRequestByNames(workload.name())[workload.name()]['RequestStatus'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is line of code is really needed, then we should only execute it on demand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is needed, as long as we want to not approach the whole action in a generic way and unite all request parameters changes in a common mechanism. we will have to know about the status of the request even during non-status update actions.
else: | ||
msg = "There are invalid arguments for no-status update: %s" % request_args | ||
raise InvalidSpecParameterValue(msg) | ||
reqArgs = deepcopy(request_args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have already discussed this line in your other PR, but can you please remind me why we need to have a full copy of the object at this place?
From what I can tell, we only pop RequestName
in the method validate_request_update_args()
, which is called here:
def validate(self, apiobj, method, api, param, safe): |
So making this is a tardy use of deepcopy, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the whole chain of calls for validating and implementing a change to the request is done through destructive methods, this one not making any difference. And some of the logic (also some of it was mentioned by you yourself during one of those meetings) was relying on those destructive calls during the validation etc. etc. Hence we'd like to act here only through a completely separate copy of the request_arguments
(which is really nothing big in terms of size) and not alter the already existing logic of validating the request arguments. So in short - to stay on the safe side.
|
||
from Utils.PythonVersion import PY3 | ||
from Utils.Utilities import encodeUnicodeToBytesConditional | ||
from WMCore.Lexicon import procdataset | ||
from WMCore.REST.Auth import authz_match | ||
from WMCore.ReqMgr.DataStructs.Request import initialize_request_args, initialize_clone | ||
from WMCore.ReqMgr.DataStructs.RequestError import InvalidStateTransition, InvalidSpecParameterValue | ||
from WMCore.ReqMgr.DataStructs.RequestStatus import check_allowed_transition, STATES_ALLOW_ONLY_STATE_TRANSITION | ||
from WMCore.ReqMgr.DataStructs.RequestStatus import check_allowed_transition, get_modifiable_properties, STATES_ALLOW_ONLY_STATE_TRANSITION, ALLOWED_STAT_KEYS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please break it in multiple lines, e.g.:
from WMCore.ReqMgr.DataStructs.RequestStatus import (check_allowed_transition, get_modifiable_properties,
STATES_ALLOW_ONLY_STATE_TRANSITION, ALLOWED_STAT_KEYS)
Jenkins results:
|
This whole PR is transient, providing a bugfix related to |
Jenkins results:
|
Add proper checks for allowed request properties changes && Stop reducing request_args only to RequestPriority for noStatusTransition actions Extend allowed arguments including stat_keys and RequestStatus && Call the relevant modifiers from _handleNoStatusUpdate Fix missing RequestName from request_args on mutipple calls of validate_request_update_args Add proper log messages for Sitelists changes Remove redundant validation calls && Update reqdb couch with single action && Move reduceReport to Utils. Typo Remove forgotten commented lines of code Review comments Source files pylint fixes Review comments Exclude RequestStatus from the returned values of get_modifiable_properties Remove commented lines Separate state transition validation from nonState transition valdations Handle assignment-approved arguments differently inside _handleNoSatusUpdate calls Review comments
Unit tests Unit tests pylint fixes Unit tests - remove tests for reduceReport Unit tests Unit tests
f9df765
to
49b07c5
Compare
squashed and rebased as well |
Jenkins results:
|
@todor-ivanov unless you have a strong reason not to, I think it would be better to bring in #12099 on top of these changes - or at least, the relevant ReqMgr2 changes. This will make traceability and review of these changes (which are implemented and then partially/totally deleted in the upcoming PR) much easier. The other PR is supposed to be fixing "Update workqueue elements upon workflow site list change", so it would be better to have only that relevant code in there as well. |
hi @amaltaro did you mean:
|
I was suggesting something similar to the first approach, but it does not have to be a full merge of the two PRs because that means you would be mixing:
It would be great if we could cover the ReqMgr2 related changes in this PR only (of course, other than what is required for the WorkQueue update). |
@amaltaro I have rebased #12099 on top of this PR, but currently it is not visible all that will be moved from this PR into the other, until we merge the current one into master, simply because the current changes are not yet into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@todor-ivanov other than a comment along the code, I have 2 more general comments:
- I think I requested it in my previous review. We need to validate the site lists provided by the user, see StdBase for the actual validation function.
- this PR introduces 2 extra calls to ReqMgr2, retrieving the workflow data for some (pre-)processing, as in:
- in the Validation.py module: https://github.com/dmwm/WMCore/pull/12120/files#diff-87fd02ae1a3e3ffa5aec6f34177fcf3471187007cfb2d9a668bf4ef8f726b12aR107
- in the Request.py module: https://github.com/dmwm/WMCore/pull/12120/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794R444
if possible, we should fetch the request content from ReqMgr2 only once.
@@ -22,11 +23,48 @@ | |||
from WMCore.WMSpec.WMWorkloadTools import loadSpecClassByType, setArgumentsWithDefault | |||
from WMCore.Cache.GenericDataCache import GenericDataCache, MemoryCacheStruct | |||
|
|||
|
|||
def workqueue_stat_validation(request_args): | |||
stat_keys = ['total_jobs', 'input_lumis', 'input_events', 'input_num_files'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are importing ALLOWED_STAT_KEYS, so you can replace this line by that.
hi @amaltaro about:
I think the contents of the request are already validated, but even if not it would be much better to have this development added while working with the next PR
I can relate to this comment. I was already thinking of uniting those two calls in a single one, but again it will happen much easier when we move all So lets merge this one and move this discussion there. The current code is tested and not going to break |
For the site validation, we need to validate it as soon as possible in the chain. Having said that, the proper place would be to either use the Request.py or the Validation.py modules, instead of WMWorkload. |
Fixes #12037
Status
READY
Description
SUPERSEDES: #12146 && #12148
With the current change we allow SiteLists related actions for statuses: staging, acquired, running-open in ReqMgr2
Before making a call to central services for changing any of the request parameters, an additional step is executed to
to check which are the allowed parameter modifications for the given status and if the so provided new values from the rest call actually differ from the workflow parameters already defined in central couchdb.
With the current change we no longer ignore all the rest of the request arguments provided with the REST call in the cases of a change of the Request priority. See: #8457 (comment)
The current change is also meant to address the issue with vanishing parameters during assignment of ACDC workflows as explained here: #12037 (comment). Even though, the issue would have manifested itself for regular workflows as well, if they were to experience any parameter change in a state transition from
assignment-approved
toassigned
. Currently this process is taken by Unified (and I believe no parameter change was happening during this, so automated, step) and the only manual intervention we perform at this state transition was for ACDC ... ,hence why we noticed the misbehavior with an ACDC workflow.With the current PR we suggest a separate logic in the validation function between
state
andnon-state
transition workflow updates.Is it backward compatible (if not, which system it affects?)
YES
Related PRs
This PR provides a cherry-pick of 3 pull requests that have been recently reverted. Changes have been originally provided by:
#12077
#12108
#12111
External dependencies / deployment changes
No