Expose error handling services #519
Conversation
Skipping CI for Draft Pull Request.
This change adds any_errors_fatal and max_fail_percentage to each playbook. These params will be leveraged via the dataplane-operator services using the interfaces added in this PR: openstack-k8s-operators/dataplane-operator#519 Related: openstack-k8s-operators/dataplane-operator#519 Signed-off-by: Brendan Shephard <[email protected]>
Unrelated Kuttl failure:
/test dataplane-operator-build-deploy-kuttl
Confirmed the image can be pulled now:
This change adds examples for how to define max_fail_percentage and any_errors_fatal. These were added by: openstack-k8s-operators#519 Signed-off-by: Brendan Shephard <[email protected]>
// AnsibleMaxFailPercentage is used to tune service-specific, allowable failure percentages during the Ansible execution
// https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html#setting-a-maximum-failure-percentage
// +kubebuilder:validation:Optional
AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`

// AnsibleAnyErrorsFatal is used to tune service-specific any_errors_fatal
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
I think it is nice having it on services, but it would also be nice to have them on the deployment with the other ansible parameters, so when using servicesOverride you can set them:
https://github.com/openstack-k8s-operators/dataplane-operator/blob/main/api/v1beta1/openstackdataplanedeployment_types.go#L33-L43
This would apply to all services though, right? I guess it depends on what we're targeting. My initial understanding of the requirement was that we want the ability to set these on a per-service basis, so that we could have different settings for bootstrap compared to telemetry, for example.
yes, it would apply to all services (I think it would be useful in some cases), but now I agree with #519 (comment)
Ok, it will work that way anyway. So if you just want to apply it globally to all services, you can provide the values under ansibleVars. If you want to be more granular about it on a per-service basis, you can set these values on the service itself.
@@ -45,7 +45,7 @@ type ServiceYAML struct {
 // DeployService service deployment
 func DeployService(ctx context.Context, helper *helper.Helper, obj client.Object, sshKeySecret string, inventorySecret string, aeeSpec dataplanev1.AnsibleEESpec, foundService dataplanev1.OpenStackDataPlaneService) error {

-	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, foundService.Spec.Label, sshKeySecret, inventorySecret, foundService.Spec.Play, foundService.Spec.Playbook, aeeSpec)
+	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, &foundService, sshKeySecret, inventorySecret, aeeSpec)
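The refactor above passes the whole service object to AnsibleExecution instead of unpacking its fields. A minimal sketch of the pattern, with hypothetical simplified types rather than the operator's real signatures:

```go
package main

import "fmt"

// Illustrative stand-ins for the operator types.
type ServiceSpec struct {
	Label    string
	Playbook string
}

type OpenStackDataPlaneService struct {
	Spec ServiceSpec
}

// ansibleExecution takes the service object and reads what it needs,
// so new spec fields (like the error-handling ones in this PR) become
// available without changing the function signature.
func ansibleExecution(service *OpenStackDataPlaneService) string {
	return fmt.Sprintf("running %s (label=%s)", service.Spec.Playbook, service.Spec.Label)
}

func main() {
	svc := OpenStackDataPlaneService{Spec: ServiceSpec{Label: "bootstrap", Playbook: "bootstrap.yml"}}
	fmt.Println(ansibleExecution(&svc)) // running bootstrap.yml (label=bootstrap)
}
```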
sweet!
The Minimum and Maximum validation markers from here should do the trick:
https://book.kubebuilder.io/reference/markers/crd-validation.html
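A sketch of what those markers could look like on the field (names from the diff above). Since the markers only take effect in the CRD schema generated by controller-gen, a small helper mirroring the same 0-100 rule is included so the check is runnable:

```go
package main

import "fmt"

// AnsibleErrorHandling sketches bounding the percentage with kubebuilder
// validation markers; controller-gen consumes the markers when generating
// the CRD schema.
type AnsibleErrorHandling struct {
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	// +kubebuilder:validation:Optional
	AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`
}

// validPercentage mirrors the schema rule at runtime.
func validPercentage(p int64) bool {
	return p >= 0 && p <= 100
}

func main() {
	fmt.Println(validPercentage(20))  // true
	fmt.Println(validPercentage(150)) // false
}
```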
Merge Failed. This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
It just needs a rebase to avoid errors.
// AnsibleAnyErrorsFatal is used to tune service-specific any_errors_fatal
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
I was thinking we would just set variables for these in ansibleVars on the NodeSet instead of adding new CRD fields. There's also an issue (https://issues.redhat.com/browse/OSPRH-1141) to add ansibleVars to OpenStackDataPlaneDeployment, so they could be set per deployment there.
What would be missing is setting them per-service, which is possible with the CRD fields as you've done here.
However, there is another issue with the CRD fields in that you can not modify the Service CRs of our standard Services since they will be reconciled by the NodeSet controller. So, I don't see how you could set these fields differently per execution, or even from what we ship without writing a custom service.
Or is the intent that you can use either the fields or set ansible variables?
Given the current proposal, it would work both ways, right? You could just add the vars to ansibleVars and all services would receive them. Or, if you are adding your own service and you don't want it to stop the overall deployment, you could configure the individual service.
We could also tune our default services. For example, everything needs to pass bootstrap, configure-network, etc., but telemetry could block the entire deployment with a single failure. So tuning that service to allow a percentage of failures, or to not set any_errors_fatal, could be advantageous for those types of services. In that case, we could update our default service definitions accordingly.
In one case it's a var under ansibleVars, but when it's on the service, it's a custom field. The custom field takes precedence since it's implemented using extra vars, but that's not clear from looking at the CRD API (it's something users would have to remember).
I'd rather see it consistently set if we're going to have it in both places. We could even call the variables field ansibleExtraVars on the Service, which makes it more clear that these variables would be set using the ansible extra vars interface, which would be familiar to ansible users.
Ok, that sounds reasonable. We don't want that interface to be extensible enough to set any arbitrary variable though, right? That is, we want to limit this interface on the service to specifically these two error handling variables? So, if I implement this with an ansibleExtraVars struct and move the two currently defined parameters under that struct, would that satisfy your concern on this one?
I think we need to think about where we are setting ansible variables and how they are used.
We set them on the nodes and nodeSet, and those end up as host and group vars in the inventory. This makes sense.
We'll likely add vars to the Deployment, and those should use the ExtraVars interface and override all other vars from the inventory (https://issues.redhat.com/browse/OSPRH-1141).
Adding variables to the Service CRD as well might make sense for this use case, but how is the precedence handled and how are they implemented? If they also use ExtraVars, that could conflict with setting them on the Deployment. I don't want to create new precedence rules between dataplane-operator CRDs that users have to remember.
That being said, I'm not sure we need to add variables for these to the Service CRD. Can't we just keep the default values in the playbooks as you've done in openstack-k8s-operators/edpm-ansible#494, use custom variables per playbook, and the user can override those per service by setting the right variable in either the node/nodeSet/Deployment?
For instance, playbooks/bootstrap.yml uses edpm_bootstrap_any_errors_fatal and edpm_bootstrap_max_fail_percentage. And we can fine-tune those defaults for each service in the playbooks directly.
Ok, we can abandon this one and just push the edpm-ansible one through
This change exposes the Ansible configuration options for AnsibleMaxFailPercentage and AnsibleAnyErrorsFatal via the Services CRD. This will be passed to the AnsibleEE pods to tune specific service failure handling. Signed-off-by: Brendan Shephard <[email protected]>
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/ac90983f3a414aa0a07bea4939fd3015 ❌ openstack-k8s-operators-content-provider FAILURE in 7m 05s
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/b957fa747b3e40b183d1ed11d8566b82 ❌ openstack-k8s-operators-content-provider FAILURE in 8m 07s
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/9585323ea2ff4fd8aabfd7d3e94d3b6e ❌ openstack-k8s-operators-content-provider FAILURE in 7m 44s
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/6e60a01d4cd4416bbdbb3355ef3c70f5 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 38m 01s
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/bf0e0d0ae10c478aaa9437868887c0f1 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 16m 59s
I'm wondering about the high level use case of this.
By default the ansible run succeeds only when every task succeeds, so if something fails, the AEE and Deployment failure will be clearly visible. However, when fail_percentage is defined, the AEE and Deployment CRs can record success even if a given percentage of tasks failed. How will the human deployer know whether it was a truly successful run or a partially successful one? Do we need to give some guidance (doc) to the human about it?
Personally, I would be very grumpy troubleshooting a customer case where some compute behaves strangely even though the deployment seems to have succeeded, only to figure out later that one particular task on that one compute failed to execute, causing an issue much later. As a random concrete example: the task putting the ssh config in place for the nova user fails on that particular compute. Everything seems to work until the first VM is about to be cold migrated away from the given compute, which will fail due to the missing ssh config.
pkg/util/ansible_execution.go
if service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage != 0 {
	fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_max_fail_percentage=%d ", service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage)
}
if !*service.Spec.AnsibleExtraVars.AnsibleAnyErrorsFatal {
Hm. I think we only want to set the edpm_any_errors_fatal var if the AnsibleAnyErrorsFatal pointer is not nil, i.e. it was actually provided by the user of the CR; otherwise this dereference can panic.
Right, yeah. I think we need to validate that this isn't nil before trying to use it. I have updated that now.
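The agreed nil-guard can be sketched as follows. This uses simplified stand-in types, not the exact operator code; the *bool distinguishes "unset" (nil) from an explicit false:

```go
package main

import (
	"fmt"
	"strings"
)

// AnsibleExtraVars is a simplified stand-in for the service spec fields.
type AnsibleExtraVars struct {
	AnsibleMaxFailPercentage int64
	AnsibleAnyErrorsFatal    *bool
}

// buildExtraVarArgs mirrors the guarded argument construction: flags are
// only emitted for values the user actually set, so Ansible's own
// defaults apply otherwise.
func buildExtraVarArgs(v AnsibleExtraVars) string {
	var args strings.Builder
	if v.AnsibleMaxFailPercentage != 0 {
		fmt.Fprintf(&args, "--extra-vars edpm_max_fail_percentage=%d ", v.AnsibleMaxFailPercentage)
	}
	if v.AnsibleAnyErrorsFatal != nil {
		// Nil check before dereference: only pass the var when it was provided.
		fmt.Fprintf(&args, "--extra-vars edpm_any_errors_fatal=%t ", *v.AnsibleAnyErrorsFatal)
	}
	return strings.TrimSpace(args.String())
}

func main() {
	fatal := false
	fmt.Println(buildExtraVarArgs(AnsibleExtraVars{AnsibleMaxFailPercentage: 20, AnsibleAnyErrorsFatal: &fatal}))
	// → --extra-vars edpm_max_fail_percentage=20 --extra-vars edpm_any_errors_fatal=false
	fmt.Println(buildExtraVarArgs(AnsibleExtraVars{})) // empty: nothing was set
}
```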
This change adds handling for the max_fail_percentage and any_errors_fatal arguments. These are provided via --extra-vars as edpm_max_fail_percentage and edpm_any_errors_fatal when used on the OpenStackDataPlaneService CRs. Otherwise, they can be provided globally to all services, like any other Ansible variable, under ansibleVars on the OpenStackDataPlaneNodeSet. If values are not provided, or are removed, we fall back to the Ansible default values. Signed-off-by: Brendan Shephard <[email protected]>
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/ee5d9f446ff142dc9bc72c61dae29d98 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 24m 31s
recheck
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/79cbd578992b4fccbf2e5f8328f835c8 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 20m 42s
recheck
Failed on tempest.
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/17cf42e9e7924e998ddb0da146d170f5 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 23m 10s
recheck
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bshephar, jpodivin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold
Add handling for max_fail_percentage and any_errors_fatal
This change adds handling for the max_fail_percentage and
any_errors_fatal arguments. These are provided via --extra-vars as
edpm_max_fail_percentage and edpm_any_errors_fatal when used on the
OpenStackDataPlaneService CRs. Otherwise, they can be provided globally
to all services, like any other Ansible variable, under ansibleVars on
the OpenStackDataPlaneNodeSet.
If values are not provided, or are removed, we fall back to the Ansible
default values.