This repository has been archived by the owner on Jun 25, 2024. It is now read-only.

Expose error handling services #519

Conversation

@bshephar bshephar commented Nov 14, 2023

Add handling for max_fail_percentage and any_errors_fatal

This change adds handling for the max_fail_percentage and
any_errors_fatal arguments. When set on an OpenStackDataPlaneService CR,
they are passed via --extra-vars as edpm_max_fail_percentage and
edpm_any_errors_fatal. Otherwise, they can be provided globally to all
services, like any other Ansible variable, under ansibleVars on the
OpenStackDataPlaneNodeSet.

If values are not provided, or are later removed, we fall back to the
Ansible defaults.

openshift-ci bot commented Nov 14, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 14, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar marked this pull request as ready for review November 14, 2023 21:56
@openshift-ci openshift-ci bot requested review from rebtoor and stuggi November 14, 2023 21:57
@bshephar
Contributor Author

Unrelated Kuttl failure:

Error running must-gather collection:
    gather did not start for pod must-gather-6z4f5: unable to pull image: ImagePullBackOff: Back-off pulling image "quay.io/openstack-k8s-operators/openstack-must-gather:latest"

@bshephar
Contributor Author

/test dataplane-operator-build-deploy-kuttl

@bshephar
Contributor Author

Confirmed the image can be pulled now:

❯ podman pull quay.io/openstack-k8s-operators/openstack-must-gather:latest
Trying to pull quay.io/openstack-k8s-operators/openstack-must-gather:latest...
Getting image source signatures
Copying blob sha256:9cba3a43324a1e54c728b7fb727b733f180dfcf18f24545e07d73b3fb67ff4d2
Copying blob sha256:42f5b36203c412784f366842a5e0a1b9be98741e6544df06da7c9e01d78e911f
Copying config sha256:f8f8b16929798f4173941fd8f3a676bebb00a43d222a159448186edf9ba10fa9
Writing manifest to image destination
f8f8b16929798f4173941fd8f3a676bebb00a43d222a159448186edf9ba10fa9

bshephar added a commit to bshephar/dataplane-operator that referenced this pull request Nov 15, 2023
This change adds examples for how to define max_fail_percentage and
any_errors_fatal. These were added by:
openstack-k8s-operators#519

Signed-off-by: Brendan Shephard <[email protected]>
Comment on lines 79 to 89

// AnsibleMaxFailPercentage is used to tune the service-specific allowable failure percentage during the Ansible execution
// https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html#setting-a-maximum-failure-percentage
// +kubebuilder:validation:Optional
AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`

// AnsibleAnyErrorsFatal is used to tune the service-specific any_errors_fatal setting
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
Collaborator

I think it is nice having it on services, but it would also be nice to have them on the deployment with the other Ansible parameters, so when using servicesOverride you can set them:
https://github.com/openstack-k8s-operators/dataplane-operator/blob/main/api/v1beta1/openstackdataplanedeployment_types.go#L33-L43

Contributor Author

This would apply to all services though, right? I guess it depends on what we're targeting. My initial understanding of the requirement was that we want the ability to set these on a per-service basis, so that we could have different settings for bootstrap compared to telemetry, for example.

Collaborator

yes, it would apply to all services (I think it would be useful in some cases), but now I agree with #519 (comment)

Contributor Author

Ok, it will work both ways anyway. If you just want to apply it globally to all services, you can provide the values under ansibleVars. If you want to be more granular on a per-service basis, you can set these values on the service itself.

@@ -45,7 +45,7 @@ type ServiceYAML struct {
// DeployService service deployment
func DeployService(ctx context.Context, helper *helper.Helper, obj client.Object, sshKeySecret string, inventorySecret string, aeeSpec dataplanev1.AnsibleEESpec, foundService dataplanev1.OpenStackDataPlaneService) error {

-	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, foundService.Spec.Label, sshKeySecret, inventorySecret, foundService.Spec.Play, foundService.Spec.Playbook, aeeSpec)
+	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, &foundService, sshKeySecret, inventorySecret, aeeSpec)
Collaborator

sweet!
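
For reference, the simplified call implies a signature along these lines. This is only a sketch inferred from the call site above, with stub types standing in for the real operator and controller-runtime types, so the parameter names are assumptions:

package main

import (
	"context"
	"fmt"
)

// Stub types standing in for helper.Helper, client.Object, the AnsibleEESpec
// and the service CR; only the shape of the call matters here.
type Helper struct{}
type AnsibleEESpec struct{}
type OpenStackDataPlaneService struct{ Name string }

// Sketch: the whole service CR is passed in, so AnsibleExecution can read the
// label, play, playbook and the new error-handling fields from it instead of
// receiving each value as a separate argument.
func AnsibleExecution(ctx context.Context, helper *Helper, obj interface{},
	service *OpenStackDataPlaneService, sshKeySecret, inventorySecret string,
	aeeSpec AnsibleEESpec) error {
	fmt.Printf("running AnsibleEE for service %q\n", service.Name)
	return nil
}

func main() {
	svc := OpenStackDataPlaneService{Name: "configure-network"}
	_ = AnsibleExecution(context.TODO(), &Helper{}, nil, &svc,
		"ssh-key-secret", "inventory-secret", AnsibleEESpec{})
}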

@jpodivin (Contributor) left a comment

The min and max values from here should do the trick:
https://book.kubebuilder.io/reference/markers/crd-validation.html
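
For example, the percentage field could be bounded with the standard kubebuilder validation markers. This is a sketch only; the exact bounds and surrounding struct in the PR may differ:

package v1beta1

// Sketch: a max_fail_percentage only makes sense between 0 and 100, so the
// CRD field can be bounded with the Minimum/Maximum validation markers.
type ErrorHandlingExample struct {
	// AnsibleMaxFailPercentage allows a service to tolerate a percentage of
	// failed hosts before the play is aborted.
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`
}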

api/v1beta1/openstackdataplaneservice_types.go (outdated review thread, resolved)

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 519,34612ab132905e6f6a80dbaab77622bcfd72b58a

@bshephar bshephar force-pushed the expose-error-handling-services branch from 34612ab to 011c08a on November 16, 2023 14:14

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 519,011c08a1f07bf079cba4b800de8450cd2cff8aaa

@jpodivin (Contributor) left a comment

It just needs a rebase to avoid errors.

@openshift-ci openshift-ci bot added the lgtm label Nov 16, 2023

// AnsibleAnyErrorsFatal is used to tune the service-specific any_errors_fatal setting
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
Collaborator

I was thinking we would just set variables for these in ansibleVars on the NodeSet instead of adding new CRD fields. There's also an issue (https://issues.redhat.com/browse/OSPRH-1141) to add ansibleVars to OpenStackDataPlaneDeployment, so they could be set per deployment there.

What would be missing is setting them per-service, which is possible with the CRD fields as you've done here.

However, there is another issue with the CRD fields: you cannot modify the Service CRs of our standard services, since they will be reconciled by the NodeSet controller. So I don't see how you could set these fields differently per execution, or even differently from what we ship, without writing a custom service.

Or is the intent that you can use either the fields or set ansible variables?

Contributor Author

Given the current proposal, it would work both ways, right? You could just add the vars to ansibleVars and all services would receive them. Or, if you are adding your own service and you don't want it to stop the overall deployment, you could configure the individual service.

We could also tune our default services. For example, everything needs to pass for bootstrap, configure-network, etc., but a single failure in telemetry could block the entire deployment. So tuning that service to allow a percentage of failures, or not setting any_errors_fatal, could be advantageous for those types of services. In that case, we could update our default service definition accordingly.

Collaborator

In one case it's a var under ansibleVars, but when it's on the service, it's a custom field. The custom field takes precedence since it's implemented using extra vars, but that's not clear from looking at the CRD API (it's something users would have to remember).

I'd rather see it set consistently if we're going to have it in both places. We could even call the variables field ansibleExtraVars on the Service, which makes it clearer that these variables would be set using the Ansible extra-vars interface, which would be familiar to Ansible users.

Contributor Author

Ok, that sounds reasonable. I don't think we want that interface to be extensible enough to set any arbitrary variable though, right? That is, we want to limit this interface on the service to specifically these two error-handling variables? So if I implement this with an ansibleExtraVars struct and move the two currently defined parameters under it, would that satisfy your concern on this one?
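
Something like the following grouping, presumably. A sketch only; the struct and field names here are assumptions based on this thread, with any_errors_fatal as a pointer so that "unset" can fall back to the Ansible default:

package v1beta1

// Sketch of the proposed ansibleExtraVars grouping on the service spec.
type AnsibleExtraVarsExample struct {
	// Percentage of failed hosts the service tolerates before aborting.
	// +kubebuilder:validation:Optional
	AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`

	// Pointer so the controller can tell "not set" (nil) apart from an
	// explicit false and only pass the extra var when it was provided.
	// +kubebuilder:validation:Optional
	AnsibleAnyErrorsFatal *bool `json:"ansibleAnyErrorsFatal,omitempty"`
}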

Collaborator

I think we need to think about where we are setting ansible variables and how they are used.

We set them on the nodes and nodeSet, and those end up as host and group vars in the inventory. This makes sense.

We'll likely add vars to the Deployment, and those should use the ExtraVars interface and override all other vars from the inventory (https://issues.redhat.com/browse/OSPRH-1141).

Adding variables to the Service CRD as well might make sense for this use case, but how is the precedence handled and how are they implemented? If they also use ExtraVars, that could conflict with setting them on the Deployment. I don't want to create new precedence rules between dataplane-operator CRDs that users have to remember.

That being said, I'm not sure we need to add variables for these to the Service CRD. Can't we just keep the default values in the playbooks as you've done in openstack-k8s-operators/edpm-ansible#494, use custom variables per playbook, and the user can override those per service by setting the right variable in either the node/nodeSet/Deployment?

For instance, playbooks/bootstrap.yml uses edpm_bootstrap_any_errors_fatal and edpm_bootstrap_max_fail_percentage. And we can fine-tune those defaults for each service in the playbooks directly.

Contributor Author

Ok, we can abandon this one and just push the edpm-ansible one through

This change exposes the Ansible configuration options for
AnsibleMaxFailPercentage and AnsibleAnyErrorsFatal via the Service CRD.
These are passed to the AnsibleEE pods to tune service-specific
failure handling.

Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar force-pushed the expose-error-handling-services branch from 011c08a to d1fbfd0 on November 17, 2023 09:37
@openshift-ci openshift-ci bot removed the lgtm label Nov 17, 2023
@bshephar bshephar force-pushed the expose-error-handling-services branch from d1fbfd0 to b519434 on November 20, 2023 15:15

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/ac90983f3a414aa0a07bea4939fd3015

openstack-k8s-operators-content-provider FAILURE in 7m 05s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
dataplane-operator-docs-preview NODE_FAILURE Node request 200-0006654296 failed in 0s

@bshephar bshephar force-pushed the expose-error-handling-services branch from b519434 to 6cea9e3 on November 20, 2023 23:02
bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 20, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/b957fa747b3e40b183d1ed11d8566b82

openstack-k8s-operators-content-provider FAILURE in 8m 07s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 2m 10s

@bshephar bshephar force-pushed the expose-error-handling-services branch from 6cea9e3 to 55a1726 on November 21, 2023 00:05

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9585323ea2ff4fd8aabfd7d3e94d3b6e

openstack-k8s-operators-content-provider FAILURE in 7m 44s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 2m 08s

@bshephar bshephar force-pushed the expose-error-handling-services branch from 55a1726 to 56ce07c on November 21, 2023 04:40

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/6e60a01d4cd4416bbdbb3355ef3c70f5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 38m 01s
podified-multinode-edpm-deployment-crc FAILURE in 1h 22m 04s
cifmw-crc-podified-edpm-baremetal FAILURE in 23m 11s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 59s

@bshephar bshephar force-pushed the expose-error-handling-services branch 2 times, most recently from 6792052 to 89ac52c on November 21, 2023 06:49

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/bf0e0d0ae10c478aaa9437868887c0f1

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 16m 59s
podified-multinode-edpm-deployment-crc FAILURE in 1h 00m 08s
cifmw-crc-podified-edpm-baremetal FAILURE in 22m 17s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 50s

@gibizer (Contributor) left a comment

I'm wondering about the high-level use case of this.
By default the Ansible run succeeds only when every task succeeds, so if something fails, the AEE and Deployment failure will be clearly visible. However, when a fail percentage is defined, the AEE and Deployment CRs can record success even if a given percentage of tasks failed. How will the human deployer know whether it was a truly successful run or only a partially successful one? Do we need to give some guidance (docs) to the human about it?

Personally, I would be very grumpy troubleshooting a customer case where some compute behaves strangely even though the deployment appears to have succeeded, only to figure out later that one particular task on that one compute failed to execute, causing an issue much later. As a random concrete example: the task putting the ssh config in place for the nova user fails on that particular compute. Everything seems to work until the first VM is about to be cold migrated away from the given compute, which will fail due to the missing ssh config.

if service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage != 0 {
	fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_max_fail_percentage=%d ", service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage)
}
if !*service.Spec.AnsibleExtraVars.AnsibleAnyErrorsFatal {
Contributor

Hm. I think we want to set the edpm_any_errors_fatal var only if the AnsibleAnyErrorsFatal pointer is not nil, i.e. only when it was actually provided by the user of the CR.

Contributor Author

Right, yeah. I think we need to validate that this isn't nil before trying to use it. I have updated that now.
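
Concretely, the guard could look something like this. A runnable sketch of the pattern under discussion, not the exact code in the PR; the type and function names are illustrative:

package main

import (
	"fmt"
	"strings"
)

// Trimmed-down stand-in for the relevant part of the service spec.
type extraVars struct {
	AnsibleMaxFailPercentage int64
	AnsibleAnyErrorsFatal    *bool
}

// buildArgs only emits an extra var when a value was actually provided:
// a non-zero percentage, or a non-nil any_errors_fatal pointer. Unset values
// are simply omitted so the Ansible defaults apply.
func buildArgs(v extraVars) string {
	var cmdLineArguments strings.Builder
	if v.AnsibleMaxFailPercentage != 0 {
		fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_max_fail_percentage=%d ", v.AnsibleMaxFailPercentage)
	}
	if v.AnsibleAnyErrorsFatal != nil {
		fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_any_errors_fatal=%t ", *v.AnsibleAnyErrorsFatal)
	}
	return cmdLineArguments.String()
}

func main() {
	fatal := false
	fmt.Printf("%q\n", buildArgs(extraVars{AnsibleMaxFailPercentage: 25, AnsibleAnyErrorsFatal: &fatal}))
	fmt.Printf("%q\n", buildArgs(extraVars{})) // nothing set: empty string, Ansible defaults apply
}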

bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 22, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>
This change adds handling for the max_fail_percentage and
any_errors_fatal arguments. When set on an OpenStackDataPlaneService CR,
they are passed via --extra-vars as edpm_max_fail_percentage and
edpm_any_errors_fatal. Otherwise, they can be provided globally to all
services, like any other Ansible variable, under ansibleVars on the
OpenStackDataPlaneNodeSet.

If values are not provided, or are later removed, we fall back to the
Ansible defaults.

Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar force-pushed the expose-error-handling-services branch from 89ac52c to a400797 on November 22, 2023 03:22

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/ee5d9f446ff142dc9bc72c61dae29d98

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 24m 31s
podified-multinode-edpm-deployment-crc FAILURE in 1h 05m 09s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 53m 47s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 55s

@bshephar
Contributor Author

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/79cbd578992b4fccbf2e5f8328f835c8

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 20m 42s
podified-multinode-edpm-deployment-crc FAILURE in 1h 03m 38s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 55m 49s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 52s

@jpodivin
Contributor

recheck

Failed on tempest.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/17cf42e9e7924e998ddb0da146d170f5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 23m 10s
podified-multinode-edpm-deployment-crc FAILURE in 1h 03m 42s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 57m 22s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 59s

@fao89
Collaborator

fao89 commented Nov 24, 2023

recheck

@openshift-ci openshift-ci bot added the lgtm label Nov 24, 2023

openshift-ci bot commented Nov 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bshephar, jpodivin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gibizer
Contributor

gibizer commented Nov 24, 2023

/hold
The below comment from me was missed:

I'm wondering about the high-level use case of this. By default the Ansible run succeeds only when every task succeeds, so if something fails, the AEE and Deployment failure will be clearly visible. However, when a fail percentage is defined, the AEE and Deployment CRs can record success even if a given percentage of tasks failed. How will the human deployer know whether it was a truly successful run or only a partially successful one? Do we need to give some guidance (docs) to the human about it?

Personally, I would be very grumpy troubleshooting a customer case where some compute behaves strangely even though the deployment appears to have succeeded, only to figure out later that one particular task on that one compute failed to execute, causing an issue much later. As a random concrete example: the task putting the ssh config in place for the nova user fails on that particular compute. Everything seems to work until the first VM is about to be cold migrated away from the given compute, which will fail due to the missing ssh config.

#519 (review)

@bshephar bshephar closed this Nov 24, 2023
bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Dec 4, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>