This repository has been archived by the owner on Jun 25, 2024. It is now read-only.

Expose error handling services #519

Conversation

@bshephar bshephar commented Nov 14, 2023

Add handling for max_fail_percentage and any_errors_fatal

This change adds handling for the max_fail_percentage and
any_errors_fatal arguments. When set on an OpenStackDataPlaneService CR,
they are passed via --extra-vars as edpm_max_fail_percentage and
edpm_any_errors_fatal. Otherwise, they can be provided globally to all
services, like any other Ansible variable, under ansibleVars on the
OpenStackDataPlaneNodeSet.

If values are not provided, or are later removed, we fall back to the
Ansible defaults.

openshift-ci bot commented Nov 14, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 14, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar marked this pull request as ready for review November 14, 2023 21:56
@openshift-ci openshift-ci bot requested review from rebtoor and stuggi November 14, 2023 21:57
@bshephar
Contributor Author

Unrelated Kuttl failure:

Error running must-gather collection:
    gather did not start for pod must-gather-6z4f5: unable to pull image: ImagePullBackOff: Back-off pulling image "quay.io/openstack-k8s-operators/openstack-must-gather:latest"

@bshephar
Contributor Author

/test dataplane-operator-build-deploy-kuttl

@bshephar
Contributor Author

Confirmed the image can be pulled now:

❯ podman pull quay.io/openstack-k8s-operators/openstack-must-gather:latest
Trying to pull quay.io/openstack-k8s-operators/openstack-must-gather:latest...
Getting image source signatures
Copying blob sha256:9cba3a43324a1e54c728b7fb727b733f180dfcf18f24545e07d73b3fb67ff4d2
Copying blob sha256:42f5b36203c412784f366842a5e0a1b9be98741e6544df06da7c9e01d78e911f
Copying config sha256:f8f8b16929798f4173941fd8f3a676bebb00a43d222a159448186edf9ba10fa9
Writing manifest to image destination
f8f8b16929798f4173941fd8f3a676bebb00a43d222a159448186edf9ba10fa9

bshephar added a commit to bshephar/dataplane-operator that referenced this pull request Nov 15, 2023
This change adds examples for how to define max_fail_percentage and
any_errors_fatal. These were added by:
openstack-k8s-operators#519

Signed-off-by: Brendan Shephard <[email protected]>
Comment on lines 79 to 89

// AnsibleMaxFailPercentage is used to tune the service-specific allowable failure percentage during the Ansible execution
// https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html#setting-a-maximum-failure-percentage
// +kubebuilder:validation:Optional
AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`

// AnsibleAnyErrorsFatal is used to tune the service-specific any_errors_fatal setting
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
Collaborator

I think it is nice having it on services, but it would also be nice to have them on the deployment with the other Ansible parameters, so when using servicesOverride you can set them:
https://github.com/openstack-k8s-operators/dataplane-operator/blob/main/api/v1beta1/openstackdataplanedeployment_types.go#L33-L43

Contributor Author

This would apply to all services though, right? I guess it depends on what we're targeting. My initial understanding of the requirement was that we want the ability to set these on a per-service basis, so that we could have different settings for bootstrap compared to telemetry, for example.

Collaborator

yes, it would apply to all services (I think it would be useful in some cases), but now I agree with #519 (comment)

Contributor Author

Ok, it will work both ways anyway. If you just want to apply it globally to all services, you can provide the values under ansibleVars. If you want to be more granular on a per-service basis, you can set these values on the service itself.

@@ -45,7 +45,7 @@ type ServiceYAML struct {
// DeployService service deployment
func DeployService(ctx context.Context, helper *helper.Helper, obj client.Object, sshKeySecret string, inventorySecret string, aeeSpec dataplanev1.AnsibleEESpec, foundService dataplanev1.OpenStackDataPlaneService) error {

-	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, foundService.Spec.Label, sshKeySecret, inventorySecret, foundService.Spec.Play, foundService.Spec.Playbook, aeeSpec)
+	err := dataplaneutil.AnsibleExecution(ctx, helper, obj, &foundService, sshKeySecret, inventorySecret, aeeSpec)
Collaborator

sweet!
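
For reference, the simplified call implies a signature along these lines. This is only a sketch inferred from the call site above, with stub types standing in for the real operator and controller-runtime types, so the parameter names are assumptions:

package main

import (
	"context"
	"fmt"
)

// Stub types standing in for helper.Helper, client.Object, the AnsibleEESpec
// and the service CR; only the shape of the call matters here.
type Helper struct{}
type AnsibleEESpec struct{}
type OpenStackDataPlaneService struct{ Name string }

// Sketch: the whole service CR is passed in, so AnsibleExecution can read the
// label, play, playbook and the new error-handling fields from it instead of
// receiving each value as a separate argument.
func AnsibleExecution(ctx context.Context, helper *Helper, obj interface{},
	service *OpenStackDataPlaneService, sshKeySecret, inventorySecret string,
	aeeSpec AnsibleEESpec) error {
	fmt.Printf("running AnsibleEE for service %q\n", service.Name)
	return nil
}

func main() {
	svc := OpenStackDataPlaneService{Name: "configure-network"}
	_ = AnsibleExecution(context.TODO(), &Helper{}, nil, &svc,
		"ssh-key-secret", "inventory-secret", AnsibleEESpec{})
}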

@jpodivin (Contributor) left a comment

The min and max values from here should do the trick:
https://book.kubebuilder.io/reference/markers/crd-validation.html
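
For example, the percentage field could be bounded with the standard kubebuilder validation markers. This is a sketch only; the exact bounds and surrounding struct in the PR may differ:

package v1beta1

// Sketch: a max_fail_percentage only makes sense between 0 and 100, so the
// CRD field can be bounded with the Minimum/Maximum validation markers.
type ErrorHandlingExample struct {
	// AnsibleMaxFailPercentage allows a service to tolerate a percentage of
	// failed hosts before the play is aborted.
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`
}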

api/v1beta1/openstackdataplaneservice_types.go (outdated review thread, resolved)

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 519,34612ab132905e6f6a80dbaab77622bcfd72b58a

@bshephar bshephar force-pushed the expose-error-handling-services branch from 34612ab to 011c08a on November 16, 2023 14:14

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 519,011c08a1f07bf079cba4b800de8450cd2cff8aaa

@jpodivin (Contributor) left a comment

It just needs a rebase to avoid errors.

@openshift-ci openshift-ci bot added the lgtm label Nov 16, 2023

// AnsibleAnyErrorsFatal is used to tune the service-specific any_errors_fatal setting
// +kubebuilder:validation:Optional
AnsibleAnyErrorsFatal bool `json:"ansibleAnyErrorsFatal,omitempty"`
Collaborator

I was thinking we would just set variables for these in ansibleVars on the NodeSet instead of adding new CRD fields. There's also an issue (https://issues.redhat.com/browse/OSPRH-1141) to add ansibleVars to OpenStackDataPlaneDeployment, so they could be set per deployment there.

What would be missing is setting them per-service, which is possible with the CRD fields as you've done here.

However, there is another issue with the CRD fields: you cannot modify the Service CRs of our standard services, since they will be reconciled by the NodeSet controller. So I don't see how you could set these fields differently per execution, or even differently from what we ship, without writing a custom service.

Or is the intent that you can use either the fields or set ansible variables?

Contributor Author

Given the current proposal, it would work both ways, right? You could just add the vars to ansibleVars and all services would receive them. Or, if you are adding your own service and you don't want it to stop the overall deployment, you could configure the individual service.

We could also tune our default services. For example, everything needs to pass for bootstrap, configure-network, etc., but a single failure in telemetry could block the entire deployment. So tuning that service to allow a percentage of failures, or not setting any_errors_fatal, could be advantageous for those types of services. In that case, we could update our default service definition accordingly.

Collaborator

In one case it's a var under ansibleVars, but when it's on the service, it's a custom field. The custom field takes precedence since it's implemented using extra vars, but that's not clear from looking at the CRD API (it's something users would have to remember).

I'd rather see it set consistently if we're going to have it in both places. We could even call the variables field ansibleExtraVars on the Service, which makes it clearer that these variables would be set using the Ansible extra-vars interface, which would be familiar to Ansible users.

Contributor Author

Ok, that sounds reasonable. I don't think we want that interface to be extensible enough to set any arbitrary variable though, right? That is, we want to limit this interface on the service to specifically these two error-handling variables? So if I implement this with an ansibleExtraVars struct and move the two currently defined parameters under it, would that satisfy your concern on this one?
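
Something like the following grouping, presumably. A sketch only; the struct and field names here are assumptions based on this thread, with any_errors_fatal as a pointer so that "unset" can fall back to the Ansible default:

package v1beta1

// Sketch of the proposed ansibleExtraVars grouping on the service spec.
type AnsibleExtraVarsExample struct {
	// Percentage of failed hosts the service tolerates before aborting.
	// +kubebuilder:validation:Optional
	AnsibleMaxFailPercentage int64 `json:"ansibleMaxFailPercentage,omitempty"`

	// Pointer so the controller can tell "not set" (nil) apart from an
	// explicit false and only pass the extra var when it was provided.
	// +kubebuilder:validation:Optional
	AnsibleAnyErrorsFatal *bool `json:"ansibleAnyErrorsFatal,omitempty"`
}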

Collaborator

I think we need to think about where we are setting ansible variables and how they are used.

We set them on the nodes and nodeSet, and those end up as host and group vars in the inventory. This makes sense.

We'll likely add vars to the Deployment, and those should use the ExtraVars interface and override all other vars from the inventory (https://issues.redhat.com/browse/OSPRH-1141).

Adding variables to the Service CRD as well might make sense for this use case, but how is the precedence handled and how are they implemented? If they also use ExtraVars, that could conflict with setting them on the Deployment. I don't want to create new precedence rules between dataplane-operator CRDs that users have to remember.

That being said, I'm not sure we need to add variables for these to the Service CRD. Can't we just keep the default values in the playbooks as you've done in openstack-k8s-operators/edpm-ansible#494, use custom variables per playbook, and the user can override those per service by setting the right variable in either the node/nodeSet/Deployment?

For instance, playbooks/bootstrap.yml uses edpm_bootstrap_any_errors_fatal and edpm_bootstrap_max_fail_percentage. And we can fine-tune those defaults for each service in the playbooks directly.

Contributor Author

Ok, we can abandon this one and just push the edpm-ansible one through

This change exposes the Ansible configuration options for
AnsibleMaxFailPercentage and AnsibleAnyErrorsFatal via the Service CRD.
These are passed to the AnsibleEE pods to tune service-specific
failure handling.

Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar force-pushed the expose-error-handling-services branch from 011c08a to d1fbfd0 on November 17, 2023 09:37
@openshift-ci openshift-ci bot removed the lgtm label Nov 17, 2023
@bshephar bshephar force-pushed the expose-error-handling-services branch from d1fbfd0 to b519434 on November 20, 2023 15:15

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/ac90983f3a414aa0a07bea4939fd3015

openstack-k8s-operators-content-provider FAILURE in 7m 05s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
dataplane-operator-docs-preview NODE_FAILURE Node request 200-0006654296 failed in 0s

@bshephar bshephar force-pushed the expose-error-handling-services branch from b519434 to 6cea9e3 on November 20, 2023 23:02
bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 20, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/b957fa747b3e40b183d1ed11d8566b82

openstack-k8s-operators-content-provider FAILURE in 8m 07s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 2m 10s

@bshephar bshephar force-pushed the expose-error-handling-services branch from 6cea9e3 to 55a1726 on November 21, 2023 00:05

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9585323ea2ff4fd8aabfd7d3e94d3b6e

openstack-k8s-operators-content-provider FAILURE in 7m 44s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 2m 08s

@bshephar bshephar force-pushed the expose-error-handling-services branch from 55a1726 to 56ce07c on November 21, 2023 04:40

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/6e60a01d4cd4416bbdbb3355ef3c70f5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 38m 01s
podified-multinode-edpm-deployment-crc FAILURE in 1h 22m 04s
cifmw-crc-podified-edpm-baremetal FAILURE in 23m 11s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 59s

@bshephar bshephar force-pushed the expose-error-handling-services branch 2 times, most recently from 6792052 to 89ac52c on November 21, 2023 06:49

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/bf0e0d0ae10c478aaa9437868887c0f1

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 16m 59s
podified-multinode-edpm-deployment-crc FAILURE in 1h 00m 08s
cifmw-crc-podified-edpm-baremetal FAILURE in 22m 17s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 50s

@gibizer (Contributor) left a comment

I'm wondering about the high-level use case of this.
By default the Ansible run succeeds only when every task succeeds, so if something fails, the AEE and Deployment failure will be clearly visible. However, when a fail percentage is defined, the AEE and Deployment CRs can record success even if a given percentage of tasks failed. How will the human deployer know whether it was a truly successful run or only a partially successful one? Do we need to give some guidance (docs) to the human about it?

Personally, I would be very grumpy troubleshooting a customer case where some compute behaves strangely even though the deployment appears to have succeeded, only to figure out later that one particular task on that one compute failed to execute, causing an issue much later. As a random concrete example: the task putting the ssh config in place for the nova user fails on that particular compute. Everything seems to work until the first VM is about to be cold migrated away from the given compute, which will fail due to the missing ssh config.

if service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage != 0 {
	fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_max_fail_percentage=%d ", service.Spec.AnsibleExtraVars.AnsibleMaxFailPercentage)
}
if !*service.Spec.AnsibleExtraVars.AnsibleAnyErrorsFatal {
Contributor

Hm. I think we want to set the edpm_any_errors_fatal var only if the AnsibleAnyErrorsFatal pointer is not nil, i.e. only when it was actually provided by the user of the CR.

Contributor Author

Right, yeah. I think we need to validate that this isn't nil before trying to use it. I have updated that now.
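
Concretely, the guard could look something like this. A runnable sketch of the pattern under discussion, not the exact code in the PR; the type and function names are illustrative:

package main

import (
	"fmt"
	"strings"
)

// Trimmed-down stand-in for the relevant part of the service spec.
type extraVars struct {
	AnsibleMaxFailPercentage int64
	AnsibleAnyErrorsFatal    *bool
}

// buildArgs only emits an extra var when a value was actually provided:
// a non-zero percentage, or a non-nil any_errors_fatal pointer. Unset values
// are simply omitted so the Ansible defaults apply.
func buildArgs(v extraVars) string {
	var cmdLineArguments strings.Builder
	if v.AnsibleMaxFailPercentage != 0 {
		fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_max_fail_percentage=%d ", v.AnsibleMaxFailPercentage)
	}
	if v.AnsibleAnyErrorsFatal != nil {
		fmt.Fprintf(&cmdLineArguments, "--extra-vars edpm_any_errors_fatal=%t ", *v.AnsibleAnyErrorsFatal)
	}
	return cmdLineArguments.String()
}

func main() {
	fatal := false
	fmt.Printf("%q\n", buildArgs(extraVars{AnsibleMaxFailPercentage: 25, AnsibleAnyErrorsFatal: &fatal}))
	fmt.Printf("%q\n", buildArgs(extraVars{})) // nothing set: empty string, Ansible defaults apply
}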

bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Nov 22, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>
This change adds handling for the max_fail_percentage and
any_errors_fatal arguments. When set on an OpenStackDataPlaneService CR,
they are passed via --extra-vars as edpm_max_fail_percentage and
edpm_any_errors_fatal. Otherwise, they can be provided globally to all
services, like any other Ansible variable, under ansibleVars on the
OpenStackDataPlaneNodeSet.

If values are not provided, or are later removed, we fall back to the
Ansible defaults.

Signed-off-by: Brendan Shephard <[email protected]>
@bshephar bshephar force-pushed the expose-error-handling-services branch from 89ac52c to a400797 on November 22, 2023 03:22

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/ee5d9f446ff142dc9bc72c61dae29d98

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 24m 31s
podified-multinode-edpm-deployment-crc FAILURE in 1h 05m 09s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 53m 47s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 55s

@bshephar
Contributor Author

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/79cbd578992b4fccbf2e5f8328f835c8

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 20m 42s
podified-multinode-edpm-deployment-crc FAILURE in 1h 03m 38s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 55m 49s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 52s

@jpodivin
Contributor

recheck

Failed on tempest.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/17cf42e9e7924e998ddb0da146d170f5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 23m 10s
podified-multinode-edpm-deployment-crc FAILURE in 1h 03m 42s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 57m 22s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 59s

@fao89
Collaborator

fao89 commented Nov 24, 2023

recheck

@openshift-ci openshift-ci bot added the lgtm label Nov 24, 2023

openshift-ci bot commented Nov 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bshephar, jpodivin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gibizer
Contributor

gibizer commented Nov 24, 2023

/hold
The below comment from me was missed:

I'm wondering about the high-level use case of this. By default the Ansible run succeeds only when every task succeeds, so if something fails, the AEE and Deployment failure will be clearly visible. However, when a fail percentage is defined, the AEE and Deployment CRs can record success even if a given percentage of tasks failed. How will the human deployer know whether it was a truly successful run or only a partially successful one? Do we need to give some guidance (docs) to the human about it?

Personally, I would be very grumpy troubleshooting a customer case where some compute behaves strangely even though the deployment appears to have succeeded, only to figure out later that one particular task on that one compute failed to execute, causing an issue much later. As a random concrete example: the task putting the ssh config in place for the nova user fails on that particular compute. Everything seems to work until the first VM is about to be cold migrated away from the given compute, which will fail due to the missing ssh config.

#519 (review)

@bshephar bshephar closed this Nov 24, 2023
bshephar added a commit to bshephar/edpm-ansible that referenced this pull request Dec 4, 2023
This change adds any_errors_fatal and max_fail_percentage to each
playbook. These params will be leveraged via the dataplane-operator
services using the interfaces added in this PR:
openstack-k8s-operators/dataplane-operator#519

Related: openstack-k8s-operators/dataplane-operator#519
Signed-off-by: Brendan Shephard <[email protected]>