You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Workflow run on a Cron schedule via a launch plan, intermittently fails due to webhook certificate verification issue. The workflow contains tasks that need to access k8s secrets.
Error: Workflow[workflow_name] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
Several of these errors (hundreds) are seen in the flyte-binary pod logs -
E0918 08:40:00.476756 7 workers.go:102] error syncing 'project-domain/f2e9a417488011754000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org") E0918 08:44:00.395162 7 workers.go:102] error syncing 'project-domain/f70603b7488f10f54000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org") E0918 09:06:00.438333 7 workers.go:102] error syncing 'project-domain/f7e13ef77882c4e5b000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
Due to multiple retries only few of these surface up and cause a task to be "aborted".
Additional setup info:
The flyte-binary pod has 3 replicas.
The flyte-binary pod runs in the "flyte" namespace.
Worker pods are created in the "project-domain" namespace.
Expected behavior
Tasks should not be aborted / workflow should not fail due to webhook self-signed certificate verification error.
Additional context to reproduce
This issue is not observed on one off workflow runs, but becomes apparent when running a workflow on a LaunchPlan with a CronSchedule to execute every 2 mins.
Steps to reproduce:
Create a sample k8s secret
Create simple task / workflow to get the secret
Run the workflow on a CronSchedule of 2 mins.
Version info pasted in screenshots section.
Screenshots
Are you sure this issue hasn't been raised already?
Yes
Have you read the Code of Conduct?
Yes
The text was updated successfully, but these errors were encountered:
@rxraghu , the scenario you're describing (using multiple replicas of single-binary) is not supported. If you're reaching for that in order to achieve scalability, it's time to use the other (supported) Flyte deployments.
Hi @eapolinario - If I understand correctly, the webhook pod was a separate pod, until it was merged with flyte-binary. This section in the documentation about "Scaling the webhook" mentions that for horizontal scaling, adding multiple replicas for pods in the deployment should be sufficient. Does that not work for flyte-binary? Also, can you elaborate what are the "other" supported deployments you mention? We are using the "Single cluster" deployment since we only have one eks cluster.
Describe the bug
Workflow run on a Cron schedule via a launch plan, intermittently fails due to webhook certificate verification issue. The workflow contains tasks that need to access k8s secrets.
Error:
Workflow[workflow_name] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
Several of these errors (hundreds) are seen in the flyte-binary pod logs -
E0918 08:40:00.476756 7 workers.go:102] error syncing 'project-domain/f2e9a417488011754000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org") E0918 08:44:00.395162 7 workers.go:102] error syncing 'project-domain/f70603b7488f10f54000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org") E0918 09:06:00.438333 7 workers.go:102] error syncing 'project-domain/f7e13ef77882c4e5b000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
Due to multiple retries only few of these surface up and cause a task to be "aborted".
Additional setup info:
Expected behavior
Tasks should not be aborted / workflow should not fail due to webhook self-signed certificate verification error.
Additional context to reproduce
This issue is not observed on one off workflow runs, but becomes apparent when running a workflow on a LaunchPlan with a CronSchedule to execute every 2 mins.
Steps to reproduce:
Version info pasted in screenshots section.
Screenshots
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?
The text was updated successfully, but these errors were encountered: