[BUG] Flyte task "aborted" due to webhook certificate verification issue #4050

Open
2 tasks done
rxraghu opened this issue Sep 19, 2023 · 5 comments
Labels: bug, waiting for reporter

rxraghu commented Sep 19, 2023

Describe the bug

A workflow that runs on a cron schedule via a launch plan intermittently fails due to a webhook certificate verification issue. The workflow contains tasks that need to access k8s secrets.

Error:
Workflow[workflow_name] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")

Hundreds of these errors appear in the flyte-binary pod logs:

E0918 08:40:00.476756 7 workers.go:102] error syncing 'project-domain/f2e9a417488011754000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
E0918 08:44:00.395162 7 workers.go:102] error syncing 'project-domain/f70603b7488f10f54000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
E0918 09:06:00.438333 7 workers.go:102] error syncing 'project-domain/f7e13ef77882c4e5b000': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
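
To put a number on how often each execution hits this error, the log entries above can be tallied with a short script. The following is a minimal sketch in Python that assumes the log format matches the excerpt (a klog-style prefix followed by "error syncing '<namespace>/<execution id>'"); the function name and the regular expression are illustrative, not part of Flyte.

```python
import re
import sys
from collections import Counter

# Prefix of each entry as it appears in the excerpt above, e.g.
# E0918 08:40:00.476756 7 workers.go:102] error syncing 'project-domain/<id>':
ENTRY = re.compile(
    r"E(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ workers\.go:\d+\] "
    r"error syncing '(?P<execution>[^']+)'"
)

# Substring that identifies the webhook failure reported in this issue.
WEBHOOK_MARKER = 'failed calling webhook "flyte-pod-webhook.flyte.org"'


def webhook_failures(log_text):
    """Count webhook-call failures per execution in a flyte-binary log dump.

    Entries are located by their prefix rather than by line breaks, so the
    function also works when several entries were pasted onto a single line.
    """
    matches = list(ENTRY.finditer(log_text))
    counts = Counter()
    for i, match in enumerate(matches):
        # The body of an entry runs until the next entry starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        if WEBHOOK_MARKER in log_text[match.end():end]:
            counts[match.group("execution")] += 1
    return counts


if __name__ == "__main__":
    # Reads the log from standard input and prints one line per execution.
    for execution, n in webhook_failures(sys.stdin.read()).most_common():
        print(f"{n:6d}  {execution}")
```

One way to use it is to pipe the output of `kubectl logs` for a flyte-binary pod in the "flyte" namespace into the script; executions with counts near the retry limit are the ones most likely to have been aborted.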

Because of the retries, only a few of these errors surface and cause a task to be "aborted".
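
The "[11/10]" in the error above suggests that a node is aborted only after 11 consecutive failed attempts (one more than a limit of 10). As a rough sketch of why hundreds of logged errors could translate into only a few aborts, the following Python model assumes that each attempt fails independently with the same probability. That probability, the independence assumption, and the function names are illustrative; none of them is measured here or taken from Flyte.

```python
def abort_probability(p_fail, max_system_retries=10):
    """Probability that one node exhausts its system retries.

    Assumes (hypothetically) that every attempt fails independently with
    probability p_fail, and that the node is aborted once
    max_system_retries + 1 attempts in a row have failed.
    """
    return p_fail ** (max_system_retries + 1)


def expected_aborts_per_day(p_fail, interval_minutes=2, max_system_retries=10):
    """Expected number of aborted runs per day for a single-node workflow.

    interval_minutes defaults to the 2-minute schedule described in this issue.
    """
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * abort_probability(p_fail, max_system_retries)


if __name__ == "__main__":
    # p_fail values are made up purely to show how steeply the result varies.
    for p_fail in (0.3, 0.5, 0.7, 0.9):
        print(
            f"p_fail={p_fail:.1f}  "
            f"abort probability={abort_probability(p_fail):.6f}  "
            f"expected aborts/day={expected_aborts_per_day(p_fail):.2f}"
        )
```

Under these assumptions, even a high per-attempt failure rate produces few aborts, which would be consistent with seeing many errors in the logs but only occasional failed workflows.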

Additional setup info:

  • The flyte-binary pod has 3 replicas.
  • The flyte-binary pod runs in the "flyte" namespace.
  • Worker pods are created in the "project-domain" namespace.

Expected behavior

Tasks should not be aborted, and the workflow should not fail, because of a webhook self-signed certificate verification error.

Additional context to reproduce

This issue is not observed on one-off workflow runs, but becomes apparent when a workflow runs on a LaunchPlan with a CronSchedule that executes every 2 minutes.

Steps to reproduce:

  1. Create a sample k8s secret.
  2. Create a simple task / workflow that reads the secret.
  3. Run the workflow on a CronSchedule of every 2 minutes.

Version info pasted in screenshots section.

Screenshots

Screenshot 2023-09-19 at 15 51 05

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
rxraghu added the bug and untriaged labels Sep 19, 2023
welcome bot commented Sep 19, 2023

Thank you for opening your first issue here! 🛠

eapolinario (Contributor) commented

@rxraghu, the scenario you're describing (running multiple replicas of the single binary) is not supported. If you're reaching for that in order to achieve scalability, it's time to move to one of the other (supported) Flyte deployments.

eapolinario added the waiting for reporter label and removed the untriaged label Sep 29, 2023
rxraghu (Author) commented Oct 2, 2023

Hi @eapolinario - If I understand correctly, the webhook was a separate pod until it was merged into flyte-binary. The "Scaling the webhook" section of the documentation says that, for horizontal scaling, adding multiple replicas of the pods in the deployment should be sufficient. Does that not work for flyte-binary? Also, can you elaborate on what the "other" supported deployments you mention are? We are using the "Single cluster" deployment since we only have one EKS cluster.

rxraghu (Author) commented Oct 11, 2023

Hey @eapolinario - any thoughts on this? Do you think having multiple replicas might have something to do with this error?

msemelman commented

Bumping, as I am also interested.
