Validate config using STDIN and /dev/fd/0 #10296

Closed
wants to merge 52 commits into from

@ryanrolds ryanrolds commented Nov 14, 2024

Description

Large configs passed as command-line arguments caused an "arguments too long" error during validation. To address this, the PR changes how the config is provided to Envoy: the config is now fed to the process via STDIN, and --config-yaml is set to the file descriptor for STDIN.
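
For context, a minimal sketch of the failure mode, not taken from this PR: when an argument is too large, the OS refuses to start the process and Go reports `E2BIG` ("argument list too long"). The helper name `argTooLong`, the use of `true` as a stand-in binary, and the 4 MiB size are all assumptions chosen for illustration; on typical Linux systems a single argument is capped at 128 KiB.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"syscall"
)

// argTooLong (hypothetical helper) reports whether starting `name` with a
// single argument of n bytes is rejected by the OS with E2BIG.
func argTooLong(name string, n int) bool {
	err := exec.Command(name, strings.Repeat("a", n)).Run()
	return errors.Is(err, syscall.E2BIG)
}

func main() {
	// `true` stands in for the real binary; only the argument size matters here.
	fmt.Println("16 B argument rejected: ", argTooLong("true", 16))
	fmt.Println("4 MiB argument rejected:", argTooLong("true", 4<<20))
}
```

The process never starts in the failing case, so no amount of handling inside the child program can work around it; the data has to travel by another channel.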

Code changes

  • Switch to STDIN for injecting the config during validation
  • Pass CLUSTER_NAME to kube2e test running
  • Adjusted instructions for running kube2e tests locally
  • Added make target for kind setup
  • Increased timeout of validation webhook in tests and added note to docs

Context

A customer with a large config reported the error.

Interesting decisions

Initially there were discussions about saving a file inside the container, but concerns about read-only root filesystems were raised. A volume would address that issue, but it is a more complex solution. To avoid the need for a volume in some cases, I opted to provide the config to the program via STDIN and to point the config argument at the STDIN file descriptor.

Testing steps

I've applied the large config YAML in the test with and without the fix, confirming that it was broken and is now fixed.

% make kind-setup
% helm upgrade --install -n gloo-system --create-namespace gloo ./_test/gloo-1.0.0-ci1.tgz --values ./test/kubernetes/e2e/tests/manifests/common-recommendations.yaml
% make -B kind-reload-gloo
% kubectl create namespace full-envoy-validation-test
% kubectl apply -f test/kubernetes/e2e/features/validation/testdata/valid-resources/large-configuration.yaml

Notes for reviewers

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

ryanrolds and others added 27 commits November 11, 2024 13:00
jenshu pushed a commit that referenced this pull request Nov 15, 2024
@ryanrolds
Author

I've made the changes that switch the full Envoy validation to passing the config in via STDIN and /dev/fd/0. This change appears to be working, but it has uncovered an issue with the configuration becoming large and taking minutes to validate.

Steps to reproduce the timeout I'm seeing:

% ./ci/kind/setup-kind.sh
% helm upgrade --install -n gloo-system --create-namespace gloo ./_test/gloo-1.0.0-ci1.tgz --values ./test/kubernetes/e2e/tests/manifests/full-envoy-validation-helm.yaml
% kubectl create namespace full-envoy-validation-test
% SKIP_INSTALL=true TEAR_DOWN=false go test -v ./test/kubernetes/e2e/tests -run ^TestFullEnvoyValidation/FullEnvoyValidation$

Running the test repeatedly causes the validated config to grow until validation takes longer than the timeout:

{"level":"info","ts":"2024-11-18T19:22:13.471Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 1340 size completed in 486.440292ms","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9a8c7be7-f5ef-4089-9b7f-2652039f6d04","PatchOperation":"CREATE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:22:13.924Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 1145 size completed in 452.091542ms","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9a8c7be7-f5ef-4089-9b7f-2652039f6d04","PatchOperation":"CREATE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
...
{"level":"info","ts":"2024-11-18T19:22:16.215Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 325160 size completed in 855.29125ms","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-2","UID":"2446a129-f09b-4074-8c67-08e687746c6f","PatchOperation":"CREATE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:22:17.063Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 326810 size completed in 844.33175ms","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-2","UID":"2446a129-f09b-4074-8c67-08e687746c6f","PatchOperation":"CREATE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:22:17.897Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 324965 size completed in 824.386709ms","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-2","UID":"2446a129-f09b-4074-8c67-08e687746c6f","PatchOperation":"CREATE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
...
{"level":"info","ts":"2024-11-18T19:22:58.233Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 20710952 size completed in 23.977979095s","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9af1e8ce-a9bf-4224-83e9-d36c370c2d48","PatchOperation":"DELETE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:23:22.758Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 20712602 size completed in 24.223770011s","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9af1e8ce-a9bf-4224-83e9-d36c370c2d48","PatchOperation":"DELETE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
...
{"level":"info","ts":"2024-11-18T19:24:36.227Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator.translator.compute_route_config.listener-::-8080-routes","caller":"runner/run.go:42","msg":"full envoy validation of 20710325 size completed in 24.106387803s","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9af1e8ce-a9bf-4224-83e9-d36c370c2d48","PatchOperation":"DELETE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:28:56.334Z","logger":"gloo.v1.event_loop.setup.gateway-validation-webhook.gateway-validator.proxy-validator","caller":"runner/run.go:42","msg":"full envoy validation of 227839355 size completed in 4m18.553049994s","version":"1.0.0-ci1","Kind":"gateway.solo.io/v1, Kind=VirtualService","Namespace":"full-envoy-validation-test","Name":"httpbin-1","UID":"9af1e8ce-a9bf-4224-83e9-d36c370c2d48","PatchOperation":"DELETE","UserInfo":{"username":"kubernetes-admin","groups":["kubeadm:cluster-admins","system:authenticated"]}}
{"level":"info","ts":"2024-11-18T19:29:01.345Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"runner/run.go:42","msg":"full envoy validation of 3595643 size completed in 4.904727085s","version":"1.0.0-ci1"}

It looks like in some cases the timeout is caused by a background process running during the tests that also triggers a full Envoy validation, which holds the validating webhook for too long. I'm also seeing OOMKilled after running the tests several times without starting the controller.

I have heap dumps from before and after running the tests if they would be useful. Here is a heap PNG output from after the tests have been run 2-3 times.

profile006

@ryanrolds
Author

This archive contains heap dumps taken w/o validation and w/ validation.

Archive.zip

@ryanrolds
Author

I'm handing this off to Nathan. I've taken it as far as I can: I don't know the internals of Gloo well enough to address the config growth issue, which is blocking the tests from passing.


github-actions bot commented Nov 25, 2024

Visit the preview URL for this PR (updated for commit f6fd30b):

https://gloo-edge--pr10296-rolds-envoy-large-va-dn2mqd5g.web.app

(expires Wed, 04 Dec 2024 14:43:02 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@nfuden nfuden mentioned this pull request Dec 2, 2024
4 tasks
@sam-heilbron

With #10417 merged, I think this PR can be closed. @ryanrolds

@nfuden
Collaborator

nfuden commented Dec 3, 2024

merged branch of this

@nfuden nfuden closed this Dec 3, 2024
3 participants