
Support zero downtime rollout restart of target deployments. #529

Open
dangarthwaite opened this issue Feb 14, 2023 · 1 comment
Labels
help wanted (Extra attention is needed)

Comments

@dangarthwaite

What happened?

Doing a rollout restart of the verify service results in a small window of downtime.

What did you expect to happen?

Rollout restarts of a targeted application should result in zero failed requests.

How'd it happen?

$ kubectl rollout restart -n ingress deploy/verify &&
    while sleep .25; do
      curl -sv https://verify.example.com/healthcheck 2>&1 | grep -E '^< '
    done
deployment.apps/verify restarted
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329254&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328954&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=3-t-KBAxkDVUDmG29m5yUifNAnatGs_qwvB5cmnl58w%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 6eda7b91-d88f-46e0-be2c-b265fcbebe88
<
< HTTP/2 404
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1414
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: b53bb012-9725-4174-b7a1-b566934fd3a4
<
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:55 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329255&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328955&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=NsZdyX7CDZJjKGpro7tebeQoGmrI2r53jZzSn_av2Dc%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 17ab805d-9fac-4eb5-8e3a-8608dc855d36

What's your environment like?

$ kubectl -n ingress get deploy/pomerium -o yaml | yq '.spec.template.spec.containers[0].image'
pomerium/ingress-controller:sha-cdc389c
$ kubectl get nodes -o yaml | yq '.items[-1] | .status.nodeInfo'
architecture: arm64
bootID: 247a7d1c-b579-4f89-b1b6-2b98883e4150
containerRuntimeVersion: docker://20.10.17
kernelVersion: 5.4.226-129.415.amzn2.aarch64
kubeProxyVersion: v1.21.14-eks-fb459a0
kubeletVersion: v1.21.14-eks-fb459a0
machineID: ec2d3bd9b9ea252972a242d7da68e233
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: ec2d3bd9-b9ea-2529-72a2-42d7da68e233

What's your config.yaml?

apiVersion: ingress.pomerium.io/v1
kind: Pomerium
metadata:
  name: global
spec:
  authenticate:
    url: https://sso.example.com
  certificates:
  - ingress/tls-wildcards
  identityProvider:
    provider: google
    secret: ingress/google-idp-creds
  secrets: ingress/bootstrap
status:
  ingress:
    ingress/verify:
      observedAt: "2023-02-13T22:55:54Z"
      observedGeneration: 2
      reconciled: true
    sandbox/example-ingress:
      observedAt: "2023-02-13T22:28:14Z"
      observedGeneration: 6
      reconciled: true
  settingsStatus:
    observedAt: "2023-02-10T15:33:57Z"
    observedGeneration: 5
    reconciled: true
    warnings:
    - 'storage: please specify a persistent storage backend, please see https://www.pomerium.com/docs/topics/data-storage#persistence'

What did you see in the logs?

{
  "level": "info",
  "service": "envoy",
  "upstream-cluster": "",
  "method": "GET",
  "authority": "verify.ops.bereal.me",
  "path": "/healthcheck",
  "user-agent": "curl/7.81.0",
  "referer": "",
  "forwarded-for": "71.254.0.45,10.123.60.7",
  "request-id": "fb2011be-f4d0-47bd-9094-8ad12f583009",
  "duration": 14.398539,
  "size": 1414,
  "response-code": 404,
  "response-code-details": "ext_authz_denied",
  "time": "2023-02-14T01:34:20Z",
  "message": "http-request"
}

@wasaga
Collaborator

wasaga commented Feb 14, 2023

Currently Pomerium uses the Service Endpoints object, which is only updated after Pods are terminated or new ones become Ready. That update takes a moment to propagate, which is the root cause of the downtime window.
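
You can watch that lag directly; a quick sketch (assuming the Service backing this route is also named verify in the ingress namespace) is to stream the Endpoints object while the rollout runs:

# Watch the Endpoints address list during the restart; the brief window where
# it lags behind the actual pod state is when the failed requests appear.
kubectl -n ingress get endpoints verify -o wide -w &
kubectl -n ingress rollout restart deploy/verify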

One option available today to avoid the short downtime window is to use the Kubernetes service proxy instead, see https://www.pomerium.com/docs/deploying/k8s/ingress#service-proxy
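
For example, something along these lines should switch that route to the service proxy (a sketch only; I'm going from memory on the annotation name, so please confirm it against the service-proxy docs linked above). The ingress/verify name is taken from the status block of your Pomerium CR:

# Assumed annotation name -- double-check it in the linked service-proxy docs.
kubectl -n ingress annotate ingress/verify \
  ingress.pomerium.io/service_proxy_upstream=true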

In the long term, we should probably switch to the newer EndpointSlice object, which takes pod conditions into consideration: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#conditions
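
The slices already expose per-endpoint conditions (ready / serving / terminating), which you can inspect today; for example (again assuming the backing Service is named verify):

# List the endpoints, with their conditions, for the slices backing the Service.
kubectl -n ingress get endpointslices -l kubernetes.io/service-name=verify -o yaml \
  | yq '.items[].endpoints'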

@wasaga added the help wanted label Apr 24, 2023