ArgoCD stops syncing/stuck refreshing #18011

Closed
xamroc opened this issue Apr 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@xamroc

xamroc commented Apr 28, 2024

Checklist:

  • [-] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [-] I've included steps to reproduce the bug.
  • [-] I've pasted the output of argocd version.

Describe the bug

Hi All,

We're facing the same issue described in #11458. I opened a new issue as requested by @jsoref.

To debug this, we ran an experimental ArgoCD with:

  1. a single controller
  2. three repo-servers
  3. Helm manifests pulled from GitHub

From time to time, ArgoCD stops syncing (refreshes loop indefinitely) for a couple of hours at a time. Observations we've made:

  1. ArgoCD recovers on its own after being stuck for a couple of hours.
  2. The argocd_repo_pending_request_total metric shows a higher value during these periods for a single repo-server.
  3. The repo-server with the high pending request count has stuck processes (refer to Logs).
  4. Those processes usually disappear from the ps output quickly, but they linger while the pending request count is high.
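
Observation 2 can be checked by comparing the metric across repo-server pods. The following is a minimal sketch, assuming the metrics text of each pod has already been scraped; the pod names and sample values are made up for illustration, and the code only parses the Prometheus text format rather than calling any ArgoCD API.

```python
import re

# One sample line of the metric in Prometheus text format, for example:
#   argocd_repo_pending_request_total{repo="..."} 12
# Labels and a trailing timestamp are optional.
_SAMPLE = re.compile(
    r'^argocd_repo_pending_request_total(?:\{[^}]*\})?\s+([0-9.eE+-]+)(?:\s+-?\d+)?$'
)


def pending_requests(metrics_text: str) -> float:
    """Sum every sample of argocd_repo_pending_request_total in one scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match:
            total += float(match.group(1))
    return total


def busiest_repo_server(scrapes: dict[str, str]) -> tuple[str, float]:
    """Return (pod name, pending total) for the pod with the most pending requests."""
    totals = {pod: pending_requests(text) for pod, text in scrapes.items()}
    pod = max(totals, key=totals.get)
    return pod, totals[pod]


if __name__ == "__main__":
    # Hypothetical scrape results keyed by pod name (illustrative values only).
    example = {
        "repo-server-a": 'argocd_repo_pending_request_total{repo="r1"} 1\n',
        "repo-server-b": 'argocd_repo_pending_request_total{repo="r1"} 25\n',
    }
    print(busiest_repo_server(example))
```

A single pod standing far above the others would match what we saw; roughly equal totals would point elsewhere.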

To Reproduce

  1. Deploy an AWS EKS cluster
  2. Install the Helm chart https://artifacthub.io/packages/helm/argo/argo-cd/6.7.14
  3. Create applications that source Helm manifests

Expected behavior

ArgoCD should not get stuck refreshing.

Screenshots
image

Version

  argocd: v2.10.7+b060053
  BuildDate: 2024-04-15T08:45:08Z
  GitCommit: b060053b099b4c81c1e635839a309c9c8c1863e9
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64

Logs

No logs, but these processes get stuck while the pending request count is high.

PID   USER     TIME  COMMAND
170709 999       0:00 /bin/sh -c ssh -i /dev/shm/xxx -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts "$@" ssh -i /dev/shm/xxx -o StrictHostKeyChecking=yes -o UserKnownHostsFile=
170710 999       0:00 ssh -i /dev/shm/22900259 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts -o SendEnv=GIT_PROTOCOL [email protected] git-upload-pack '<OUR_REPO>.git'
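
One way to tell lingering fetches apart from ones that are merely in flight is to take two ps snapshots some time apart and keep the PIDs that appear in both. This is a rough sketch, assuming ps output in the PID/USER/TIME/COMMAND layout shown above; the function names are hypothetical, and a reused PID could produce a false match.

```python
def parse_ps(ps_output: str) -> dict[int, str]:
    """Map PID to command for each row of ps output in PID/USER/TIME/COMMAND layout."""
    procs = {}
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 3)
        # Skip the header row, blank lines, and rows without a numeric PID.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        procs[int(fields[0])] = fields[3]
    return procs


def lingering_fetches(earlier: str, later: str) -> dict[int, str]:
    """Return git-upload-pack processes whose PID appears in both snapshots."""
    before, after = parse_ps(earlier), parse_ps(later)
    return {
        pid: cmd
        for pid, cmd in after.items()
        if pid in before and "git-upload-pack" in cmd
    }
```

Snapshots taken a few minutes apart inside the repo-server container should be enough, since a healthy fetch normally finishes well before the second one.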

@xamroc added the bug label Apr 28, 2024

@haooliveira84

I see the same problem here with v2.11.0!
However, if I set dynamicClusterDistribution to false, the problem is resolved.

@jsoref
Member

jsoref commented May 16, 2024

Tagging @ishitasequeira from #15036.

@xamroc
Author

xamroc commented May 17, 2024

We've identified that this is not an ArgoCD issue.

In our case, we switched to AWS VPC CNI network policies: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html

Disabling the network policies feature fixed it.

If you have the same setup, do not simply disable it! We've discovered that we cannot simply roll back, as that causes the network pods to crash. Disabling it requires rolling your cluster's nodes.

We don't know the root cause, but the network policies feature also caused a number of connectivity issues for other apps of ours. I'm closing this since it is outside ArgoCD's scope.

@xamroc closed this as completed May 17, 2024

@jsoref
Member

jsoref commented May 17, 2024

thanks @xamroc

@haooliveira84: if you aren't using AWS VPC CNI, please open a new ticket w/ your details.
