[SURE-9061] Jobs are not cleaned up from local cluster #2870

Open
1 task done
mikmatko opened this issue Sep 18, 2024 · 3 comments

@mikmatko

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In the Rancher local cluster, Fleet starts a Job for each commit/change in each GitRepo. Nothing cleans up these Jobs, so you quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I assumed this was related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
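To illustrate the TTL suggestion, here is a minimal sketch that sets .spec.ttlSecondsAfterFinished on a batch/v1 Job manifest. The helper name with_ttl, the 300-second default, and the example Job (name, namespace, image) are all made up for illustration; this is not what Fleet actually creates.

```python
import copy
import json


def with_ttl(job_manifest: dict, ttl_seconds: int = 300) -> dict:
    """Return a copy of a batch/v1 Job manifest with a TTL set.

    Once the Job finishes, the Kubernetes TTL-after-finished controller
    deletes it (and its Pods) after ttl_seconds.
    """
    patched = copy.deepcopy(job_manifest)
    patched.setdefault("spec", {})["ttlSecondsAfterFinished"] = ttl_seconds
    return patched


# Hypothetical Job, for illustration only.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-gitjob", "namespace": "fleet-local"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "main", "image": "busybox", "command": ["true"]}
                ],
            }
        }
    },
}

# kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(with_ttl(example_job), indent=2))
```

Note that the TTL applies to failed Jobs as well as successful ones, so a short TTL would also remove Jobs you might still want to inspect.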

Steps To Reproduce

  1. Install Rancher & Fleet
  2. Add any GitRepo and make sure it deploys
  3. Check the rancher-local cluster. You now have lingering Job objects.

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

@kkaempf kkaempf added this to the v2.9.3 milestone Sep 18, 2024
@0xavi0 0xavi0 self-assigned this Sep 18, 2024
@manno
Member

manno commented Sep 23, 2024

This can help to identify completed jobs that have not been cleaned up:

kubectl get jobs --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[].name,STATUS:.status.succeeded'
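If you only want the Jobs that actually completed, the same information can be filtered from the JSON output. This is a rough sketch, assuming the list shape returned by `kubectl get jobs --all-namespaces -o json` (a top-level `items` array); the function name completed_jobs and the sample data are made up for illustration.

```python
def completed_jobs(job_list: dict) -> list:
    """Return (namespace, name, owner) for every Job with at least one succeeded Pod."""
    rows = []
    for job in job_list.get("items", []):
        meta = job.get("metadata", {})
        # .status.succeeded is absent until a Pod of the Job has succeeded.
        succeeded = job.get("status", {}).get("succeeded") or 0
        if succeeded < 1:
            continue
        # Report the first owner reference, or a placeholder if the Job has none.
        owners = meta.get("ownerReferences") or [{}]
        rows.append(
            (
                meta.get("namespace", ""),
                meta.get("name", ""),
                owners[0].get("name", "<none>"),
            )
        )
    return rows


# Hypothetical sample, shaped like `kubectl get jobs -A -o json` output.
sample = {
    "items": [
        {
            "metadata": {
                "namespace": "fleet-local",
                "name": "repo-a-1234",
                "ownerReferences": [{"name": "repo-a"}],
            },
            "status": {"succeeded": 1},
        },
        {
            "metadata": {"namespace": "fleet-local", "name": "repo-b-5678"},
            "status": {"failed": 1},
        },
    ]
}

for row in completed_jobs(sample):
    print("\t".join(row))
```

Against a real cluster you would pass in the parsed kubectl output instead of the sample, for example `json.load(sys.stdin)` with the kubectl command piped into the script.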

@kkaempf kkaempf changed the title Jobs are not cleaned up from local cluster [SURE-9061] Jobs are not cleaned up from local cluster Sep 24, 2024
@kkaempf kkaempf added the JIRA Must shout label Sep 24, 2024
@manno
Member

manno commented Sep 30, 2024

@0xavi0
Contributor

0xavi0 commented Oct 1, 2024

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos.
We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

  • Fleet will create a new job when one is needed and will delete it after it succeeds.
  • In case of error the job won't be deleted (so we can describe the job, check the logs, etc.).
  • If a job has not finished and the user changes the Spec, force updates, or a new commit is received, the running job will be deleted and a new one will be created.
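The rules above can be read as a small decision table. The sketch below is only an illustration of that reading, not Fleet's actual controller code; the names JobState, Action and next_action are made up. One case is not covered by the rules: what happens when a change arrives while a failed job is still around. The sketch assumes the failed job gets replaced.

```python
from enum import Enum


class JobState(Enum):
    NONE = "none"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Action(Enum):
    NOTHING = "nothing"
    CREATE = "create"
    DELETE = "delete"
    DELETE_AND_RECREATE = "delete and recreate"


def next_action(state: JobState, needs_new_job: bool) -> Action:
    """Decide what to do with the job of a GitRepo.

    needs_new_job is True when the Spec changed, a force update was
    requested, or a new commit was received.
    """
    if state is JobState.NONE:
        return Action.CREATE if needs_new_job else Action.NOTHING
    if state is JobState.SUCCEEDED:
        # Successful jobs are always cleaned up.
        return Action.DELETE_AND_RECREATE if needs_new_job else Action.DELETE
    if state is JobState.RUNNING:
        # A running job is only replaced when something changed.
        return Action.DELETE_AND_RECREATE if needs_new_job else Action.NOTHING
    # FAILED: kept so it can be described and its logs checked.
    # ASSUMPTION: a pending change replaces the failed job.
    return Action.DELETE_AND_RECREATE if needs_new_job else Action.NOTHING


print(next_action(JobState.SUCCEEDED, needs_new_job=False).value)
```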

Testing

Test a few scenarios to cover all the possible cases:

  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.
  • Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created
  • Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

In any test, the job should only stay if it is not successful, otherwise it should be deleted.

Projects
Status: Needs QA review
4 participants