[SURE-9061] Jobs are not cleaned up from local cluster #2870

mikmatko · 2024-09-18T08:23:12Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

In the Rancher local cluster, Fleet starts a Job for each commit/change in each GitRepo. Nothing cleans up these Jobs, so you quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I assumed this was related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
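
To make the suggestion concrete, here is a minimal sketch of a Job manifest that sets .spec.ttlSecondsAfterFinished, built as a Python dict and printed as JSON. The Job name, image, and the 300-second value are assumptions chosen for illustration; this is not the manifest Fleet generates.

```python
import json


def job_with_ttl(name: str, image: str, ttl_seconds: int) -> dict:
    """Build a batch/v1 Job manifest that Kubernetes removes after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Seconds to keep the Job and its Pods once it has finished.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical name, image, and TTL, used only for illustration.
manifest = job_with_ttl("example-gitjob", "busybox", 300)
print(json.dumps(manifest, indent=2))
```

Note that the TTL applies to Jobs that finish as either Complete or Failed, so a default like this would also remove failed Jobs once the TTL expires.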

Steps To Reproduce

1. Install Rancher & Fleet.
2. Add any GitRepo and make sure it deploys.
3. Check the rancher-local cluster. You now have lingering Job objects.

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

manno · 2024-09-23T16:12:53Z

This can help to identify completed jobs that have not been cleaned up:

kubectl get jobs --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[].name,STATUS:.status.succeeded'
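
The same check can be scripted against the JSON form of the Job list. The sketch below assumes the input is the parsed output of `kubectl get jobs --all-namespaces -o json`; the sample data, including the namespace and Job names, is made up for illustration.

```python
from __future__ import annotations


def completed_jobs(job_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, name, owner) for every Job with a succeeded Pod."""
    rows = []
    for job in job_list.get("items", []):
        meta = job.get("metadata", {})
        # status.succeeded is absent until at least one Pod has succeeded.
        succeeded = job.get("status", {}).get("succeeded") or 0
        if succeeded < 1:
            continue
        owners = meta.get("ownerReferences") or [{}]
        rows.append(
            (
                meta.get("namespace", ""),
                meta.get("name", ""),
                owners[0].get("name", "<none>"),
            )
        )
    return rows


# Made-up sample shaped like a Kubernetes Job list.
sample = {
    "items": [
        {
            "metadata": {
                "namespace": "fleet-local",
                "name": "repo-a-1a2b3",
                "ownerReferences": [{"name": "repo-a"}],
            },
            "status": {"succeeded": 1},
        },
        {
            "metadata": {"namespace": "fleet-local", "name": "repo-b-4c5d6"},
            "status": {"active": 1},
        },
    ]
}

for row in completed_jobs(sample):
    print(*row)
```

Only the first owner reference is reported, which mirrors the OWNER column of the command above.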

manno · 2024-09-30T12:39:09Z

- We need to turn "Cleanup completed gitjobs" #2907 into a run-once job when "Changes job handling in gitops controller" #2903 is merged.
- Make sure it actually runs when upgrading 0.10.2 -> 0.10.4.
- Fix the service account used.

0xavi0 · 2024-10-01T08:20:34Z

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos.
We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

Fleet will create a new job when it is needed and will delete it after it succeeds.
In case of an error, the job won't be deleted (so we can describe the job, check the logs, etc.).
If a job has not finished and the user changes the Spec, force updates, or a new commit is received, the running job will be deleted and a new one will be created.
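
The three rules above can be sketched as a small decision function. This is an illustrative model, not the controller code: the JobState and Action names are hypothetical, and `change_requested` stands for any of a Spec change, a force update, or a new commit.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Action(Enum):
    WAIT = "wait"        # job still running, nothing changed
    DELETE = "delete"    # job succeeded, clean it up
    KEEP = "keep"        # job failed, keep it for debugging
    REPLACE = "replace"  # delete the running job and create a new one


def next_action(state: JobState, change_requested: bool) -> Action:
    """Map a job's state and a pending change to the cleanup rule described above."""
    if state is JobState.SUCCEEDED:
        return Action.DELETE
    if state is JobState.FAILED:
        # Assumption: the thread does not say what happens when a change
        # arrives after a failure, so this sketch simply keeps the job.
        return Action.KEEP
    return Action.REPLACE if change_requested else Action.WAIT


print(next_action(JobState.RUNNING, change_requested=True).value)
```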

Testing

Test a few scenarios to cover all the possible cases:

- Apply a GitRepo that is successful. Check that the job is created, then deleted when it succeeds.
- Apply a GitRepo that is successful. Check that the job is created, then deleted when it succeeds. Then update the Commit and check that another job is created and deleted after it succeeds.
- Apply a GitRepo that is successful. Check that the job is created, then deleted when it succeeds. Then Force Update and check that another job is created and deleted after it succeeds.
- Apply a GitRepo that is successful. Check that the job is created, then deleted when it succeeds. Then change the Spec of the GitRepo (for example, change the path) and check that another job is created and deleted after it succeeds.
- Apply a GitRepo that is not successful (for example, a bad path, a bad Git URL, or anything else that makes the job fail). Check that the job is not deleted and that we can see the error in the logs.
- Apply a GitRepo that creates a slow job, so we have time to Force Update before it finishes. Check that the job is deleted and re-created.
- Apply a GitRepo that creates a slow job, so we have time to change the Spec (for example, the path) before it finishes. Check that the job is deleted and re-created.

In every test, the job should remain only if it was not successful; otherwise it should be deleted.

mikmatko added the kind/bug label Sep 18, 2024

kkaempf added the kind/regression label Sep 18, 2024

kkaempf added this to the v2.9.3 milestone Sep 18, 2024

0xavi0 self-assigned this Sep 18, 2024

kkaempf changed the title ~~Jobs are not cleaned up from local cluster~~ [SURE-9061] Jobs are not cleaned up from local cluster Sep 24, 2024

kkaempf added the JIRA Must shout label Sep 24, 2024

0xavi0 mentioned this issue Sep 26, 2024

Changes job handling in gitops controller #2903

Merged

This was referenced Sep 27, 2024

Cleanup completed gitjobs #2907

Merged

[v0.10] Backport of [SURE-9061] Jobs are not cleaned up from local cluster #2909

Closed

manno mentioned this issue Oct 1, 2024

Cleanup completed gitjobs runs once when upgrading Fleet #2921

Draft

0xavi0 modified the milestones: v2.9.3, v2.10.0 Oct 2, 2024

0xavi0 mentioned this issue Oct 3, 2024

Converts the delete gitjobs to one-time job #2928

Open
