Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduled rollout #727

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

Athosone
Copy link

@Athosone Athosone commented Mar 8, 2022

Feature description

This feature allows you to plan a deployment at a given period.

A period is composed of a cron and a duration.

Both of these properties represents a window during which the application can be deployed.

The schedule can be applied to either the cluster itself and/or the bundle.

If schedules are set on both the cluster and the bundle then the bundle schedule takes priority over the cluster schedule.

We chose to prioritise the bundle over the cluster as we think that the user may want to specify exceptions to the global rule of a cluster schedule.

Usage Example

fleet.yaml

defaultNamespace: default
namespace: default

schedule:
  cron: "0 16 * * *"
  duration: "1h"

cluster

apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: c-kd88w
  namespace: fleet-default
spec:
  paused: false
  deploymentSchedule:
    cron: "0 16 * * *"
    duration: "1h"

Motivation for this PR

We noticed there was an existing PR #450 for this feature related to #383, but saw that there was no activity on it.

Also there were two implementation details that were not right in our opinion:

Athosone and others added 3 commits March 8, 2022 10:31
Co-authored-by: Christian Artin <[email protected]>
Co-authored-by: Christian Artin <[email protected]>
Co-authored-by: Christian Artin <[email protected]>
@gravufo
Copy link

gravufo commented Mar 8, 2022

@ibrokethecloud @nickgerace FYI, tagging you because you were involved in the previous PR.

Co-authored-by: Christian Artin <[email protected]>
@Athosone Athosone force-pushed the scheduled-rollout branch from f659912 to 86d28e7 Compare March 8, 2022 20:09
@richard-cox
Copy link
Member

I've raised this with the Fleet team, they're aware and will start working through community PRs soon.

@SheilaghM
Copy link

Tagging @ibrokethecloud - will you please review this as it brings up a two instances not covered in yours and advise whether yours should be closed in favor of this one?

@gravufo
Copy link

gravufo commented May 6, 2022

@SheilaghM Any news? We are now in May and we still have no feedback after 2 months.

@ibrokethecloud
Copy link
Contributor

Hi @Athosone thanks a lot for your PR.
There are few minor tweaks needed which should make the agent more efficient in scheduling the changes.

@SheilaghM can we please have some clarity on whether cluster scoped schedule supersedes the bundle scoped one or are we happy to have the bundle schedule have priority.

@Athosone
Copy link
Author

Athosone commented May 9, 2022

Hi :)!

Sure no problem let me know I'll fix it !

@Athosone
Copy link
Author

Athosone commented May 16, 2022

Hey :)!

Did you have the time to think about this PR?

@ibrokethecloud
@SheilaghM

@SheilaghM
Copy link

@Athosone - We are still discussing the Fleet use cases with Product Management. We will act on this one way or another as soon as we have clarity.

@@ -53,6 +53,17 @@ defaultNamespace: default
# Default: ""
namespace: default

# Specify a deployment schedule to deploy bundle during that window.
# In this example we will deploy bundles every monday from 4pm to 5pm.
# If a helm timeout is specified in the helm structure below, it will be considered in the schedule evaluation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with the helm code. I'm a bit concerned this will not be obvious to users, maybe we need to log that the "job will not be executed, because its timeout is too large for the schedule window".

But is it really necessary to take the helm timeout into account? As far as I understand helm tasks are never cancelled by, so we rely on the helm library to honor the timeout value. If timeout is 0, wait is not used, are we certain helm will finish immediately in that case? I feel like it must be possible to make it take very long, by using hooks or lookups maybe?
In any case the actual helm deployment going on in the cluster might take longer. So even when you take the tiemout into consideration, there is no guarantee the cluster is in a ready state after the schedule window?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi !

The way we approached it is, if, as a user I don't specify a timeout, it means I do not care how long the installation take (and helm default it to 5 min).
It could also means that you know that there is no hook/lookups and it will be installed instantly. It would not be accurate to reserve a five minutes block for that case.

On the other hand, a user that wants to have some kind of mechanism to prevent an installation from taking too much time, could specify a timeout in order for helm to cancel the installation (and maybe rollback if atomic is specified).

As you said, we cannot guarantee the cluster will be in a ready state after the schedule window, it is a best effort way of doing it.

I tested it with a simple chart including hooks. The post-install hook includes a job that sleep for 305 seconds.

I launched the install by using:

helm install httpbin .

The result is:

Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition

the job:

apiVersion: batch/v1
kind: Job
metadata:
  name: sleep
  annotations:
    helm.sh/hook: post-install
spec:
  template:
    spec:
      containers:
      - name: sleep
        image: busybox:1.28
        command: ["sh",  "-c", "echo Hello ! && sleep 350"]
      restartPolicy: Never
  backoffLimit: 4

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, okay. Best effort is fine with me.

However, I was looking at the helm code:

u.Timeout = timeout
u.DryRun = dryRun
u.PostRenderer = pr
if u.Timeout > 0 {
u.Wait = true
}
and I think the timeout defaults to 0 seconds. The helm package then uses that timeout only to wait for hooks and interprets 0 as forever (ContextWithOptionalTimeout).

So, if you tested your hook chart without a timeout, I think it would run for 350s.

Copy link
Author

@Athosone Athosone Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I see what you mean, specifying --timeout=0 run until it completes. It looks like fleet is setting 0 implicitly instead of the default 300 sec of helm, thus it will run forever and ever if the user do not specify a timeout

@ibrokethecloud ibrokethecloud removed their request for review June 7, 2023 23:08
@CiraciNicolo
Copy link

CiraciNicolo commented May 2, 2024

Any news on this?

@Athosone
Copy link
Author

Athosone commented May 2, 2024

We could rebase it and try to push it again if the fleet team is still interested in the contrib

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants