start: Delay pull secret on disk check to end #4776
base: main
Conversation
Reviewer's Guide (Sourcery)
Defers the time-consuming pull secret disk check by moving its invocation from immediately after the SSH key update to the end of the Start workflow, allowing other initialization steps to run first and reducing overall startup blocking time.

Sequence diagram for the updated Start process with delayed pull secret check:

```mermaid
sequenceDiagram
    actor User
    participant client as client.Start()
    participant cluster as Cluster Operations
    participant sshRunner as SSH Runner
    User->>client: Initiate Start(startConfig)
    activate client
    client->>cluster: UpdateHostMCDToken(...)
    activate cluster
    cluster-->>client: Token Updated
    deactivate cluster
    client->>cluster: AddSSHKeyToMachine(...)
    activate cluster
    cluster-->>client: SSH Key Added
    deactivate cluster
    client->>cluster: UpdateUserPasswords(...)
    activate cluster
    cluster-->>client: Passwords Updated
    deactivate cluster
    client->>cluster: EnsurePersistentVolume(...)
    activate cluster
    cluster-->>client: Volume Ensured
    deactivate cluster
    client->>cluster: WaitForProxyConfig(...)
    activate cluster
    cluster-->>client: Proxy Config Ready
    deactivate cluster
    client->>cluster: WaitForClusterToBeReachable(...)
    activate cluster
    cluster-->>client: Cluster Reachable
    deactivate cluster
    client->>cluster: WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner)
    activate cluster
    cluster->>sshRunner: Check pull secret on disk
    activate sshRunner
    sshRunner-->>cluster: Disk check status
    deactivate sshRunner
    cluster-->>client: Pull Secret Present
    deactivate cluster
    client->>client: waitForProxyPropagation(...)
    client-->>User: Start Process Complete
    deactivate client
```
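The reordered flow in the diagram boils down to an ordered list of phases with the pull secret check moved last. A minimal sketch under assumptions — the names below are simplified stand-ins for the real cluster calls, which take contexts, configs, and runners:

```go
package main

import "fmt"

// startSequence returns the order of the (simplified) Start() phases after
// this PR: the pull secret disk check has moved from right after the SSH key
// update to the very end. Names are illustrative, not the project's API.
func startSequence() []string {
	return []string{
		"UpdateHostMCDToken",
		"AddSSHKeyToMachine",
		"UpdateUserPasswords",
		"EnsurePersistentVolume",
		"WaitForProxyConfig",
		"WaitForClusterToBeReachable",
		"WaitForPullSecretPresentOnInstanceDisk", // moved here by this PR
	}
}

func main() {
	for i, name := range startSequence() {
		fmt.Printf("%d. %s\n", i+1, name)
	}
}
```

While the other phases run, the Machine Config Operator has time to propagate the pull secret, so the final check usually succeeds immediately instead of blocking for ~1 minute up front.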
Class diagram for the modified client type:

```mermaid
classDiagram
class client {
  +Start(Context, StartConfig)
}
```
/retest
Hey @praveenkumar - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
```go
if err := cluster.WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner); err != nil {
	return nil, errors.Wrap(err, "Failed to update pull secret on the disk")
}

if err := cluster.UpdateUserPasswords(ctx, ocConfig, startConfig.KubeAdminPassword, startConfig.DeveloperPassword); err != nil {
	return nil, errors.Wrap(err, "Failed to update kubeadmin user password")
}
```
suggestion: Refine the error message to reflect the waiting operation. The error message should indicate a failure in checking for the pull secret's presence, not updating it. Suggested wording: `Failed initial pull secret presence check`.
Force-pushed from 0423b15 to 1e97ca1 (Compare)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: anjannath. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing a comment.
```diff
@@ -614,6 +610,10 @@ func (client *client) Start(ctx context.Context, startConfig types.StartConfig)
 		logging.Warnf("Cluster is not ready: %v", err)
 	}
 
+	if err := cluster.WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner); err != nil {
```
Can `cluster.StartMonitoring()` and `cluster.WaitForClusterStable` succeed if the pull secret is not yet written to disk?
We don't cache monitoring-related images in the bundle, so when a user enables the monitoring operator they have to wait for the images, which are downloaded using the pull secret. In our case we don't dictate when the secret becomes available on disk, since the MCO is responsible for that. If the monitoring operator starts before the pull secret is on disk, its image pulls will fail and be retried, but they eventually succeed once the secret lands and the images are pulled.
@cfergeau any thoughts on that?
If there is some code which needs the pull secret to be on disk to work, and we have a "wait until it works" function for this code, then I'd prefer if `cluster.WaitForPullSecretPresentOnInstanceDisk` was called before it. Otherwise the "wait until it works" function will also act as a hidden `cluster.WaitForPullSecretPresentOnInstanceDisk`, and `cluster.WaitForPullSecretPresentOnInstanceDisk` will almost be a no-op.
@cfergeau We don't have any code that directly depends on the pull secret being present on disk. Instead, we're executing a set of `oc` commands, and some actions (like monitoring or marketplace) involve pulling container images. These operations require the pull secret, and if it's not available they can result in image pull errors. However, these errors are non-blocking and are retried, so they don't cause a complete failure.

Our current goal in this PR is to optimize startup time by checking for the presence of the pull secret on disk only at the end of the process. In fact, we could go a step further and consider removing this check entirely: once the pull secret is patched, it's the responsibility of the Machine Config Operator (MCO) to propagate it to the disk, so our check is effectively a redundant safeguard as of now.
In the current scenario the MCO is patched with the user-provided pull secret, and immediately afterwards we check whether the pull secret is on disk, which takes ~1 min while the MCD makes that change and reports back to the MCO. In this PR we delay this pull secret check to the end: instead of blocking for ~1 min, we execute the other parts first and only at the end check that the pull secret is on the disk.

With this PR (crc start time, 6 runs):

```
real 4m9.247s    user 0m0.557s   sys 0m0.165s
real 4m0.455s    user 0m0.619s   sys 0m0.168s
real 4m5.962s    user 0m0.445s   sys 0m0.154s
real 3m59.594s   user 0m0.661s   sys 0m0.179s
real 4m3.958s    user 0m0.563s   sys 0m0.177s
real 4m28.806s   user 0m0.460s   sys 0m0.171s
```

Without this PR:

```
real 5m7.235s    user 0m0.797s   sys 0m0.181s
real 4m28.741s   user 0m0.891s   sys 0m0.195s
real 6m6.815s    user 0m0.747s   sys 0m0.194s
real 5m1.733s    user 0m0.395s   sys 0m0.199s
real 4m30.551s   user 0m0.466s   sys 0m0.173s
real 4m31.067s   user 0m0.673s   sys 0m0.183s
```
Force-pushed from 1e97ca1 to 24605d0 (Compare)
New changes are detected. LGTM label has been removed.
During CI failures it was sometimes observed that the following error occurs when checking that the pull secret is present on the disk. This PR makes sure SSH connectivity is established before checking for the pull secret.

```
DEBU SSH command results: err: ssh: unexpected packet in response to channel open: <nil>, output:
DEBU error: Temporary error: ssh command error:
```
/retest
@coderabbitai review
✅ Actions performed: Review triggered.
Walkthrough: A connectivity check was added to the process of waiting for the pull secret file on the instance disk, ensuring SSH access before each retry. Additionally, the sequence of operations was adjusted so that waiting for the pull secret occurs after the cluster stabilization step rather than before it.
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/crc/cluster/cluster.go (lines 450-452): Consider making the connectivity check retryable. The connectivity check is a good defensive measure, but returning immediately on connectivity failure may be too strict. If SSH connectivity is temporarily lost (network hiccup, VM resource contention), the entire function fails without retry. Consider wrapping the connectivity check in a `RetriableError` to allow the retry mechanism to handle temporary connectivity issues:

```diff
 if err := sshRunner.WaitForConnectivity(ctx, 30*time.Second); err != nil {
-	return err
+	return &errors.RetriableError{Err: err}
 }
```
📜 Review details

Configuration used: CodeRabbit UI. Review profile: CHILL. Plan: Pro.

📒 Files selected for processing (2):
- pkg/crc/cluster/cluster.go (1 hunks)
- pkg/crc/machine/start.go (1 hunks)
🔇 Additional comments (1)

pkg/crc/machine/start.go (lines 613-616): LGTM! Sequence optimization improves start time. Moving the pull secret disk check to after cluster stabilization is a good optimization: it allows other startup operations to proceed while the Machine Config Operator propagates the pull secret, reducing overall blocking time as described in the PR objectives.