start: Delay pull secret on disk check to end #4776

Open
praveenkumar wants to merge 2 commits into main from delay_pull_secret_check

Conversation

praveenkumar
Member

@praveenkumar praveenkumar commented May 30, 2025

Currently, the MCO is patched with the user-provided pull secret and, immediately afterwards, we check whether the pull secret is present on disk. That check takes ~1 min, because the MCD has to make the change and then report it to the MCO. This PR delays the pull secret check to the end of the start sequence: instead of blocking for ~1 min, it is better to run the remaining steps first and check at the end whether the pull secret is on the disk image.

With this PR (crc start time, 6 runs)

real	4m9.247s
user	0m0.557s
sys	0m0.165s

real	4m0.455s
user	0m0.619s
sys	0m0.168s

real	4m5.962s
user	0m0.445s
sys	0m0.154s

real	3m59.594s
user	0m0.661s
sys	0m0.179s

real	4m3.958s
user	0m0.563s
sys	0m0.177s

real	4m28.806s
user	0m0.460s
sys	0m0.171s

Without this PR

real	5m7.235s
user	0m0.797s
sys	0m0.181s

real	4m28.741s
user	0m0.891s
sys	0m0.195s

real	6m6.815s
user	0m0.747s
sys	0m0.194s

real	5m1.733s
user	0m0.395s
sys	0m0.199s

real	4m30.551s
user	0m0.466s
sys	0m0.173s

real	4m31.067s
user	0m0.673s
sys	0m0.183s

Description

Fixes: #N

Relates to: #N, PR #N, ...

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Proposed changes

Testing

Contribution Checklist

  • I Keep It Small and Simple: The smaller the PR is, the easier it is to review and have it merged
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Which platform have you tested the code changes on?
    • Linux
    • Windows
    • MacOS

Summary by Sourcery

Enhancements:

  • Move pull secret disk presence verification to the end of the start process to prevent early blocking and improve performance

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability when accessing the pull secret by ensuring connectivity checks are performed before each attempt.
    • Adjusted the sequence of operations during cluster startup to wait for cluster stabilization before verifying the presence of the pull secret.

sourcery-ai bot commented May 30, 2025

Reviewer's Guide

Defers the time-consuming pull secret disk check by moving its invocation from immediately after SSH key update to the end of the Start workflow, thus allowing other initialization steps to run first and reducing overall startup blocking time.
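
Why the reordering helps can be shown with a small simulation. This is a sketch under stated assumptions, not the code in `start.go`: `startPropagation`, `checkEarly`, and `checkLate` are hypothetical names, the propagation is modelled as a background goroutine, and the durations are scaled down to milliseconds.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative durations only; the PR reports ~1 min for propagation.
const (
	propagation = 100 * time.Millisecond // stand-in for MCO/MCD writing the pull secret to disk
	otherSteps  = 100 * time.Millisecond // stand-in for the remaining start steps
)

// startPropagation simulates the background propagation and returns a
// channel that is closed once the pull secret is "on disk".
func startPropagation() <-chan struct{} {
	done := make(chan struct{})
	go func() {
		time.Sleep(propagation)
		close(done)
	}()
	return done
}

// checkEarly blocks on the pull secret first, then runs the other steps.
func checkEarly() time.Duration {
	begin := time.Now()
	onDisk := startPropagation()
	<-onDisk
	time.Sleep(otherSteps)
	return time.Since(begin)
}

// checkLate runs the other steps first and waits for the pull secret last.
func checkLate() time.Duration {
	begin := time.Now()
	onDisk := startPropagation()
	time.Sleep(otherSteps)
	<-onDisk
	return time.Since(begin)
}

func main() {
	fmt.Println("check early:", checkEarly().Round(10*time.Millisecond))
	fmt.Println("check late: ", checkLate().Round(10*time.Millisecond))
}
```

With the early check the two durations add up; with the late check the total is roughly the longer of the two, since the propagation overlaps with the other steps. How much a real start saves depends on how long the remaining steps take compared with the propagation.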

Sequence diagram for the updated Start process with delayed pull secret check

sequenceDiagram
    actor User
    participant client as client.Start()
    participant cluster as Cluster Operations
    participant sshRunner as SSH Runner

    User->>client: Initiate Start(startConfig)
    activate client

    client->>cluster: UpdateHostMCDToken(...)
    activate cluster
    cluster-->>client: Token Updated
    deactivate cluster

    client->>cluster: AddSSHKeyToMachine(...)
    activate cluster
    cluster-->>client: SSH Key Added
    deactivate cluster

    client->>cluster: UpdateUserPasswords(...)
    activate cluster
    cluster-->>client: Passwords Updated
    deactivate cluster

    client->>cluster: EnsurePersistentVolume(...)
    activate cluster
    cluster-->>client: Volume Ensured
    deactivate cluster

    client->>cluster: WaitForProxyConfig(...)
    activate cluster
    cluster-->>client: Proxy Config Ready
    deactivate cluster

    client->>cluster: WaitForClusterToBeReachable(...)
    activate cluster
    cluster-->>client: Cluster Reachable
    deactivate cluster

    client->>cluster: WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner)
    activate cluster
    cluster->>sshRunner: Check pull secret on disk
    activate sshRunner
    sshRunner-->>cluster: Disk check status
    deactivate sshRunner
    cluster-->>client: Pull Secret Present
    deactivate cluster

    client->>client: waitForProxyPropagation(...)

    client-->>User: Start Process Complete
    deactivate client

Class diagram for the modified client type

classDiagram
  class client {
    +Start(Context, StartConfig)
  }

File-Level Changes

Change: Delay pull secret disk presence check to the end of the Start sequence
Details:
  • Removed the early call to WaitForPullSecretPresentOnInstanceDisk before password updates
  • Inserted the pull secret check after UpdateUserPasswords and readiness log handling
  • Adjusted error wrapping to reflect the new call site
Files: pkg/crc/machine/start.go

@openshift-ci openshift-ci bot requested review from cfergeau and lstocchi May 30, 2025 15:54
@praveenkumar
Member Author

/retest

@sourcery-ai sourcery-ai bot left a comment

Hey @praveenkumar - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Documentation: all looks good

if err := cluster.WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner); err != nil {
return nil, errors.Wrap(err, "Failed to update pull secret on the disk")
}

if err := cluster.UpdateUserPasswords(ctx, ocConfig, startConfig.KubeAdminPassword, startConfig.DeveloperPassword); err != nil {
return nil, errors.Wrap(err, "Failed to update kubeadmin user password")

suggestion: Refine error message to reflect waiting operation

The error message should indicate a failure in checking for the pull secret's presence, not updating it. Suggest: Failed initial pull secret presence check.

@praveenkumar praveenkumar force-pushed the delay_pull_secret_check branch from 0423b15 to 1e97ca1 Compare May 30, 2025 15:59
openshift-ci bot commented Jun 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anjannath

@openshift-ci openshift-ci bot added the approved label Jun 5, 2025
@@ -614,6 +610,10 @@ func (client *client) Start(ctx context.Context, startConfig types.StartConfig)
logging.Warnf("Cluster is not ready: %v", err)
}

if err := cluster.WaitForPullSecretPresentOnInstanceDisk(ctx, sshRunner); err != nil {
Contributor

can cluster.StartMonitoring() and cluster.WaitForClusterStable succeed if the pull secret is not written yet to disk?

Member Author

We don't have the monitoring-related images cached in the bundle, so when a user wants to use the monitoring operator they have to wait for those images, which are downloaded using the pull secret. In our case we don't dictate when the pull secret becomes available on disk, since the MCO is responsible for that. Now assume the monitoring operator starts while the pull secret is not yet on disk: it will fail and retry, but it eventually succeeds as soon as the images are pulled.

Member Author

@cfergeau any thought on that?

Contributor

If there is some code which needs the pull secret to be on disk to work, and we have a "wait until it works" function for this code, then I’d prefer if cluster.WaitForPullSecretPresentOnInstanceDisk was called before it.
Otherwise the "wait until it works" function will also act as a hidden cluster.WaitForPullSecretPresentOnInstanceDisk, and cluster.WaitForPullSecretPresentOnInstanceDisk will almost be a no-op.

Member Author

@cfergeau We don't have any code that directly depends on the pull secret being present on disk. Instead, we execute a set of oc commands, and some actions (for example, those involving monitoring or marketplace) pull container images. These operations require the pull secret, and if it's not available, they can result in image pull errors. However, these errors are non-blocking and are retried, so they don't cause a complete failure.

Our current goal in this PR is to optimize startup time by checking for the presence of the pull secret on disk only at the end of the process. In fact, we could go a step further and consider removing this check entirely. Once the pull secret is patched, it's the responsibility of the Machine Config Operator (MCO) to propagate it to the disk. Our check is effectively a redundant safeguard as of now.

@praveenkumar praveenkumar force-pushed the delay_pull_secret_check branch from 1e97ca1 to 24605d0 Compare June 11, 2025 04:47
@openshift-ci openshift-ci bot removed the lgtm label Jun 11, 2025
openshift-ci bot commented Jun 11, 2025

New changes are detected. LGTM label has been removed.

During CI failures, the following error is sometimes observed when
checking whether the pull secret is present on the disk. This PR makes
sure that SSH connectivity is available before checking for the pull secret.
```
DEBU SSH command results: err: ssh: unexpected packet in response to channel open: <nil>, output:
DEBU error: Temporary error: ssh command error:
```
@praveenkumar
Member Author

/retest

@praveenkumar
Member Author

@coderabbitai review

coderabbitai bot commented Jun 24, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Jun 24, 2025

Walkthrough

A connectivity check was added to the process of waiting for the pull secret file on the instance disk, ensuring SSH access before each retry. Additionally, the sequence of operations was adjusted so that waiting for the pull secret occurs after the cluster stabilization step rather than before it.

Changes

File(s) | Change Summary
pkg/crc/cluster/cluster.go | Added SSH connectivity check before each retry in WaitForPullSecretPresentOnInstanceDisk.
pkg/crc/machine/start.go | Moved the call to wait for pull secret presence to after the cluster stabilization step.

Poem

In the warren of code, a secret we seek,
But first, we must check if the tunnel’s not weak.
Wait for the cluster, let it grow strong,
Only then do we search where secrets belong.
With patience and pings, the rabbit prevails—
Secure in its burrow, where connectivity never fails! 🐇

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (1.64.8)

Error: you are using a configuration file for golangci-lint v2 with golangci-lint v1: please use golangci-lint v2
Failed executing command with error: you are using a configuration file for golangci-lint v2 with golangci-lint v1: please use golangci-lint v2

@praveenkumar
Member Author

@coderabbitai full review

coderabbitai bot commented Jun 24, 2025

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/crc/cluster/cluster.go (1)

450-452: Consider making connectivity check retryable.

The connectivity check is a good defensive measure, but returning immediately on connectivity failure may be too strict. If SSH connectivity is temporarily lost (network hiccup, VM resource contention), the entire function fails without retry.

Consider wrapping the connectivity check in a RetriableError to allow the retry mechanism to handle temporary connectivity issues:

 		if err := sshRunner.WaitForConnectivity(ctx, 30*time.Second); err != nil {
-			return err
+			return &errors.RetriableError{Err: err}
 		}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e686f31 and d4c2438.

📒 Files selected for processing (2)
  • pkg/crc/cluster/cluster.go (1 hunks)
  • pkg/crc/machine/start.go (1 hunks)
🔇 Additional comments (1)
pkg/crc/machine/start.go (1)

613-616: LGTM! Sequence optimization improves start time.

Moving the pull secret disk check to after cluster stabilization is a good optimization. This allows other startup operations to proceed while the Machine Config Operator propagates the pull secret, reducing the overall blocking time as described in the PR objectives.

3 participants