[WIP] OCPEDGE-2176: two new disruptive fencing validation jobs for TNF #1486
Conversation
Signed-off-by: nhamza <[email protected]>
Walkthrough: Adds an in-cluster disruptive validation runner and CLI subcommand; sets a default PCS pcmk_delay_base for the first fencing config; propagates a dynamic client through operator controllers; introduces a disruptive JobType; and applies minor formatting changes.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Actionable comments posted: 2
🧹 Nitpick comments (10)
pkg/tnf/pkg/pcs/fencing.go (6)
20-20: Regex is brittle; it fails on IPv6 and can over-match due to greediness. Prefer URL parsing (after trimming the redfish+ prefix) or a tighter, anchored pattern.
Apply:
```diff
-var addressRegEx = regexp.MustCompile(`.*//(.*):(.*)(/redfish.*)`)
+// host may be hostname, IPv4, or bracketed IPv6; require an explicit port
+var addressRegEx = regexp.MustCompile(`^.+?//([^/:]+|\[[^\]]+\]):([0-9]+)(/redfish.*)$`)
```
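For comparison, here is a minimal, self-contained sketch of the net/url alternative mentioned above; the redfish+ prefix handling and the splitRedfishAddress name are illustrative assumptions, not code from this PR.

```go
// Sketch only: split a BMC address such as "redfish+https://host:443/redfish/v1/..."
// with net/url instead of a regex. Hostname()/Port() also handle bracketed IPv6.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func splitRedfishAddress(addr string) (host, port, path string, err error) {
	// Drop an assumed "redfish+" scheme prefix so url.Parse sees a plain URL.
	u, err := url.Parse(strings.TrimPrefix(addr, "redfish+"))
	if err != nil {
		return "", "", "", fmt.Errorf("parse BMC address %q: %w", addr, err)
	}
	return u.Hostname(), u.Port(), u.Path, nil
}

func main() {
	h, p, pth, _ := splitRedfishAddress("redfish+https://[fd00::1]:443/redfish/v1/Systems/1")
	fmt.Println(h, p, pth) // fd00::1 443 /redfish/v1/Systems/1
}
```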
319-325: Substring match can produce false positives (e.g., node1 vs node10). Tokenize or use word boundaries.
```diff
-func isOnline(output, name string) bool {
-	short := name
-	if i := strings.IndexByte(name, '.'); i > 0 {
-		short = name[:i]
-	}
-	return strings.Contains(output, name) || strings.Contains(output, short)
-}
+func isOnline(output, name string) bool {
+	short := name
+	if i := strings.IndexByte(name, '.'); i > 0 {
+		short = name[:i]
+	}
+	split := func(r rune) bool { return r == ' ' || r == '\t' || r == '\n' || r == ',' || r == '[' || r == ']' }
+	toks := strings.FieldsFunc(output, split)
+	for _, t := range toks {
+		if strings.EqualFold(t, name) || strings.EqualFold(t, short) {
+			return true
+		}
+	}
+	return false
+}
```
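If the tokenized variant is adopted, a small hypothetical test (package name and pcs output line assumed) can pin down the node1 vs node10 boundary behavior:

```go
package pcs

import "testing"

// Assumes the FieldsFunc-based isOnline shown in the diff above.
func TestIsOnlineWordBoundaries(t *testing.T) {
	out := "Online: [ node10 node2 ]"
	if isOnline(out, "node1") {
		t.Fatalf("node1 must not match the node10 token")
	}
	if !isOnline(out, "node10.example.com") {
		t.Fatalf("short name node10 should match")
	}
}
```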
327-344: Treat transient pcs errors as transient during waits. The current code aborts on the first pcs status error; continue polling instead.
```diff
-	out, pcsErr := pcsNodesOnline(ctx)
-	if pcsErr != nil {
-		return fmt.Errorf("pcs status failed while waiting OFFLINE: %w", pcsErr)
-	}
-	if !isOnline(out, name) {
-		return nil
-	}
+	out, pcsErr := pcsNodesOnline(ctx)
+	if pcsErr != nil {
+		klog.V(2).Infof("pcs status transient error: %v", pcsErr)
+		// fall through to sleep
+	} else if !isOnline(out, name) {
+		return nil
+	}

-	out, pcsErr := pcsNodesOnline(ctx)
-	if pcsErr != nil {
-		return fmt.Errorf("pcs status failed while waiting ONLINE: %w", pcsErr)
-	}
-	if isOnline(out, name) {
-		return nil
-	}
+	out, pcsErr := pcsNodesOnline(ctx)
+	if pcsErr == nil && isOnline(out, name) {
+		return nil
+	}
```
Also applies to: 346-363
384-403: Health check aborts on the first member error. Consider counting healthy members and treating individual health-probe errors as transient.
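A rough sketch of the counting approach, assuming a per-member probe; probeMemberHealth and the package layout are assumptions, not the PR's code:

```go
package pcs

import (
	"context"

	"k8s.io/klog/v2"
)

// probeMemberHealth is a hypothetical stand-in for the real per-member check.
func probeMemberHealth(ctx context.Context, endpoint string) error { return nil }

// countHealthyMembers treats individual probe failures as transient and only
// reports how many members answered healthy; the caller decides if that is enough.
func countHealthyMembers(ctx context.Context, endpoints []string) int {
	healthy := 0
	for _, ep := range endpoints {
		if err := probeMemberHealth(ctx, ep); err != nil {
			klog.V(2).Infof("health probe for %s failed, treating as transient: %v", ep, err)
			continue
		}
		healthy++
	}
	return healthy
}
```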
424-429: Log stdout on fencing failure for easier triage. It is also fine to keep quoting with %q.
```diff
-	_, stdErr, fenceErr := exec.Execute(ctx, cmd)
+	stdOut, stdErr, fenceErr := exec.Execute(ctx, cmd)
 	if fenceErr != nil {
-		klog.Error(fenceErr, "pcs stonith fence failed", "stderr", stdErr)
+		klog.Error(fenceErr, "pcs stonith fence failed", "stdout", stdOut, "stderr", stdErr)
 		return fmt.Errorf("pcs stonith fence %q failed: %w", target, fenceErr)
 	}
```
440-440: Nit: stray leading space in the log message.
```diff
-	klog.Infof(" peer-only fencing validation passed for peer %q", target)
+	klog.Infof("peer-only fencing validation passed for peer %q", target)
```
pkg/tnf/setup/runner.go (4)
35-41: Cancel the context on all exit paths. Add a top-level defer cancel().
```diff
-	ctx, cancel := context.WithCancel(context.Background())
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
```
52-57: The protobuf client config is created but not used. Either use it for kubeClient or drop protoConfig.
```diff
-// This kube client use protobuf, do not use it for CR
-kubeClient, err := kubernetes.NewForConfig(clientConfig)
+// Use protobuf for core kube API (not for CRDs)
+kubeClient, err := kubernetes.NewForConfig(protoConfig)
```
75-81: Ensure informers stop on early returns too. Close stopCh via defer and drop the extra closer goroutine.
```diff
-stopCh := make(chan struct{})
-go func() {
-	<-ctx.Done()
-	close(stopCh)
-}()
+stopCh := make(chan struct{})
+defer close(stopCh)
```
183-187: Parse the env flag robustly (case-insensitive, 1/true/yes).
```diff
-if os.Getenv("TNF_VALIDATE_PEER_ONLY") == "true" {
+if v, _ := strconv.ParseBool(os.Getenv("TNF_VALIDATE_PEER_ONLY")); v {
```
Remember to import strconv.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to the "Reviews -> Disable Knowledge Base" setting
📒 Files selected for processing (2)
- pkg/tnf/pkg/pcs/fencing.go (2 hunks)
- pkg/tnf/setup/runner.go (3 hunks)
🔇 Additional comments (2)
pkg/tnf/setup/runner.go (1)
91-98: Don't add defer ec.Close() — EtcdClient is an interface without Close(). The EtcdClient in pkg/etcdcli/interfaces.go is an interface and does not define Close(); calling ec.Close() in pkg/tnf/setup/runner.go will not compile. Either add Close() to the EtcdClient interface or ensure the concrete client is closed via its public API.
Likely an incorrect or invalid review comment.
pkg/tnf/pkg/pcs/fencing.go (1)
314-317: Incorrect — exec.Execute already invokes a shell (/bin/bash -c). pkg/tnf/pkg/exec/exec.go builds hostCommand := []string{"/usr/bin/nsenter", "-a", "-t 1", "/bin/bash", "-c"} and calls exec.CommandContext(...), so shell operators like || will be interpreted; the suggested refactor is unnecessary.
Likely an incorrect or invalid review comment.
@Neilhamza: This pull request references OCPEDGE-2176, which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Signed-off-by: nhamza <[email protected]>
/retest
/retest-required
Signed-off-by: nhamza <[email protected]>
Signed-off-by: nhamza <[email protected]>
/retest
I left some comments, but the direction seems Ok to me
pkg/tnf/disruptivevalidate/runner.go (outdated)
```go
func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
```
What is the reason to pass context and kubernetes.Interface if not used?
It was for consistency and to keep it future-ready for any API usage, but I will remove it for now.
```go
func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
	podName, err := os.Hostname()
	if err != nil || strings.TrimSpace(podName) == "" {
		return "", "", fmt.Errorf("get pod hostname: %w", err)
```
If the err is nil, but strings.TrimSpace(podName) returns an empty string, the resulting error message won't be very helpful.
true! updated
pkg/tnf/disruptivevalidate/runner.go (outdated)
```go
return wait.PollUntilContextTimeout(ctx, poll, timeoutPeerJob, true, func(context.Context) (bool, error) {
	j, err := kc.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, target, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, nil
	}
	klog.V(2).Infof("peer %s status: succeeded=%d failed=%d conditions=%+v", target, j.Status.Succeeded, j.Status.Failed, j.Status.Conditions)

	// Only treat as failed if the JobFailed condition is set
	if tools.IsConditionTrue(j.Status.Conditions, batchv1.JobFailed) {
		return false, fmt.Errorf("peer validate job %s failed", target)
	}
	// Proceed when the peer is complete
	return j.Status.Succeeded > 0 || tools.IsConditionTrue(j.Status.Conditions, batchv1.JobComplete), nil
})
```
This looks very similar to waitForLabeledJob to me (well except the logging). Any chance we can label the validation job too, and reuse the same function?
I already label the jobs just like we do with the other jobs. However, the label is not unique per node, so I had to find a way to detect this specific job for this node; for that case I merged the logic into one function (see the sketch below).
Labeling each node's job uniquely would require logic changes in other places, which could be fragile for the other jobs as well, and I would rather not touch them.
It should look better now.
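For readers following the thread, a rough, hypothetical sketch of what such a single name-based waiter could look like; the merged helper in runner.go may differ in signature, namespace handling, and timeouts.

```go
package disruptivevalidate

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForJobByName polls until the named Job completes or fails.
// Poll interval and error wording are illustrative.
func waitForJobByName(ctx context.Context, kc kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		j, err := kc.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// Not created yet, or a transient API error: keep polling.
			return false, nil
		}
		for _, c := range j.Status.Conditions {
			if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
				return false, fmt.Errorf("job %s failed", name)
			}
			if c.Type == batchv1.JobComplete && c.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		return j.Status.Succeeded > 0, nil
	})
}
```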
Signed-off-by: nhamza <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/fencing.go (1)
64-66: Guard against overwriting explicit pcmk_delay_base values. Right now Line 65 unconditionally sets pcmk_delay_base on the primary device. If the secret (or future config) ever supplies an explicit delay, we’ll stomp it while building the option map. Please fence the assignment so we only apply the default when the option is absent.
```diff
-	if i == 0 {
-		fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
-	}
+	if i == 0 {
+		if _, exists := fc.FencingDeviceOptions[PcmkDelayBase]; !exists {
+			fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
+		}
+	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to the "Reviews -> Disable Knowledge Base" setting
📒 Files selected for processing (6)
- cmd/tnf-setup-runner/main.go (3 hunks)
- pkg/tnf/disruptivevalidate/runner.go (1 hunks)
- pkg/tnf/operator/starter.go (5 hunks)
- pkg/tnf/pkg/pcs/fencing.go (1 hunks)
- pkg/tnf/pkg/tools/jobs.go (2 hunks)
- pkg/tnf/setup/runner.go (0 hunks)
💤 Files with no reviewable changes (1)
- pkg/tnf/setup/runner.go
🔇 Additional comments (2)
pkg/tnf/pkg/tools/jobs.go (1)
29-43: Disruptive validate JobType wiring looks solid. The new enum entry on Line 29 plus the subcommand branch on Lines 42-43 keep the naming helpers consistent; the downstream job controllers will pick up disruptive-validate without extra glue.
pkg/tnf/operator/starter.go (1)
74-109: Nice sequencing tie-in for disruptive-validate jobs. Adding the per-node controller on Lines 106-109 alongside auth and after-setup keeps the new disruptive validation workflow under the existing job orchestration umbrella.
Signed-off-by: nhamza <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to the "Reviews -> Disable Knowledge Base" setting
📒 Files selected for processing (1)
- pkg/tnf/disruptivevalidate/runner.go (1 hunks)
Signed-off-by: nhamza <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (5)
pkg/tnf/disruptivevalidate/runner.go (5)
99-107: Surface stderr from pcs on failure for actionable diagnostics. pcs often writes errors to stderr; using only stdout may hide the real cause.
```diff
-	out, _, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
+	out, errOut, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
 	if ferr != nil {
-		ls := out
+		// Prefer stderr last line when available.
+		ls := errOut
+		if strings.TrimSpace(ls) == "" {
+			ls = out
+		}
 		if i := strings.LastIndex(ls, "\n"); i >= 0 && i+1 < len(ls) {
 			ls = ls[i+1:]
 		}
 		return fmt.Errorf("pcs fence %s failed: %w (last line: %s)", peer, ferr, strings.TrimSpace(ls))
 	}
```
191-200: Consider removing the now-redundant allowNeverSeenTTL plumbing. If you adopt the fix above, allowNeverSeenTTL becomes unnecessary. You can simplify by inlining waitForJobNamePeerTTL to call waitForJobName and drop the extra parameter from waitForJobs.
```diff
-func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
-	return waitForJobs(ctx, kc, name, "", 1, to, true) // allowNeverSeenTTL = true
-}
+func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
+	// With strict semantics (never-seen ≠ completed), reuse the regular waiter.
+	return waitForJobName(ctx, kc, name, to)
+}
```
And (optional) remove the allowNeverSeenTTL argument from waitForJobs.
249-263: Tighten JSON unmarshal handling for etcdctl output. Capture and log the unmarshal error instead of discarding it; this helps when etcdctl prints transient HTML or errors.
```diff
-	if json.Unmarshal([]byte(out), &ml) != nil {
-		return false, nil
-	}
+	if err := json.Unmarshal([]byte(out), &ml); err != nil {
+		klog.V(3).Infof("invalid etcdctl member list output: %v", err)
+		return false, nil
+	}
```
282-300: pcs status parsing: handle the "Online: N nodes [ ... ]" variant robustly. Current parsing works in most cases, but pcs can emit "Online: 2 nodes [ a b ]". Your trimming covers brackets, but tokenization may include the count or "nodes". Consider explicitly scanning within the brackets when present to avoid false negatives.
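A possible bracket-aware parse, sketched under the assumption that pcs emits either "Online: [ a b ]" or "Online: 2 nodes [ a b ]"; this is not the PR's actual implementation.

```go
package disruptivevalidate

import "strings"

// onlineNodes extracts node names from the "Online:" line of `pcs status` output.
// When a bracketed list is present it is preferred, so a leading "2 nodes" count
// can never be mistaken for node names.
func onlineNodes(pcsStatus string) []string {
	for _, line := range strings.Split(pcsStatus, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "Online:") {
			continue
		}
		if open := strings.Index(line, "["); open >= 0 {
			if closeIdx := strings.Index(line, "]"); closeIdx > open {
				return strings.Fields(line[open+1 : closeIdx])
			}
		}
		return strings.Fields(strings.TrimPrefix(line, "Online:"))
	}
	return nil
}
```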
28-36: Nit: clarify timeout names. timeoutAfter reuses SetupJobCompletedTimeout; consider renaming it to timeoutAfterSetup for readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to the "Reviews -> Disable Knowledge Base" setting
📒 Files selected for processing (1)
- pkg/tnf/disruptivevalidate/runner.go (1 hunks)
🔇 Additional comments (2)
pkg/tnf/disruptivevalidate/runner.go (2)
211-221: Hostname-to-JobName derivation can be brittle across controller versions. Assuming Pod name = "-" works for current Jobs, but formats can vary. If feasible, inject the job name via a Downward API env var, or read the ownerRefs of the local Pod (if RBAC allows) to avoid string heuristics.
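As a sketch of the Downward API route: Pods created by a Job carry the batch.kubernetes.io/job-name label (and the older job-name label on earlier Kubernetes versions), which can be surfaced as an env var via a fieldRef. The env var name and helper below are assumptions, not code from this PR.

```go
package disruptivevalidate

import corev1 "k8s.io/api/core/v1"

// jobNameDownwardEnv returns an env var that exposes the owning Job's name to the
// container through the Downward API, avoiding hostname parsing in the runner.
func jobNameDownwardEnv() corev1.EnvVar {
	return corev1.EnvVar{
		Name: "TNF_JOB_NAME", // assumed name; read with os.Getenv in the runner
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "metadata.labels['batch.kubernetes.io/job-name']",
			},
		},
	}
}
```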
136-147: Remove the unsafe TTL fallback for never-seen Jobs. allowNeverSeenTTL can prematurely mark a non-existent Job as complete, bypassing sequencing safety. Only assume TTL-after-finish if the Job was observed at least once.
Apply to pkg/tnf/disruptivevalidate/runner.go:
```diff
@@ -142,7 +142,6 @@
 	if seen {
 		klog.V(2).Infof("job %s disappeared after observation; assuming TTL after completion", byName)
 		return true, nil
 	}
-	if allowNeverSeenTTL && time.Since(start) > appearanceGrace {
-		klog.V(2).Infof("job %s not found for %s; assuming completed earlier and TTL-deleted", byName, appearanceGrace)
-		return true, nil
-	}
 	return false, nil
```
- Optional: to support a peer-already-TTL’d scenario, either disable TTLSecondsAfterFinished on disruptive-validate Jobs or switch to a durable completion signal.
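One illustrative shape for such a durable completion signal would be a small ConfigMap marker written by the finishing job; all names here are hypothetical and this is not part of the PR.

```go
package disruptivevalidate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markValidateDone records completion in a ConfigMap so the peer can still observe
// it after the Job object has been TTL-deleted.
func markValidateDone(ctx context.Context, kc kubernetes.Interface, ns, node string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "tnf-disruptive-validate-" + node, Namespace: ns},
		Data:       map[string]string{"completed": "true"},
	}
	_, err := kc.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return err
}
```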
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: clobrano, Neilhamza. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Neilhamza: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
New jobs
Lock mechanism (how both nodes get fenced safely)
Orchestration (who waits on whom)
CANNOT BE MERGED. BLOCKED BY: OCPBUGS-42808, which is related to https://issues.redhat.com/browse/ETCD-673
How the jobs will look:
