
Conversation


@Neilhamza Neilhamza commented Sep 17, 2025

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute the lexicographic order of the two node names: the first (min) proceeds; the second (max) blocks. (A minimal sketch follows this list.)
  • The blocker polls the peer’s Disruptive-Validate Job and proceeds only after it succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.
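
A minimal Go sketch of this sequencing lock. The namespace, poll intervals, and the fencePeer helper are assumptions for illustration, not the PR's actual code:

import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const jobPrefix = "tnf-disruptive-validate-job-"

// runSequencedValidate derives the local node from the Job name, orders the
// two nodes lexicographically, and makes the "max" node wait for the "min"
// node's Job to finish before fencing its peer.
func runSequencedValidate(ctx context.Context, kc kubernetes.Interface, jobName, nodeA, nodeB string) error {
	local := strings.TrimPrefix(jobName, jobPrefix)
	peer := nodeA
	if local == nodeA {
		peer = nodeB
	}
	if local > peer { // lexicographic max: block on the peer's Job first
		err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
			func(ctx context.Context) (bool, error) {
				j, err := kc.BatchV1().Jobs("openshift-etcd").Get(ctx, jobPrefix+peer, metav1.GetOptions{})
				if err != nil {
					return false, nil // not found or transient error: keep polling
				}
				for _, c := range j.Status.Conditions {
					if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
						return false, fmt.Errorf("peer validate job %s failed", j.Name)
					}
				}
				return j.Status.Succeeded > 0, nil
			})
		if err != nil {
			return err
		}
	}
	return fencePeer(ctx, peer) // hypothetical helper wrapping `pcs stonith fence <peer>`
}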

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits; see the sketch after this list).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.
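
The label-based gating can be pictured as a counting wait. The label key/values and namespace below are assumptions, not the PR's actual labels:

// waitForLabeledJobs blocks until at least "need" Jobs matching the selector
// have succeeded; transient list errors just extend the poll.
func waitForLabeledJobs(ctx context.Context, kc kubernetes.Interface, selector string, need int) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 20*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			jobs, err := kc.BatchV1().Jobs("openshift-etcd").List(ctx, metav1.ListOptions{LabelSelector: selector})
			if err != nil {
				return false, nil // transient API error: keep polling
			}
			done := 0
			for _, j := range jobs.Items {
				if j.Status.Succeeded > 0 {
					done++
				}
			}
			return done >= need, nil
		})
}

// Gate for one disruptive-validate run (label values are made up):
//   waitForLabeledJobs(ctx, kc, "tnf-job=setup", 1)
//   waitForLabeledJobs(ctx, kc, "tnf-job=fencing", 1)
//   waitForLabeledJobs(ctx, kc, "tnf-job=after-setup", 2)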

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-42808, which is related to https://issues.redhat.com/browse/ETCD-673

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2025

coderabbitai bot commented Sep 17, 2025

Walkthrough

Adds an in-cluster disruptive validation runner and CLI subcommand; sets a default PCS pcmk_delay_base for the first fencing config; propagates a dynamic client through operator controllers; introduces a disruptive JobType; and applies minor formatting changes.

Changes

  • Disruptive validation runner & CLI (pkg/tnf/disruptivevalidate/runner.go, cmd/tnf-setup-runner/main.go): Adds RunDisruptiveValidate() implementing in-cluster client setup, job-waiting, fencing orchestration, health checks, and helper utilities; adds NewDisruptiveValidateCommand() and registers the disruptive-validate CLI subcommand.
  • PCS fencing tweaks (pkg/tnf/pkg/pcs/fencing.go): Moves addressRegEx to file scope, adds defaultPcmkDelayBase = "10s", and assigns pcmk_delay_base to the first fencing config entry; removes extraneous blank lines (formatting).
  • Operator controller signature changes (pkg/tnf/operator/starter.go): Propagates a dynamicClient by updating HandleDualReplicaClusters and adjusting the runExternalEtcdSupportController and runTnfResourceController signatures and call sites to accept the additional client parameter.
  • Job type enum / mapping (pkg/tnf/pkg/tools/jobs.go): Adds the JobTypeDisruptiveValidate constant and maps it to the "disruptive-validate" subcommand in GetSubCommand (a sketch of this wiring follows below).
  • Setup runner minor edit (pkg/tnf/setup/runner.go): Removes an extra blank line after the RunTnfSetup function signature (formatting only).
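
As a rough illustration of the jobs.go wiring described above. Only JobTypeDisruptiveValidate and the "disruptive-validate" subcommand come from the summary; the other enum values are inferred from the job names in the PR description:

type JobType int

const (
	JobTypeSetup JobType = iota
	JobTypeFencing
	JobTypeAfterSetup
	JobTypeAuth
	JobTypeDisruptiveValidate // added by this PR
)

// GetSubCommand maps a JobType to its tnf-setup-runner CLI subcommand.
func GetSubCommand(t JobType) string {
	switch t {
	case JobTypeSetup:
		return "setup"
	case JobTypeFencing:
		return "fencing"
	case JobTypeAfterSetup:
		return "after-setup"
	case JobTypeAuth:
		return "auth"
	case JobTypeDisruptiveValidate:
		return "disruptive-validate"
	default:
		return ""
	}
}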

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 10.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check ⚠️ Warning: The title includes a “[WIP]” tag, highlights only the disruptive fencing validation jobs while omitting related support changes such as the new CLI command and operator updates, and contains noise that detracts from a concise summary of the main change. Resolution: remove the “[WIP]” prefix and revise the title to clearly and concisely summarize the primary change (for example, adding TNF disruptive fencing validation commands and jobs) without extraneous tags.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: The provided description clearly outlines the new per-node disruptive validation jobs and how they are sequenced, matching the changes introduced across multiple files. It details the lock mechanism, orchestration flow, and merge-blocker references, all of which directly tie to the code modifications. This indicates a relevant, topic-aligned description that adequately informs reviewers of the PR's intent.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests:
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from clobrano and slintes September 17, 2025 13:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
pkg/tnf/pkg/pcs/fencing.go (6)

20-20: Regex is brittle; fails IPv6 and can over-match due to greediness.

Prefer URL parsing (after trimming the redfish+ prefix) or a tighter, anchored pattern.

Apply:

-var addressRegEx = regexp.MustCompile(`.*//(.*):(.*)(/redfish.*)`)
+// host may be hostname, IPv4, or bracketed IPv6; require an explicit port
+var addressRegEx = regexp.MustCompile(`^.+?//([^/:]+|\[[^\]]+\]):([0-9]+)(/redfish.*)$`)
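
For reference, the suggested pattern can be exercised quickly; the sample addresses below are made up:

package main

import (
	"fmt"
	"regexp"
)

var addressRegEx = regexp.MustCompile(`^.+?//([^/:]+|\[[^\]]+\]):([0-9]+)(/redfish.*)$`)

func main() {
	for _, addr := range []string{
		"redfish+https://192.168.111.1:8000/redfish/v1/Systems/1",
		"redfish+https://[fe80::1]:8000/redfish/v1/Systems/1", // bracketed IPv6 host
	} {
		if m := addressRegEx.FindStringSubmatch(addr); m != nil {
			fmt.Printf("host=%s port=%s path=%s\n", m[1], m[2], m[3])
		}
	}
}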

319-325: Substring match can produce false positives (e.g., node1 vs node10).

Tokenize or use boundaries.

-func isOnline(output, name string) bool {
-	short := name
-	if i := strings.IndexByte(name, '.'); i > 0 {
-		short = name[:i]
-	}
-	return strings.Contains(output, name) || strings.Contains(output, short)
-}
+func isOnline(output, name string) bool {
+	short := name
+	if i := strings.IndexByte(name, '.'); i > 0 {
+		short = name[:i]
+	}
+	split := func(r rune) bool { return r == ' ' || r == '\t' || r == '\n' || r == ',' || r == '[' || r == ']' }
+	toks := strings.FieldsFunc(output, split)
+	for _, t := range toks {
+		if strings.EqualFold(t, name) || strings.EqualFold(t, short) {
+			return true
+		}
+	}
+	return false
+}
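
The tokenized version is safer because each candidate is compared as a whole field, so "node1" no longer matches inside "node10", and EqualFold tolerates case differences in hostnames; splitting on brackets and commas also copes with pcs output like "Online: [ node1 node2 ]".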

327-344: Treat pcs status errors as transient during waits.

Current code aborts on first pcs status error; continue polling instead.

-		out, pcsErr := pcsNodesOnline(ctx)
-		if pcsErr != nil {
-			return fmt.Errorf("pcs status failed while waiting OFFLINE: %w", pcsErr)
-		}
+		out, pcsErr := pcsNodesOnline(ctx)
+		if pcsErr != nil {
+			klog.V(2).Infof("pcs status transient error: %v", pcsErr)
+			// fall through to sleep
+		} else if !isOnline(out, name) {
+			return nil
+		}
-		if !isOnline(out, name) {
-			return nil
-		}
And for the ONLINE wait:

-		out, pcsErr := pcsNodesOnline(ctx)
-		if pcsErr != nil {
-			return fmt.Errorf("pcs status failed while waiting ONLINE: %w", pcsErr)
-		}
-		if isOnline(out, name) {
-			return nil
-		}
+		out, pcsErr := pcsNodesOnline(ctx)
+		if pcsErr == nil && isOnline(out, name) {
+			return nil
+		}

Also applies to: 346-363


384-403: Health check aborts on first member error.

Consider counting healthy members and treating individual health probe errors as transient.


424-429: Log stdout on fencing failure for easier triage.

Also OK to keep quoting with %q.

-	_, stdErr, fenceErr := exec.Execute(ctx, cmd)
+	stdOut, stdErr, fenceErr := exec.Execute(ctx, cmd)
 	if fenceErr != nil {
-		klog.Error(fenceErr, "pcs stonith fence failed", "stderr", stdErr)
+		klog.Error(fenceErr, "pcs stonith fence failed", "stdout", stdOut, "stderr", stdErr)
 		return fmt.Errorf("pcs stonith fence %q failed: %w", target, fenceErr)
 	}

440-440: Nit: stray leading space in log message.

-	klog.Infof(" peer-only fencing validation passed for peer %q", target)
+	klog.Infof("peer-only fencing validation passed for peer %q", target)
pkg/tnf/setup/runner.go (4)

35-41: Cancel context on all exit paths.

Add a top-level defer cancel().

-	ctx, cancel := context.WithCancel(context.Background())
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()

52-57: Protobuf client config is created but not used.

Either use it for kubeClient or drop protoConfig.

-// This kube client use protobuf, do not use it for CR
-kubeClient, err := kubernetes.NewForConfig(clientConfig)
+// Use protobuf for core kube API (not for CRDs)
+kubeClient, err := kubernetes.NewForConfig(protoConfig)

75-81: Ensure informers stop on early returns too.

Close stopCh via defer and drop the extra closer goroutine.

-stopCh := make(chan struct{})
-go func() {
-	<-ctx.Done()
-	close(stopCh)
-}()
+stopCh := make(chan struct{})
+defer close(stopCh)

183-187: Parse the env flag robustly (case-insensitive, 1/true/yes).

-if os.Getenv("TNF_VALIDATE_PEER_ONLY") == "true" {
+if v, _ := strconv.ParseBool(os.Getenv("TNF_VALIDATE_PEER_ONLY")); v {

Remember to import strconv.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and 530c521.

📒 Files selected for processing (2)
  • pkg/tnf/pkg/pcs/fencing.go (2 hunks)
  • pkg/tnf/setup/runner.go (3 hunks)
🔇 Additional comments (2)
pkg/tnf/setup/runner.go (1)

91-98: Don't add defer ec.Close() — EtcdClient is an interface without Close().

The EtcdClient in pkg/etcdcli/interfaces.go is an interface and does not define Close(); calling ec.Close() in pkg/tnf/setup/runner.go will not compile. Either add Close() to the EtcdClient interface or ensure the concrete client is closed via its public API.

Likely an incorrect or invalid review comment.

pkg/tnf/pkg/pcs/fencing.go (1)

314-317: Incorrect — exec.Execute already invokes a shell (/bin/bash -c)

pkg/tnf/pkg/exec/exec.go builds hostCommand := []string{"/usr/bin/nsenter", "-a", "-t 1", "/bin/bash", "-c"} and calls exec.CommandContext(...), so shell operators like || will be interpreted; the suggested refactor is unnecessary.

Likely an incorrect or invalid review comment.

@Neilhamza Neilhamza changed the title from “[WIP] disruptive fencing validation TNF” to “[WIP] OCPEDGE-2176: disruptive fencing validation TNF” Sep 17, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 17, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 17, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

  • ValidateDisruptivePeerOnly flow to test fencing a peer node.
  • Adds fencing validation logic (pcs stonith + etcd quorum check).
  • Integrates into the TNF setup runner behind an environment gate (TNF_VALIDATE_PEER_ONLY).
  • Improves safety with a self-fence guard, explicit error handling, and pcs/etcd health waits.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Neilhamza
Author

/retest

@Neilhamza
Author

/retest-required

@openshift-ci-robot

openshift-ci-robot commented Sep 18, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces a new disruptive-validate job that runs on each control-plane node, ensuring fencing validation is performed sequentially (one node at a time). Adds preflight checks, pacemaker integration, and peer state monitoring to validate proper recovery after disruptive fencing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-auth-job- (per node): prepares node-local prerequisites used by later TNF stages (auth/env checks required for host tooling).
  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node:
      • pre-fence etcd check (2 voters),
      • PCS preflight (pcs present, pacemaker active, peer ONLINE),
      • pcs stonith fence → wait peer OFFLINE → ONLINE,
      • post-fence etcd check (2 voters).

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-auth-job- (per node): prepares node-local prerequisites used by later TNF stages (auth/env checks required for host tooling).
  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node:
      • pre-fence etcd check (2 voters),
      • PCS preflight (pcs present, pacemaker active, peer ONLINE),
      • pcs stonith fence → wait peer OFFLINE → ONLINE,
      • post-fence etcd check (2 voters).

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

how the jobs will look like:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Neilhamza Neilhamza changed the title from “[WIP] OCPEDGE-2176: disruptive fencing validation TNF” to “[WIP] OCPEDGE-2176: introducing two new disruptive fencing validation jobs for TNF” Sep 19, 2025
@Neilhamza Neilhamza changed the title from “[WIP] OCPEDGE-2176: introducing two new disruptive fencing validation jobs for TNF” to “OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” Sep 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2025
@Neilhamza
Author

/retest

@openshift-ci-robot

openshift-ci-robot commented Sep 20, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-61117

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@clobrano clobrano left a comment


I left some comments, but the direction seems OK to me

})
}

func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
Contributor


What is the reason to pass context and kubernetes.Interface if not used?

Author


It was for consistency and to be future-ready for any API usage, but I will remove it for now.

func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
podName, err := os.Hostname()
if err != nil || strings.TrimSpace(podName) == "" {
return "", "", fmt.Errorf("get pod hostname: %w", err)
Contributor


If the err is nil, but strings.TrimSpace(podName) returns empty string, the resulting error message won't be very helpful

Author


True! Updated.

Comment on lines 175 to 191
return wait.PollUntilContextTimeout(ctx, poll, timeoutPeerJob, true, func(context.Context) (bool, error) {
j, err := kc.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, target, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
return false, nil
}
if err != nil {
return false, nil
}
klog.V(2).Infof("peer %s status: succeeded=%d failed=%d conditions=%+v", target, j.Status.Succeeded, j.Status.Failed, j.Status.Conditions)

// Only treat as failed if the JobFailed condition is set
if tools.IsConditionTrue(j.Status.Conditions, batchv1.JobFailed) {
return false, fmt.Errorf("peer validate job %s failed", target)
}
// Proceed when the peer is complete
return j.Status.Succeeded > 0 || tools.IsConditionTrue(j.Status.Conditions, batchv1.JobComplete), nil
})
Contributor


This looks very similar to waitForLabeledJob to me (well except the logging). Any chance we can label the validation job too, and reuse the same function?

Author


I already label the jobs just like we do with the other jobs. However, the label is not unique per node, so I had to find a way to detect this specific job for this node; for that case I merged the logic into one function. Labeling each node's job would require logic changes in other places, which could be fragile for the other jobs too, and I would rather not touch those. It should look better now.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/fencing.go (1)

64-66: Guard against overwriting explicit pcmk_delay_base values.

Right now Line 65 unconditionally sets pcmk_delay_base on the primary device. If the secret (or future config) ever supplies an explicit delay, we’ll stomp it while building the option map. Please fence the assignment so we only apply the default when the option is absent.

-		if i == 0 {
-			fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
-		}
+		if i == 0 {
+			if _, exists := fc.FencingDeviceOptions[PcmkDelayBase]; !exists {
+				fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
+			}
+		}
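
For background: in two-node Pacemaker clusters, pcmk_delay_base staggers fencing so that both nodes do not shoot each other at the same instant during a fence race; setting it on only the first device gives one side a deterministic head start, which is also why the default should not clobber an explicitly configured delay.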
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 530c521 and fc08add.

📒 Files selected for processing (6)
  • cmd/tnf-setup-runner/main.go (3 hunks)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/pkg/pcs/fencing.go (1 hunks)
  • pkg/tnf/pkg/tools/jobs.go (2 hunks)
  • pkg/tnf/setup/runner.go (0 hunks)
💤 Files with no reviewable changes (1)
  • pkg/tnf/setup/runner.go
🔇 Additional comments (2)
pkg/tnf/pkg/tools/jobs.go (1)

29-43: Disruptive validate JobType wiring looks solid.

The new enum entry on Line 29 plus the subcommand branch on Lines 42-43 keep the naming helpers consistent; the downstream job controllers will pick up disruptive-validate without extra glue.

pkg/tnf/operator/starter.go (1)

74-109: Nice sequencing tie-in for disruptive-validate jobs.

Adding the per-node controller on Lines 106-109 alongside auth and after-setup keeps the new disruptive validation workflow under the existing job orchestration umbrella.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fc08add and 6a2b9a4.

📒 Files selected for processing (1)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
pkg/tnf/disruptivevalidate/runner.go (5)

99-107: Surface stderr from pcs on failure for actionable diagnostics.

pcs often writes errors to stderr; using only stdout may hide the real cause.

-	out, _, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
+	out, errOut, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
 	if ferr != nil {
-		ls := out
+		// Prefer stderr last line when available.
+		ls := errOut
+		if strings.TrimSpace(ls) == "" {
+			ls = out
+		}
 		if i := strings.LastIndex(ls, "\n"); i >= 0 && i+1 < len(ls) {
 			ls = ls[i+1:]
 		}
 		return fmt.Errorf("pcs fence %s failed: %w (last line: %s)", peer, ferr, strings.TrimSpace(ls))
 	}

191-200: Consider removing now-redundant allowNeverSeenTTL plumbing.

If you adopt the fix above, allowNeverSeenTTL becomes unnecessary. You can simplify by inlining waitForJobNamePeerTTL to call waitForJobName and drop the extra parameter from waitForJobs.

-func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
-	return waitForJobs(ctx, kc, name, "", 1, to, true) // allowNeverSeenTTL = true
-}
+func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
+	// With strict semantics (never-seen ≠ completed), reuse the regular waiter.
+	return waitForJobName(ctx, kc, name, to)
+}

And (optional) remove the allowNeverSeenTTL argument from waitForJobs.


249-263: Tighten JSON unmarshal handling for etcdctl output.

Capture and log the unmarshal error instead of discarding it; helps when etcdctl prints transient HTML/errors.

-		if json.Unmarshal([]byte(out), &ml) != nil {
-			return false, nil
-		}
+		if err := json.Unmarshal([]byte(out), &ml); err != nil {
+			klog.V(3).Infof("invalid etcdctl member list output: %v", err)
+			return false, nil
+		}
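
For context, a minimal shape for counting voters from `etcdctl member list -w json`; the field names follow etcd's JSON output, but the struct is an assumption about the PR's ml type (imports "encoding/json" and "fmt" assumed):

type memberList struct {
	Members []struct {
		Name      string `json:"name"`
		IsLearner bool   `json:"isLearner"`
	} `json:"members"`
}

func countVoters(out string) (int, error) {
	var ml memberList
	if err := json.Unmarshal([]byte(out), &ml); err != nil {
		return 0, fmt.Errorf("invalid etcdctl member list output: %w", err)
	}
	voters := 0
	for _, m := range ml.Members {
		if !m.IsLearner {
			voters++ // voting member: not a learner
		}
	}
	return voters, nil
}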

282-300: pcs status parsing: handle “Online: N nodes [ ... ]” variant robustly.

Current parsing works in most cases, but pcs can emit "Online: 2 nodes [ a b ]". Your trimming covers brackets, but tokenization may include count/“nodes”. Consider explicitly scanning within the brackets when present to avoid false negatives.
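
One way to handle both "Online: [ a b ]" and "Online: 2 nodes [ a b ]" is to scan only inside the brackets when they are present; an illustrative sketch, not the PR's parser:

// parseOnline returns node names from a pcs status "Online:" line.
func parseOnline(line string) []string {
	open := strings.IndexByte(line, '[')
	end := strings.IndexByte(line, ']')
	if open >= 0 && end > open {
		return strings.Fields(line[open+1 : end]) // names live inside the brackets
	}
	return strings.Fields(strings.TrimPrefix(line, "Online:"))
}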


28-36: Nit: clarify timeout names.

timeoutAfter reuses SetupJobCompletedTimeout. Consider renaming to timeoutAfterSetup for readability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6a2b9a4 and 0003e21.

📒 Files selected for processing (1)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)
🔇 Additional comments (2)
pkg/tnf/disruptivevalidate/runner.go (2)

211-221: Hostname-to-JobName derivation can be brittle across controller versions.

Assuming Pod name = "-" works for current Jobs, but formats can vary. If feasible, inject the job name via Downward API env var, or read ownerRefs of the local Pod (if RBAC allows) to avoid string heuristics.


136-147: Remove unsafe TTL fallback for never-seen Jobs
allowNeverSeenTTL can prematurely mark a non-existent Job as complete, bypassing sequencing safety. Only assume TTL-after-finish if the Job was observed at least once.
Apply:

pkg/tnf/disruptivevalidate/runner.go
@@ -142,7 +142,6 @@
                if seen {
                    klog.V(2).Infof("job %s disappeared after observation; assuming TTL after completion", byName)
                    return true, nil
                }
-                if allowNeverSeenTTL && time.Since(start) > appearanceGrace {
-                    klog.V(2).Infof("job %s not found for %s; assuming completed earlier and TTL-deleted", byName, appearanceGrace)
-                    return true, nil
-                }
                 return false, nil
  • Optional: to support a peer-already-TTL’d scenario, either disable TTLSecondsAfterFinished on disruptive-validate Jobs or switch to a durable completion signal.
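
One durable completion signal, as suggested above, is a marker object that outlives the Job's TTL; the ConfigMap name and namespace here are assumptions:

// markDone records completion in a ConfigMap that survives
// TTLSecondsAfterFinished garbage collection of the Job itself.
func markDone(ctx context.Context, kc kubernetes.Interface, node string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "tnf-validate-done-" + node,
			Namespace: "openshift-etcd",
		},
		Data: map[string]string{"completedAt": time.Now().UTC().Format(time.RFC3339)},
	}
	_, err := kc.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // idempotent on retries
	}
	return err
}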

@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 25, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-42808, which is related to https://issues.redhat.com/browse/ETCD-673

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@clobrano clobrano left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2025
Contributor

openshift-ci bot commented Sep 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, Neilhamza

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2025
@Neilhamza Neilhamza changed the title from “OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” to “[WIP] OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” Sep 25, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 3, 2025

@Neilhamza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-etcd-scaling 0003e21 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-two-node-fencing 0003e21 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-azure-ovn-etcd-scaling 0003e21 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 0003e21 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-serial 0003e21 link true /test e2e-aws-ovn-serial
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 0003e21 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws 0003e21 link false /test e2e-aws
ci/prow/e2e-metal-ipi-ovn-ipv6 0003e21 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-gcp-disruptive 0003e21 link false /test e2e-gcp-disruptive
ci/prow/e2e-metal-assisted 0003e21 link true /test e2e-metal-assisted
ci/prow/e2e-aws-etcd-recovery 0003e21 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-gcp-ovn-etcd-scaling 0003e21 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-aws-cpms 0003e21 link true /test e2e-aws-cpms
ci/prow/e2e-aws-etcd-certrotation 0003e21 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-aws-disruptive-ovn 0003e21 link false /test e2e-aws-disruptive-ovn
ci/prow/e2e-aws-disruptive 0003e21 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-disruptive-ovn 0003e21 link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-aws-ovn-serial-1of2 0003e21 link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-aws-ovn-serial-2of2 0003e21 link true /test e2e-aws-ovn-serial-2of2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

