
Conversation


@Neilhamza Neilhamza commented Sep 17, 2025

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute the lexicographic order of the two node names: the first (min) proceeds; the second (max) blocks. (A minimal sketch follows this list.)
  • The blocker polls the peer’s Disruptive-Validate Job and proceeds only after it succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.
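
A minimal Go sketch of this sequencing lock. The namespace, poll intervals, and the fencePeer helper are assumptions for illustration, not the PR's actual code:

import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const jobPrefix = "tnf-disruptive-validate-job-"

// runSequencedValidate derives the local node from the Job name, orders the
// two nodes lexicographically, and makes the "max" node wait for the "min"
// node's Job to finish before fencing its peer.
func runSequencedValidate(ctx context.Context, kc kubernetes.Interface, jobName, nodeA, nodeB string) error {
	local := strings.TrimPrefix(jobName, jobPrefix)
	peer := nodeA
	if local == nodeA {
		peer = nodeB
	}
	if local > peer { // lexicographic max: block on the peer's Job first
		err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
			func(ctx context.Context) (bool, error) {
				j, err := kc.BatchV1().Jobs("openshift-etcd").Get(ctx, jobPrefix+peer, metav1.GetOptions{})
				if err != nil {
					return false, nil // not found or transient error: keep polling
				}
				for _, c := range j.Status.Conditions {
					if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
						return false, fmt.Errorf("peer validate job %s failed", j.Name)
					}
				}
				return j.Status.Succeeded > 0, nil
			})
		if err != nil {
			return err
		}
	}
	return fencePeer(ctx, peer) // hypothetical helper wrapping `pcs stonith fence <peer>`
}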

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits; see the sketch after this list).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.
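
The label-based gating can be pictured as a counting wait. The label key/values and namespace below are assumptions, not the PR's actual labels:

// waitForLabeledJobs blocks until at least "need" Jobs matching the selector
// have succeeded; transient list errors just extend the poll.
func waitForLabeledJobs(ctx context.Context, kc kubernetes.Interface, selector string, need int) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 20*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			jobs, err := kc.BatchV1().Jobs("openshift-etcd").List(ctx, metav1.ListOptions{LabelSelector: selector})
			if err != nil {
				return false, nil // transient API error: keep polling
			}
			done := 0
			for _, j := range jobs.Items {
				if j.Status.Succeeded > 0 {
					done++
				}
			}
			return done >= need, nil
		})
}

// Gate for one disruptive-validate run (label values are made up):
//   waitForLabeledJobs(ctx, kc, "tnf-job=setup", 1)
//   waitForLabeledJobs(ctx, kc, "tnf-job=fencing", 1)
//   waitForLabeledJobs(ctx, kc, "tnf-job=after-setup", 2)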

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-42808, which is related to https://issues.redhat.com/browse/ETCD-673

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2025

coderabbitai bot commented Sep 17, 2025

Walkthrough

Adds an in-cluster disruptive validation runner and CLI subcommand; sets a default PCS pcmk_delay_base for the first fencing config; propagates a dynamic client through operator controllers; introduces a disruptive JobType; and applies minor formatting changes.

Changes

  • Disruptive validation runner & CLI (pkg/tnf/disruptivevalidate/runner.go, cmd/tnf-setup-runner/main.go): Adds RunDisruptiveValidate() implementing in-cluster client setup, job-waiting, fencing orchestration, health checks, and helper utilities; adds NewDisruptiveValidateCommand() and registers the disruptive-validate CLI subcommand.
  • PCS fencing tweaks (pkg/tnf/pkg/pcs/fencing.go): Moves addressRegEx to file scope, adds defaultPcmkDelayBase = "10s", and assigns pcmk_delay_base to the first fencing config entry; removes extraneous blank lines (formatting).
  • Operator controller signature changes (pkg/tnf/operator/starter.go): Propagates a dynamicClient by updating HandleDualReplicaClusters and adjusting the runExternalEtcdSupportController and runTnfResourceController signatures and call sites to accept the additional client parameter.
  • Job type enum / mapping (pkg/tnf/pkg/tools/jobs.go): Adds the JobTypeDisruptiveValidate constant and maps it to the "disruptive-validate" subcommand in GetSubCommand (a sketch of this wiring follows below).
  • Setup runner minor edit (pkg/tnf/setup/runner.go): Removes an extra blank line after the RunTnfSetup function signature (formatting only).
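
As a rough illustration of the jobs.go wiring described above. Only JobTypeDisruptiveValidate and the "disruptive-validate" subcommand come from the summary; the other enum values are inferred from the job names in the PR description:

type JobType int

const (
	JobTypeSetup JobType = iota
	JobTypeFencing
	JobTypeAfterSetup
	JobTypeAuth
	JobTypeDisruptiveValidate // added by this PR
)

// GetSubCommand maps a JobType to its tnf-setup-runner CLI subcommand.
func GetSubCommand(t JobType) string {
	switch t {
	case JobTypeSetup:
		return "setup"
	case JobTypeFencing:
		return "fencing"
	case JobTypeAfterSetup:
		return "after-setup"
	case JobTypeAuth:
		return "auth"
	case JobTypeDisruptiveValidate:
		return "disruptive-validate"
	default:
		return ""
	}
}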

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 10.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check ⚠️ Warning: The title includes a “[WIP]” tag, highlights only the disruptive fencing validation jobs while omitting related support changes such as the new CLI command and operator updates, and contains noise that detracts from a concise summary of the main change. Resolution: remove the “[WIP]” prefix and revise the title to clearly and concisely summarize the primary change (for example, adding TNF disruptive fencing validation commands and jobs) without extraneous tags.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: The provided description clearly outlines the new per-node disruptive validation jobs and how they are sequenced, matching the changes introduced across multiple files. It details the lock mechanism, orchestration flow, and merge-blocker references, all of which directly tie to the code modifications. This indicates a relevant, topic-aligned description that adequately informs reviewers of the PR's intent.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests:
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from clobrano and slintes September 17, 2025 13:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
pkg/tnf/pkg/pcs/fencing.go (6)

20-20: Regex is brittle; fails IPv6 and can over-match due to greediness.

Prefer URL parsing (after trimming the redfish+ prefix) or a tighter, anchored pattern.

Apply:

-var addressRegEx = regexp.MustCompile(`.*//(.*):(.*)(/redfish.*)`)
+// host may be hostname, IPv4, or bracketed IPv6; require an explicit port
+var addressRegEx = regexp.MustCompile(`^.+?//([^/:]+|\[[^\]]+\]):([0-9]+)(/redfish.*)$`)
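
For reference, the suggested pattern can be exercised quickly; the sample addresses below are made up:

package main

import (
	"fmt"
	"regexp"
)

var addressRegEx = regexp.MustCompile(`^.+?//([^/:]+|\[[^\]]+\]):([0-9]+)(/redfish.*)$`)

func main() {
	for _, addr := range []string{
		"redfish+https://192.168.111.1:8000/redfish/v1/Systems/1",
		"redfish+https://[fe80::1]:8000/redfish/v1/Systems/1", // bracketed IPv6 host
	} {
		if m := addressRegEx.FindStringSubmatch(addr); m != nil {
			fmt.Printf("host=%s port=%s path=%s\n", m[1], m[2], m[3])
		}
	}
}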

319-325: Substring match can produce false positives (e.g., node1 vs node10).

Tokenize or use boundaries.

-func isOnline(output, name string) bool {
-	short := name
-	if i := strings.IndexByte(name, '.'); i > 0 {
-		short = name[:i]
-	}
-	return strings.Contains(output, name) || strings.Contains(output, short)
-}
+func isOnline(output, name string) bool {
+	short := name
+	if i := strings.IndexByte(name, '.'); i > 0 {
+		short = name[:i]
+	}
+	split := func(r rune) bool { return r == ' ' || r == '\t' || r == '\n' || r == ',' || r == '[' || r == ']' }
+	toks := strings.FieldsFunc(output, split)
+	for _, t := range toks {
+		if strings.EqualFold(t, name) || strings.EqualFold(t, short) {
+			return true
+		}
+	}
+	return false
+}
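
The tokenized version is safer because each candidate is compared as a whole field, so "node1" no longer matches inside "node10", and EqualFold tolerates case differences in hostnames; splitting on brackets and commas also copes with pcs output like "Online: [ node1 node2 ]".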

327-344: Treat pcs status errors as transient during waits.

Current code aborts on first pcs status error; continue polling instead.

-		out, pcsErr := pcsNodesOnline(ctx)
-		if pcsErr != nil {
-			return fmt.Errorf("pcs status failed while waiting OFFLINE: %w", pcsErr)
-		}
+		out, pcsErr := pcsNodesOnline(ctx)
+		if pcsErr != nil {
+			klog.V(2).Infof("pcs status transient error: %v", pcsErr)
+			// fall through to sleep
+		} else if !isOnline(out, name) {
+			return nil
+		}
-		if !isOnline(out, name) {
-			return nil
-		}
And for the ONLINE wait:

-		out, pcsErr := pcsNodesOnline(ctx)
-		if pcsErr != nil {
-			return fmt.Errorf("pcs status failed while waiting ONLINE: %w", pcsErr)
-		}
-		if isOnline(out, name) {
-			return nil
-		}
+		out, pcsErr := pcsNodesOnline(ctx)
+		if pcsErr == nil && isOnline(out, name) {
+			return nil
+		}

Also applies to: 346-363


384-403: Health check aborts on first member error.

Consider counting healthy members and treating individual health probe errors as transient.


424-429: Log stdout on fencing failure for easier triage.

Also OK to keep quoting with %q.

-	_, stdErr, fenceErr := exec.Execute(ctx, cmd)
+	stdOut, stdErr, fenceErr := exec.Execute(ctx, cmd)
 	if fenceErr != nil {
-		klog.Error(fenceErr, "pcs stonith fence failed", "stderr", stdErr)
+		klog.Error(fenceErr, "pcs stonith fence failed", "stdout", stdOut, "stderr", stdErr)
 		return fmt.Errorf("pcs stonith fence %q failed: %w", target, fenceErr)
 	}

440-440: Nit: stray leading space in log message.

-	klog.Infof(" peer-only fencing validation passed for peer %q", target)
+	klog.Infof("peer-only fencing validation passed for peer %q", target)
pkg/tnf/setup/runner.go (4)

35-41: Cancel context on all exit paths.

Add a top-level defer cancel().

-	ctx, cancel := context.WithCancel(context.Background())
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()

52-57: Protobuf client config is created but not used.

Either use it for kubeClient or drop protoConfig.

-// This kube client use protobuf, do not use it for CR
-kubeClient, err := kubernetes.NewForConfig(clientConfig)
+// Use protobuf for core kube API (not for CRDs)
+kubeClient, err := kubernetes.NewForConfig(protoConfig)

75-81: Ensure informers stop on early returns too.

Close stopCh via defer and drop the extra closer goroutine.

-stopCh := make(chan struct{})
-go func() {
-	<-ctx.Done()
-	close(stopCh)
-}()
+stopCh := make(chan struct{})
+defer close(stopCh)

183-187: Parse the env flag robustly (case-insensitive, 1/true/yes).

-if os.Getenv("TNF_VALIDATE_PEER_ONLY") == "true" {
+if v, _ := strconv.ParseBool(os.Getenv("TNF_VALIDATE_PEER_ONLY")); v {

Remember to import strconv.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and 530c521.

📒 Files selected for processing (2)
  • pkg/tnf/pkg/pcs/fencing.go (2 hunks)
  • pkg/tnf/setup/runner.go (3 hunks)
🔇 Additional comments (2)
pkg/tnf/setup/runner.go (1)

91-98: Don't add defer ec.Close() — EtcdClient is an interface without Close().

The EtcdClient in pkg/etcdcli/interfaces.go is an interface and does not define Close(); calling ec.Close() in pkg/tnf/setup/runner.go will not compile. Either add Close() to the EtcdClient interface or ensure the concrete client is closed via its public API.

Likely an incorrect or invalid review comment.

pkg/tnf/pkg/pcs/fencing.go (1)

314-317: Incorrect — exec.Execute already invokes a shell (/bin/bash -c)

pkg/tnf/pkg/exec/exec.go builds hostCommand := []string{"/usr/bin/nsenter", "-a", "-t 1", "/bin/bash", "-c"} and calls exec.CommandContext(...), so shell operators like || will be interpreted; the suggested refactor is unnecessary.

Likely an incorrect or invalid review comment.

@Neilhamza Neilhamza changed the title from “[WIP] disruptive fencing validation TNF” to “[WIP] OCPEDGE-2176: disruptive fencing validation TNF” Sep 17, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 17, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 17, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

  • ValidateDisruptivePeerOnly flow to test fencing a peer node.
  • Adds fencing validation logic (pcs stonith + etcd quorum check).
  • Integrates into the TNF setup runner behind an environment gate (TNF_VALIDATE_PEER_ONLY).
  • Improves safety with a self-fence guard, explicit error handling, and pcs/etcd health waits.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Neilhamza
Author

/retest

@Neilhamza
Author

/retest-required

@openshift-ci-robot

openshift-ci-robot commented Sep 18, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces a new disruptive-validate job that runs on each control-plane node, ensuring fencing validation is performed sequentially (one node at a time). Adds preflight checks, pacemaker integration, and peer state monitoring to validate proper recovery after disruptive fencing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-auth-job- (per node): prepares node-local prerequisites used by later TNF stages (auth/env checks required for host tooling).
  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node:
      • pre-fence etcd check (2 voters),
      • PCS preflight (pcs present, pacemaker active, peer ONLINE),
      • pcs stonith fence → wait peer OFFLINE → ONLINE,
      • post-fence etcd check (2 voters).

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-auth-job- (per node): prepares node-local prerequisites used by later TNF stages (auth/env checks required for host tooling).
  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node:
      • pre-fence etcd check (2 voters),
      • PCS preflight (pcs present, pacemaker active, peer ONLINE),
      • pcs stonith fence → wait peer OFFLINE → ONLINE,
      • post-fence etcd check (2 voters).

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

how the jobs will look like:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Neilhamza Neilhamza changed the title from “[WIP] OCPEDGE-2176: disruptive fencing validation TNF” to “[WIP] OCPEDGE-2176: introducing two new disruptive fencing validation jobs for TNF” Sep 19, 2025
@Neilhamza Neilhamza changed the title from “[WIP] OCPEDGE-2176: introducing two new disruptive fencing validation jobs for TNF” to “OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” Sep 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2025
@Neilhamza
Author

/retest

@openshift-ci-robot

openshift-ci-robot commented Sep 20, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-61117

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@clobrano clobrano left a comment


I left some comments, but the direction seems OK to me

})
}

func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
Contributor


What is the reason to pass context and kubernetes.Interface if not used?

Author


It was for consistency and to be future-ready for any API usage, but I will remove it for now.

func detectLocalAndPeer(_ context.Context, _ kubernetes.Interface, n1, n2 string) (string, string, error) {
podName, err := os.Hostname()
if err != nil || strings.TrimSpace(podName) == "" {
return "", "", fmt.Errorf("get pod hostname: %w", err)
Contributor


If the err is nil, but strings.TrimSpace(podName) returns empty string, the resulting error message won't be very helpful

Author


True! Updated.

Comment on lines 175 to 191
return wait.PollUntilContextTimeout(ctx, poll, timeoutPeerJob, true, func(context.Context) (bool, error) {
j, err := kc.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, target, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
return false, nil
}
if err != nil {
return false, nil
}
klog.V(2).Infof("peer %s status: succeeded=%d failed=%d conditions=%+v", target, j.Status.Succeeded, j.Status.Failed, j.Status.Conditions)

// Only treat as failed if the JobFailed condition is set
if tools.IsConditionTrue(j.Status.Conditions, batchv1.JobFailed) {
return false, fmt.Errorf("peer validate job %s failed", target)
}
// Proceed when the peer is complete
return j.Status.Succeeded > 0 || tools.IsConditionTrue(j.Status.Conditions, batchv1.JobComplete), nil
})
Contributor


This looks very similar to waitForLabeledJob to me (well except the logging). Any chance we can label the validation job too, and reuse the same function?

Author


I already label the jobs just like we do with the other jobs. However, the label is not unique per node, so I had to find a way to detect this specific job for this node; for that case I merged the logic into one function. Labeling each node's job would require logic changes in other places, which could be fragile for the other jobs too, and I would rather not touch those. It should look better now.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/fencing.go (1)

64-66: Guard against overwriting explicit pcmk_delay_base values.

Right now Line 65 unconditionally sets pcmk_delay_base on the primary device. If the secret (or future config) ever supplies an explicit delay, we’ll stomp it while building the option map. Please fence the assignment so we only apply the default when the option is absent.

-		if i == 0 {
-			fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
-		}
+		if i == 0 {
+			if _, exists := fc.FencingDeviceOptions[PcmkDelayBase]; !exists {
+				fc.FencingDeviceOptions[PcmkDelayBase] = defaultPcmkDelayBase
+			}
+		}
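
For background: in two-node Pacemaker clusters, pcmk_delay_base staggers fencing so that both nodes do not shoot each other at the same instant during a fence race; setting it on only the first device gives one side a deterministic head start, which is also why the default should not clobber an explicitly configured delay.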
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 530c521 and fc08add.

📒 Files selected for processing (6)
  • cmd/tnf-setup-runner/main.go (3 hunks)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)
  • pkg/tnf/operator/starter.go (5 hunks)
  • pkg/tnf/pkg/pcs/fencing.go (1 hunks)
  • pkg/tnf/pkg/tools/jobs.go (2 hunks)
  • pkg/tnf/setup/runner.go (0 hunks)
💤 Files with no reviewable changes (1)
  • pkg/tnf/setup/runner.go
🔇 Additional comments (2)
pkg/tnf/pkg/tools/jobs.go (1)

29-43: Disruptive validate JobType wiring looks solid.

The new enum entry on Line 29 plus the subcommand branch on Lines 42-43 keep the naming helpers consistent; the downstream job controllers will pick up disruptive-validate without extra glue.

pkg/tnf/operator/starter.go (1)

74-109: Nice sequencing tie-in for disruptive-validate jobs.

Adding the per-node controller on Lines 106-109 alongside auth and after-setup keeps the new disruptive validation workflow under the existing job orchestration umbrella.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fc08add and 6a2b9a4.

📒 Files selected for processing (1)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
pkg/tnf/disruptivevalidate/runner.go (5)

99-107: Surface stderr from pcs on failure for actionable diagnostics.

pcs often writes errors to stderr; using only stdout may hide the real cause.

-	out, _, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
+	out, errOut, ferr := exec.Execute(ctx, fmt.Sprintf(`/usr/sbin/pcs stonith fence %s`, peer))
 	if ferr != nil {
-		ls := out
+		// Prefer stderr last line when available.
+		ls := errOut
+		if strings.TrimSpace(ls) == "" {
+			ls = out
+		}
 		if i := strings.LastIndex(ls, "\n"); i >= 0 && i+1 < len(ls) {
 			ls = ls[i+1:]
 		}
 		return fmt.Errorf("pcs fence %s failed: %w (last line: %s)", peer, ferr, strings.TrimSpace(ls))
 	}

191-200: Consider removing now-redundant allowNeverSeenTTL plumbing.

If you adopt the fix above, allowNeverSeenTTL becomes unnecessary. You can simplify by inlining waitForJobNamePeerTTL to call waitForJobName and drop the extra parameter from waitForJobs.

-func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
-	return waitForJobs(ctx, kc, name, "", 1, to, true) // allowNeverSeenTTL = true
-}
+func waitForJobNamePeerTTL(ctx context.Context, kc kubernetes.Interface, name string, to time.Duration) error {
+	// With strict semantics (never-seen ≠ completed), reuse the regular waiter.
+	return waitForJobName(ctx, kc, name, to)
+}

And (optional) remove the allowNeverSeenTTL argument from waitForJobs.


249-263: Tighten JSON unmarshal handling for etcdctl output.

Capture and log the unmarshal error instead of discarding it; helps when etcdctl prints transient HTML/errors.

-		if json.Unmarshal([]byte(out), &ml) != nil {
-			return false, nil
-		}
+		if err := json.Unmarshal([]byte(out), &ml); err != nil {
+			klog.V(3).Infof("invalid etcdctl member list output: %v", err)
+			return false, nil
+		}
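
For context, a minimal shape for counting voters from `etcdctl member list -w json`; the field names follow etcd's JSON output, but the struct is an assumption about the PR's ml type (imports "encoding/json" and "fmt" assumed):

type memberList struct {
	Members []struct {
		Name      string `json:"name"`
		IsLearner bool   `json:"isLearner"`
	} `json:"members"`
}

func countVoters(out string) (int, error) {
	var ml memberList
	if err := json.Unmarshal([]byte(out), &ml); err != nil {
		return 0, fmt.Errorf("invalid etcdctl member list output: %w", err)
	}
	voters := 0
	for _, m := range ml.Members {
		if !m.IsLearner {
			voters++ // voting member: not a learner
		}
	}
	return voters, nil
}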

282-300: pcs status parsing: handle “Online: N nodes [ ... ]” variant robustly.

Current parsing works in most cases, but pcs can emit "Online: 2 nodes [ a b ]". Your trimming covers brackets, but tokenization may include count/“nodes”. Consider explicitly scanning within the brackets when present to avoid false negatives.
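
One way to handle both "Online: [ a b ]" and "Online: 2 nodes [ a b ]" is to scan only inside the brackets when they are present; an illustrative sketch, not the PR's parser:

// parseOnline returns node names from a pcs status "Online:" line.
func parseOnline(line string) []string {
	open := strings.IndexByte(line, '[')
	end := strings.IndexByte(line, ']')
	if open >= 0 && end > open {
		return strings.Fields(line[open+1 : end]) // names live inside the brackets
	}
	return strings.Fields(strings.TrimPrefix(line, "Online:"))
}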


28-36: Nit: clarify timeout names.

timeoutAfter reuses SetupJobCompletedTimeout. Consider renaming to timeoutAfterSetup for readability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6a2b9a4 and 0003e21.

📒 Files selected for processing (1)
  • pkg/tnf/disruptivevalidate/runner.go (1 hunks)
🔇 Additional comments (2)
pkg/tnf/disruptivevalidate/runner.go (2)

211-221: Hostname-to-JobName derivation can be brittle across controller versions.

Assuming Pod name = "-" works for current Jobs, but formats can vary. If feasible, inject the job name via Downward API env var, or read ownerRefs of the local Pod (if RBAC allows) to avoid string heuristics.


136-147: Remove unsafe TTL fallback for never-seen Jobs
allowNeverSeenTTL can prematurely mark a non-existent Job as complete, bypassing sequencing safety. Only assume TTL-after-finish if the Job was observed at least once.
Apply:

pkg/tnf/disruptivevalidate/runner.go
@@ -142,7 +142,6 @@
                if seen {
                    klog.V(2).Infof("job %s disappeared after observation; assuming TTL after completion", byName)
                    return true, nil
                }
-                if allowNeverSeenTTL && time.Since(start) > appearanceGrace {
-                    klog.V(2).Infof("job %s not found for %s; assuming completed earlier and TTL-deleted", byName, appearanceGrace)
-                    return true, nil
-                }
                 return false, nil
  • Optional: to support a peer-already-TTL’d scenario, either disable TTLSecondsAfterFinished on disruptive-validate Jobs or switch to a durable completion signal.
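
One durable completion signal, as suggested above, is a marker object that outlives the Job's TTL; the ConfigMap name and namespace here are assumptions:

// markDone records completion in a ConfigMap that survives
// TTLSecondsAfterFinished garbage collection of the Job itself.
func markDone(ctx context.Context, kc kubernetes.Interface, node string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "tnf-validate-done-" + node,
			Namespace: "openshift-etcd",
		},
		Data: map[string]string{"completedAt": time.Now().UTC().Format(time.RFC3339)},
	}
	_, err := kc.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // idempotent on retries
	}
	return err
}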

@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 25, 2025

@Neilhamza: This pull request references OCPEDGE-2176 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

New jobs

  • tnf-disruptive-validate-job- (per node): runs the disruptive peer validation from that node (these two jobs run last, in order)

Lock mechanism (how both nodes get fenced safely)

  • Each validate pod derives its local node from its Job name: tnf-disruptive-validate-job-.
  • Compute lexicographic order of the two nodes. The first (min) proceeds; the second (max) blocks.
  • The blocker polls the peer’s Disruptive-Validate Job and only proceeds after it Succeeds (or fails fast if JobFailed).
  • Result: node A fences node B, then (after A completes) node B fences node A — never in parallel.

Orchestration (who waits on whom)

  • Operator schedules: Setup (1), Fencing (1), After-Setup (2: one per node), Auth (2), Disruptive-Validate (2).
  • Each Disruptive-Validate run gates on: Setup ≥1, Fencing ≥1, After-Setup ≥2 (label-based waits).
  • Then the sequencing lock (above) ensures only one node fences at a time; the second runs after the first completes.

CANNOT BE MERGED. BLOCKED BY: OCPBUGS-42808, which is related to https://issues.redhat.com/browse/ETCD-673

How the jobs will look:
Screenshot From 2025-09-20 01-46-58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@clobrano clobrano left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2025
Contributor

openshift-ci bot commented Sep 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, Neilhamza

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2025
@Neilhamza Neilhamza changed the title from “OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” to “[WIP] OCPEDGE-2176: two new disruptive fencing validation jobs for TNF” Sep 25, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 3, 2025

@Neilhamza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-etcd-scaling 0003e21 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-two-node-fencing 0003e21 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-azure-ovn-etcd-scaling 0003e21 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 0003e21 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-serial 0003e21 link true /test e2e-aws-ovn-serial
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 0003e21 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws 0003e21 link false /test e2e-aws
ci/prow/e2e-metal-ipi-ovn-ipv6 0003e21 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-gcp-disruptive 0003e21 link false /test e2e-gcp-disruptive
ci/prow/e2e-metal-assisted 0003e21 link true /test e2e-metal-assisted
ci/prow/e2e-aws-etcd-recovery 0003e21 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-gcp-ovn-etcd-scaling 0003e21 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-aws-cpms 0003e21 link true /test e2e-aws-cpms
ci/prow/e2e-aws-etcd-certrotation 0003e21 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-aws-disruptive-ovn 0003e21 link false /test e2e-aws-disruptive-ovn
ci/prow/e2e-aws-disruptive 0003e21 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-disruptive-ovn 0003e21 link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-aws-ovn-serial-1of2 0003e21 link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-aws-ovn-serial-2of2 0003e21 link true /test e2e-aws-ovn-serial-2of2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

