123517: roachtest: move node-kill operation to pkill/pgrep-based kill approach r=renatolabs a=itsbilal

For some reason, `StopServiceForVirtualCluster` fails with this error on drt clusters:

```
20:23:41 node_kill.go:51: operation status: killing node 1  with signal 15
20:23:41 cluster.go:2148: stoping virtual cluster
20:23:41 operation_impl.go:128: operation failure #1: no service for virtual cluster ""
```

The debug message is itself misleading: the virtual cluster is actually set to "system", not the empty string shown. The underlying problem appears to be that the service discovery process cannot locate the cockroach process from the DNS settings in the drt project. This change makes the node-kill operation DNS-agnostic: it signals the cockroach process directly with `pkill`, then polls with `pgrep` until the process has exited.

Epic: none

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
craig[bot] and itsbilal committed May 2, 2024
2 parents d20a00e + 9db386a commit 1f6e966
Showing 2 changed files with 23 additions and 7 deletions.
1 change: 0 additions & 1 deletion pkg/cmd/roachtest/operations/BUILD.bazel
@@ -21,7 +21,6 @@ go_library(
         "//pkg/cmd/roachtest/roachtestflags",
         "//pkg/cmd/roachtest/roachtestutil",
         "//pkg/roachprod",
-        "//pkg/roachprod/install",
         "//pkg/util/randutil",
     ],
 )
29 changes: 23 additions & 6 deletions pkg/cmd/roachtest/operations/node_kill.go
@@ -13,14 +13,14 @@ package operations
 import (
 	"context"
 	"fmt"
+	"strings"
 	"time"

 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/operation"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
 	"github.com/cockroachdb/cockroach/pkg/roachprod"
-	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
 	"github.com/cockroachdb/cockroach/pkg/util/randutil"
 )

@@ -79,11 +79,28 @@ func runNodeKill(
 	}
 	o.Status(fmt.Sprintf("killing node %s with signal %d", node.NodeIDsString(), signal))

-	stopOpts := option.StopVirtualClusterOpts(install.SystemInterfaceName, node)
-	stopOpts.RoachprodOpts.Sig = signal
-	stopOpts.RoachprodOpts.Wait = true
-	stopOpts.RoachprodOpts.MaxWait = 300 // 5 minutes
-	c.StopServiceForVirtualCluster(ctx, o.L(), stopOpts)
+	err := c.RunE(ctx, option.WithNodes(node), "pkill", fmt.Sprintf("-%d", signal), "-f", "cockroach\\ start")
+	if err != nil {
+		o.Fatal(err)
+	}
+	o.Status(fmt.Sprintf("sent signal %d to node %s, waiting for process to exit", signal, node.NodeIDsString()))
+
+	for {
+		if err := ctx.Err(); err != nil {
+			o.Fatal(err)
+		}
+		err := c.RunE(ctx, option.WithNodes(node), "pgrep", "-f", "cockroach\\ start")
+		if err != nil {
+			if strings.Contains(err.Error(), "status 1") {
+				// pgrep returns error code 1 if no processes are found.
+				break
+			}
+			o.Fatal(err)
+		}
+
+		time.Sleep(1 * time.Second)
+	}
+
 	o.Status(fmt.Sprintf("killed node %s with signal %d", node.NodeIDsString(), signal))

 	return &cleanupNodeKill{
