Unable to Shutdown Node Due to Shard Activity #8049

pengyongqiang666 · 2024-09-12T12:20:41Z

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elk-test
  namespace: default
status:
  availableNodes: 2
  conditions:
    - lastTransitionTime: '2024-09-12T11:30:24Z'
      message: Downscale in progress
      status: "False"
      type: ReconciliationComplete
    - lastTransitionTime: '2024-08-30T09:40:16Z'
      message: All nodes are running version 7.9.3
      status: 'True'
      type: RunningDesiredVersion
    - lastTransitionTime: '2024-08-30T09:40:54Z'
      message: Service default/elk-test-es-internal-http has endpoints
      status: 'True'
      type: ElasticsearchIsReachable
  health: green
  inProgressOperations:
    downscale:
      lastUpdatedTime: '2024-09-12T11:30:23Z'
      nodes:
        - name: cold-node-2
          shutdownStatus: IN_PROGRESS
    upgrade:
      lastUpdatedTime: '2024-09-06T01:03:30Z'
    upscale:
      lastUpdatedTime: '2024-08-30T09:40:16Z'
  observedGeneration: 8
  phase: ApplyingChanges
  version: 7.9.3
spec:
  nodeSets:
    - count: 2
      name: hot-node
      ...
    - count: 2
      name: cold-node
      ...
  version: 7.9.3

In our Elasticsearch cluster, the shutdown status of cold-node-2 stays at IN_PROGRESS, so the node is never taken offline.

The root cause of this issue lies in the following function:

func (h Health) HasShardActivity() bool {
	return h.TimedOut || // the health request timed out (events may still be pending)
		h.NumberOfInFlightFetch > 0 || // shard fetches are in flight
		h.InitializingShards > 0 || // shards are initializing
		h.RelocatingShards > 0 // shards are relocating
}

The field h.RelocatingShards is the total number of relocating shards across the entire cluster, not the number on a single node. So even after every shard on cold-node-2 has finished migrating, HasShardActivity keeps returning true for as long as any other node in the cluster is still relocating shards. The node shutdown is therefore delayed by shard activity elsewhere in the cluster.
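
To make the effect concrete, here is a minimal, self-contained sketch. The `Health` struct below is a trimmed-down stand-in that holds only the four fields the function reads (it is not the operator's full type), and the snapshot values in `main` are made up for illustration.

```go
package main

import "fmt"

// Health is a simplified stand-in with only the fields used by
// HasShardActivity. All counters are cluster-wide totals.
type Health struct {
	TimedOut              bool
	NumberOfInFlightFetch int
	InitializingShards    int
	RelocatingShards      int
}

// HasShardActivity mirrors the function quoted above.
func (h Health) HasShardActivity() bool {
	return h.TimedOut ||
		h.NumberOfInFlightFetch > 0 ||
		h.InitializingShards > 0 ||
		h.RelocatingShards > 0
}

func main() {
	// Assumed snapshot: cold-node-2 no longer holds any shards, but three
	// shards are relocating between other nodes (for example hot to cold).
	busyElsewhere := Health{RelocatingShards: 3}
	fmt.Println(busyElsewhere.HasShardActivity()) // true: downscale keeps waiting

	// Only when the whole cluster is quiet does the check pass.
	quiet := Health{}
	fmt.Println(quiet.HasShardActivity()) // false
}
```

Nothing in the struct says which node a relocating shard belongs to, so the check cannot tell cold-node-2 apart from the rest of the cluster.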

In certain scenarios, such as when hot nodes are always migrating data to cold nodes, the node intended for shutdown may never be taken offline, even after its own shards have finished relocating.

Question: Why can't the node be taken offline as soon as its own shard migration is complete?

I would like the node to be shut down once its shard data migration is finished, but this is not currently happening. Is there a specific reason why Elasticsearch doesn't allow for shutting down a node immediately after its own shard migration completes?
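
For illustration only, a node-scoped check could look roughly like the sketch below. The `ShardRouting` type, the `NodeHasShards` helper, and the node names are hypothetical and are not part of the operator. The sketch assumes a per-shard routing list is available, where each entry names the node holding the shard copy and, during a relocation, the other node involved.

```go
package main

import "fmt"

// ShardRouting is a hypothetical, simplified view of one shard copy.
type ShardRouting struct {
	Index          string
	Shard          int
	State          string // for example "STARTED" or "RELOCATING"
	Node           string // node currently holding this copy
	RelocatingNode string // other node in a relocation, empty otherwise
}

// NodeHasShards reports whether any shard copy is located on the named
// node or is relocating to or from it.
func NodeHasShards(node string, shards []ShardRouting) bool {
	for _, s := range shards {
		if s.Node == node || s.RelocatingNode == node {
			return true
		}
	}
	return false
}

func main() {
	// Assumed routing snapshot: relocation is still going on between other
	// nodes, but nothing references cold-node-2 any more.
	shards := []ShardRouting{
		{Index: "logs", Shard: 0, State: "RELOCATING", Node: "hot-node-0", RelocatingNode: "cold-node-1"},
		{Index: "logs", Shard: 1, State: "STARTED", Node: "cold-node-1"},
	}
	fmt.Println(NodeHasShards("cold-node-2", shards)) // false
	fmt.Println(NodeHasShards("cold-node-1", shards)) // true
}
```

Whether a per-node check like this would be safe is the open question here; the cluster-wide check may be deliberate.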

botelastic bot added the triage label Sep 12, 2024
