Unable to Shutdown Node Due to Shard Activity #8049

Open
pengyongqiang666 opened this issue Sep 12, 2024 · 0 comments

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elk-test
  namespace: default
status:
  availableNodes: 2
  conditions:
    - lastTransitionTime: '2024-09-12T11:30:24Z'
      message: Downscale in progress
      status: 'False'
      type: ReconciliationComplete
    - lastTransitionTime: '2024-08-30T09:40:16Z'
      message: All nodes are running version 7.9.3
      status: 'True'
      type: RunningDesiredVersion
    - lastTransitionTime: '2024-08-30T09:40:54Z'
      message: Service default/elk-test-es-internal-http has endpoints
      status: 'True'
      type: ElasticsearchIsReachable
  health: green
  inProgressOperations:
    downscale:
      lastUpdatedTime: '2024-09-12T11:30:23Z'
      nodes:
        - name: cold-node-2
          shutdownStatus: IN_PROGRESS
    upgrade:
      lastUpdatedTime: '2024-09-06T01:03:30Z'
    upscale:
      lastUpdatedTime: '2024-08-30T09:40:16Z'
  observedGeneration: 8
  phase: ApplyingChanges
  version: 7.9.3
spec:
  nodeSets:
    - count: 2
      name: hot-node
      ...
    - count: 2
      name: cold-node
      ...
  version: 7.9.3

In our Elasticsearch cluster, the shutdown status of cold-node-2 stays at IN_PROGRESS, so the node is never taken offline.

The root cause of this issue lies in the following function:

func (h Health) HasShardActivity() bool {
	return h.TimedOut || // make sure request did not time out (i.e. no pending events)
		h.NumberOfInFlightFetch > 0 || // no shards being fetched
		h.InitializingShards > 0 || // no shards initializing
		h.RelocatingShards > 0 // no shards relocating
}

h.RelocatingShards is the number of relocating shards across the entire cluster, not on a single node. So even after every shard on cold-node-2 has been migrated, HasShardActivity keeps returning true as long as any other node in the cluster is still relocating shards. The shutdown of cold-node-2 is therefore delayed by shard activity elsewhere in the cluster.

In certain scenarios, such as when hot nodes are always migrating data to cold nodes, the node intended for shutdown may never be taken offline, even after its own shards have finished relocating.
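
To make the difference concrete, here is a minimal sketch that contrasts the cluster-wide check quoted above with a per-node check. The Health struct is trimmed to the four fields that function reads. ShardCopy and NodeHasShardActivity are hypothetical names I made up for illustration and are not part of the operator's code, and the node names hot-node-0, cold-node-0 and cold-node-1 are just example values. It also ignores any other reason the operator may have for waiting on cluster-wide state.

```go
package main

import "fmt"

// Health is trimmed to the fields read by the function quoted above.
type Health struct {
	TimedOut              bool
	NumberOfInFlightFetch int
	InitializingShards    int
	RelocatingShards      int
}

// HasShardActivity is the cluster-wide check quoted above.
func (h Health) HasShardActivity() bool {
	return h.TimedOut ||
		h.NumberOfInFlightFetch > 0 ||
		h.InitializingShards > 0 ||
		h.RelocatingShards > 0
}

// ShardCopy is a hypothetical, simplified view of one shard copy.
type ShardCopy struct {
	Index        string
	Node         string // node currently holding the copy
	RelocatingTo string // relocation target, empty if the copy is not moving
}

// NodeHasShardActivity is a hypothetical per-node check: it reports whether
// the given node still holds a shard copy or is the target of a relocation.
func NodeHasShardActivity(node string, shards []ShardCopy) bool {
	for _, s := range shards {
		if s.Node == node || s.RelocatingTo == node {
			return true
		}
	}
	return false
}

func main() {
	// One shard is moving from a hot node to a cold node; cold-node-2 is already empty.
	health := Health{RelocatingShards: 1}
	shards := []ShardCopy{
		{Index: "logs-1", Node: "hot-node-0", RelocatingTo: "cold-node-0"},
		{Index: "logs-0", Node: "cold-node-1"},
	}

	fmt.Println(health.HasShardActivity())                   // true: blocks the downscale
	fmt.Println(NodeHasShardActivity("cold-node-2", shards)) // false: node is already drained
	fmt.Println(NodeHasShardActivity("cold-node-0", shards)) // true: relocation target
}
```

In this example the cluster-wide check reports activity because of the hot-to-cold relocation, while the per-node check shows that cold-node-2 neither holds nor receives any shard.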

Question: Why can't the node be taken offline as soon as its own shard migration is complete?

I would like the node to be shut down as soon as its own shards have been migrated, which is not what happens today. Is there a specific reason the shutdown has to wait for shard activity elsewhere in the cluster?

@botelastic botelastic bot added the triage label Sep 12, 2024