Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver reports connection errors for decommissioned nodes and a delay for cassandra-stress, resulted with: "Command did not complete within 12870 seconds!" #401

Open
2 tasks
yarongilor opened this issue Dec 19, 2024 · 0 comments

Comments

@yarongilor
Copy link

Packages

Scylla version: 6.3.0~dev-20241217.01cdba9a9894 with build-id f5cdbc08a2634f6f378e901fbb10a27fc164783e

Kernel Version: 6.8.0-1018-azure

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Run a 3 hours longevity on Azure.
The node longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 was decommissioned at 2:

disrupt_nodetool_seed_decommission	longevity-10gb-3h-master-db-node-413a3a9b-eastus-2	Succeeded	2024-12-18 01:54:18	2024-12-18 02:07:15
Nemesis Information
Class: Sisyphus
Name: disrupt_nodetool_seed_decommission
Status: Succeeded
Target Information
Name: longevity-10gb-3h-master-db-node-413a3a9b-eastus-2
Public IP: null
Private IP: 10.0.0.6
State: terminated
Shards: 7

About 2 hours later, 2 nemesis that issues a rolling restart of all cluster nodes are executed:

disrupt_rolling_restart_cluster	longevity-10gb-3h-master-db-node-413a3a9b-eastus-11	Succeeded	2024-12-18 03:55:11	2024-12-18 03:57:50
disrupt_rolling_config_change_internode_compression	longevity-10gb-3h-master-db-node-413a3a9b-eastus-4	Succeeded	2024-12-18 04:00:20	2024-12-18 04:08:15

Then c-s got errors for node-2 (that was decommissioned ~ 2 hours ago) like:

WARN  [cluster1-nio-worker-2] 2024-12-18 03:56:11,975 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
java.lang.AssertionError: null
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionFailure(ConvictionPolicy.java:101)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection.defunct(Connection.java:812)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1667)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1657)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed

Multiple connection errors (that might be expected) are reported for some other nodes as well, during these rolling restarts:

WARN  [cluster1-nio-worker-5] 2024-12-18 03:55:13,594 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
WARN  [cluster1-nio-worker-6] 2024-12-18 03:55:15,606 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
total,     769417033,   78364,   78364,   78364,    12.8,     7.8,    40.2,    62.0,    90.7,   148.6, 9965.0,  0.00209,      0,      0,       0,       0,       0,       0
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:19,611 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
...
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.9:9042] Connection has been closed
...
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:30,963 Connection.java:284 - Error creating netty channel to /10.0.0.9:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.9:9042
Caused by: java.net.ConnectException: Connection refused

Another unclear connection issues are found for node-10 that was decommissioned at 03:22 -

longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]
live time:
	2024-12-18 03:12:00	2024-12-18 03:20:08

node-10 removal:

< t:2024-12-18 03:19:34,951 f:cluster.py      l:1169 c:sdcm.cluster_azure   p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]: Set termination_event
< t:2024-12-18 03:20:08,988 f:cluster.py      l:1231 c:sdcm.cluster         p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14] destroyed

connection errors:

WARN  [cluster1-nio-worker-2] 2024-12-18 04:16:37,254 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-3] 2024-12-18 04:26:40,326 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-5] 2024-12-18 04:36:43,398 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)

node-10 private address is also reported long afterwards:

< t:2024-12-18 03:55:54,995 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.979+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0
< t:2024-12-18 03:55:55,016 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.980+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0 - Done

The cassandra-stress eventually failed the test for an error of:

2024-12-18 04:43:29.357: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=36f22416-2cde-4c0c-bccd-2deeac0e027f duration=3h34m30s: node=Node longevity-10gb-3h-master-loader-node-413a3a9b-eastus-1 [None | 10.0.0.11]
stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
errors:
Stress command execution failed with: Command did not complete within 12870 seconds!
Command: 'sudo  docker exec 6b4a56b5947dea0840cfcb21e1e974a93072cc15cb75c103041fb8075e6f6db5 /bin/sh -c \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=HITDHJBU91WD2JHM5LG4; cassandra-stress write no-warmup cl=QUORUM duration=180m -schema keyspace=keyspace1 \'"\'"\'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)\'"\'"\' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.9,10.0.0.10 -errors skip-unsupported-columns\''
Stdout:
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
INFO  [cluster1-reconnection-1] 2024-12-18 04:37:08,585 HostConnectionPool.java:200 - Using advanced port-based shard awareness with /10.0.0.8:9042
Stderr:
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-9 (null | 10.0.0.5) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-8 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-7 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-11 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-18T02-02-40 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 413a3a9b-fe7b-4e5e-b864-6f1f26628226
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 413a3a9b-fe7b-4e5e-b864-6f1f26628226

Logs:

Jenkins job URL
Argus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant