driver reports connection errors for decommissioned nodes and a delay for cassandra-stress, resulted with: "Command did not complete within 12870 seconds!"
#401 · Open · 2 tasks
yarongilor opened this issue on Dec 19, 2024 · 0 comments
cassandra-stress (c-s) then got errors for node-2 (which had been decommissioned about 2 hours earlier), such as:
WARN [cluster1-nio-worker-2] 2024-12-18 03:56:11,975 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
java.lang.AssertionError: null
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionFailure(ConvictionPolicy.java:101)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.driver.core.Connection.defunct(Connection.java:812)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1667)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1657)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
at com.datastax.shaded.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
Multiple connection errors (that might be expected) are reported for some other nodes as well, during these rolling restarts:
WARN [cluster1-nio-worker-5] 2024-12-18 03:55:13,594 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
WARN [cluster1-nio-worker-6] 2024-12-18 03:55:15,606 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
total, 769417033, 78364, 78364, 78364, 12.8, 7.8, 40.2, 62.0, 90.7, 148.6, 9965.0, 0.00209, 0, 0, 0, 0, 0, 0
WARN [cluster1-nio-worker-0] 2024-12-18 03:55:19,611 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
...
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.9:9042] Connection has been closed
...
WARN [cluster1-nio-worker-0] 2024-12-18 03:55:30,963 Connection.java:284 - Error creating netty channel to /10.0.0.9:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.9:9042
Caused by: java.net.ConnectException: Connection refused
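To separate the errors that are expected during a rolling restart from the ones that point at decommissioned nodes, it can help to tally the driver warnings per host. The following is a minimal sketch, not part of the test framework: the regular expressions are assumptions derived only from the log line shapes quoted above, and `count_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpts above.
LOG_LINES = [
    "WARN [cluster1-nio-worker-5] 2024-12-18 03:55:13,594 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042",
    "com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042",
    "WARN [cluster1-nio-worker-6] 2024-12-18 03:55:15,606 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042",
    "WARN [cluster1-nio-worker-0] 2024-12-18 03:55:19,611 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042",
    "com.datastax.driver.core.exceptions.TransportException: [/10.0.0.9:9042] Connection has been closed",
    "WARN [cluster1-nio-worker-0] 2024-12-18 03:55:30,963 Connection.java:284 - Error creating netty channel to /10.0.0.9:9042",
]

# Patterns assumed from the quoted log format only.
CHANNEL_ERROR = re.compile(r"Error creating netty channel to /(\d+\.\d+\.\d+\.\d+):\d+")
CONNECTION_CLOSED = re.compile(
    r"TransportException: \[/(\d+\.\d+\.\d+\.\d+):\d+\] Connection has been closed"
)


def count_errors(lines):
    """Count (host, error kind) pairs found in driver log lines."""
    counts = Counter()
    for line in lines:
        match = CHANNEL_ERROR.search(line)
        if match:
            counts[(match.group(1), "channel_error")] += 1
            continue
        match = CONNECTION_CLOSED.search(line)
        if match:
            counts[(match.group(1), "connection_closed")] += 1
    return counts


counts = count_errors(LOG_LINES)
for (host, kind), n in sorted(counts.items()):
    print(host, kind, n)
```

Run against the full loader log, a tally like this would show which addresses keep producing errors after the rolling restarts have finished.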
Other unclear connection issues were found for node-10, which was decommissioned at 03:22.
node-10 removal:
longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]
live time:
2024-12-18 03:12:00 2024-12-18 03:20:08
connection errors:
WARN [cluster1-nio-worker-2] 2024-12-18 04:16:37,254 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
...
WARN [cluster1-nio-worker-3] 2024-12-18 04:26:40,326 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
...
WARN [cluster1-nio-worker-5] 2024-12-18 04:36:43,398 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
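The three warnings quoted above for /10.0.0.14:9042 are evenly spaced, which a quick calculation makes visible. This is a sketch under stated assumptions: the timestamps are copied from the excerpt (other attempts may have been elided), the decommission time uses the 03:22 figure from the text with the date taken from the log lines, and `gaps_in_seconds` is a hypothetical helper.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps of the "No route to host" warnings quoted above.
WARNING_STAMPS = [
    "2024-12-18 04:16:37,254",
    "2024-12-18 04:26:40,326",
    "2024-12-18 04:36:43,398",
]

# Decommission time as stated in the text; the date is assumed from the log.
DECOMMISSIONED_AT = "2024-12-18 03:22:00,000"


def gaps_in_seconds(stamps):
    """Return the gaps, in seconds, between consecutive timestamps."""
    times = [datetime.strptime(s, LOG_TIME_FORMAT) for s in stamps]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(WARNING_STAMPS)
since_decommission = gaps_in_seconds([DECOMMISSIONED_AT, WARNING_STAMPS[-1]])[0]

print("gaps between warnings (s):", gaps)
print("seconds from decommission to last warning:", since_decommission)
```

Both gaps come out at roughly ten minutes, and the last quoted attempt is well over an hour after the decommission. That pattern may be consistent with the driver retrying the removed host on a capped reconnection schedule, but the report does not confirm this.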
node-10's private address was also reported long afterwards:
cassandra-stress eventually failed the test with this error:
2024-12-18 04:43:29.357: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=36f22416-2cde-4c0c-bccd-2deeac0e027f duration=3h34m30s: node=Node longevity-10gb-3h-master-loader-node-413a3a9b-eastus-1 [None | 10.0.0.11]
stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
errors:
Stress command execution failed with: Command did not complete within 12870 seconds!
Command: 'sudo docker exec 6b4a56b5947dea0840cfcb21e1e974a93072cc15cb75c103041fb8075e6f6db5 /bin/sh -c \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=HITDHJBU91WD2JHM5LG4; cassandra-stress write no-warmup cl=QUORUM duration=180m -schema keyspace=keyspace1 \'"\'"\'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)\'"\'"\' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.9,10.0.0.10 -errors skip-unsupported-columns\''
Stdout:
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
INFO [cluster1-reconnection-1] 2024-12-18 04:37:08,585 HostConnectionPool.java:200 - Using advanced port-based shard awareness with /10.0.0.8:9042
Stderr:
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
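The numbers in the event above can be cross-checked: the event `duration=3h34m30s` and the "12870 seconds" in the error message describe the same interval, while the stress command itself asked for `duration=180m`. The sketch below only does that arithmetic. `parse_duration` is a hypothetical helper, and whether the extra time is a fixed grace period added by the test framework is an assumption not stated in the report.

```python
import re


def parse_duration(text):
    """Convert a duration such as '3h34m30s' or '180m' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or not match:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


event_seconds = parse_duration("3h34m30s")   # duration field of the CRITICAL event
requested_seconds = parse_duration("180m")   # duration passed to cassandra-stress
overrun_seconds = event_seconds - requested_seconds

print(event_seconds, requested_seconds, overrun_seconds)
```

In other words, the command was stopped 34m30s after the requested 180 minutes had elapsed without cassandra-stress exiting on its own.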
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-18T02-02-40 (azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Packages
Scylla version: 6.3.0~dev-20241217.01cdba9a9894 with build-id f5cdbc08a2634f6f378e901fbb10a27fc164783e
Kernel Version: 6.8.0-1018-azure
Issue description
Describe your issue in detail and steps it took to produce it.
Run a 3-hour longevity test on Azure.
The node longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 was decommissioned at 2:
About 2 hours later, two nemeses that each issue a rolling restart of all cluster nodes were executed:
Installation details
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
Logs and commands
$ hydra investigate show-monitor 413a3a9b-fe7b-4e5e-b864-6f1f26628226
$ hydra investigate show-logs 413a3a9b-fe7b-4e5e-b864-6f1f26628226
Logs:
Jenkins job URL
Argus