We have been trying Medusa on a Cassandra cluster with 10 nodes and around 1 TB of data.
To perform the first backup of the cluster, we are running medusa backup-cluster --backup-name <name> --mode differential.
These backups go to an S3 bucket that does not contain any backups yet, so I assume that, although the mode is differential, the first run is essentially a full backup?
Nonetheless, the curious thing we found was that on the last nodes the backup was incomplete.
We ran the command on node 01 at 17:28 and it ran until 19:15. Checking the logs from the process, we noticed an error:
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
Traceback (most recent call last):
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
backup.execute(cql_session_provider)
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 150, in execute
self._upload_backup()
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 172, in _upload_backup
hosts_variables={})
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 83, in pssh_run
display_output(error)
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 26, in display_output
for line in host_out.stdout:
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/base/single.py", line 246, in read_output_buffer
for line in output_buffer:
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 301, in _read_output
_reader.get(timeout=timeout)
File "src/gevent/greenlet.py", line 805, in gevent._gevent_cgreenlet.Greenlet.get
File "src/gevent/greenlet.py", line 373, in gevent._gevent_cgreenlet.Greenlet._raise_exception
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/gevent/_compat.py", line 66, in reraise
raise value.with_traceback(tb)
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 324, in _read_output_to_buffer
size, data = channel.read_nonblocking(is_stderr=is_stderr)
File "ssh/channel.pyx", line 189, in ssh.channel.Channel.read_nonblocking
File "ssh/utils.pyx", line 94, in ssh.utils.handle_error_codes
ssh.exceptions.SSHError: (-1, b'Socket error: disconnected')
This error causes the cluster backup to crash and triggers the clean-up of the snapshots:
[2023-11-27 19:15:38,898] ERROR: Job executing "mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential" ran and finished with errors on following nodes...
[2023-11-27 19:15:38,898] DEBUG: Waiting for stdout reader
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
...
[2023-11-27 19:15:38,907] ERROR: Something went wrong! Attempting to clean snapshots and exit.
...
[2023-11-27 19:15:54,756] INFO: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u <user> -pw <password> clearsnapshot -t medusa-test-all-cluster-drk" ran and finished Successfully on all nodes.
[2023-11-27 19:15:54,757] INFO: All nodes successfully cleared their snapshot.
Meanwhile, on one of the nodes whose backup was incomplete, the logs show that the upload was still ongoing, and that it stopped because the snapshot was no longer there:
[2023-11-27 19:15:11,897] DEBUG: https://s3-eu-west-1.amazonaws.com:443 "PUT /sbk-medusa-prd/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/mc-455-big-TOC.txt HTTP/1.1" 200 0
...
[2023-11-27 19:20:21,347] ERROR: Error occurred during backup: awscli cp failed. Max attempts exceeded. Check /tmp/awscli_50571839-2d02-4c4d-8c52-afdad2e56732.output for more informations.
...
The user-provided path /var/lib/cassandra/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/snapshots/medusa-test-all-cluster-drk/mc-6305-big-Data.db does not exist.
I assume this happens because backing up all the data on all nodes takes a long time, and the SSH connection times out.
But what is this SSH connection?
Is it the first node that opens this connection to all the other nodes to execute this command: mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential?
Is the connection opened at the beginning and kept open until all nodes have finished the backup?
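Judging from the traceback (pssh_run in medusa/orchestration.py and the pssh client frames below it), the coordinating node uses the parallel-ssh library: it opens one SSH connection per node, starts the remote medusa-wrapper ... backup-node command on each, and keeps every connection open while reading output until all nodes finish. A minimal local simulation of that pattern, with subprocess standing in for SSH (node names and commands here are hypothetical, not Medusa's actual internals):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_node(node, command):
    # In Medusa this would be a long-lived SSH channel held open by
    # parallel-ssh for the whole duration of the remote backup-node run.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return node, result.returncode, result.stdout.strip()

def orchestrate(nodes, command):
    # One "connection" per node, all opened up front and all kept until
    # every node finishes -- mirroring the pssh_run pattern in the traceback.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = list(pool.map(lambda n: run_on_node(n, command), nodes))
    failed = [node for node, rc, _ in results if rc != 0]
    if failed:
        # Medusa reacts to any failure by clearing snapshots cluster-wide,
        # which is what cut off the still-uploading node in the logs above.
        raise RuntimeError(f"backup failed on: {failed}")
    return results

for node, rc, out in orchestrate(["node01", "node02"], "echo backup-node done"):
    print(node, rc, out)
```

If any of those per-node channels drops (the Socket error: disconnected above), the coordinator treats the whole job as failed even though the remote uploads are still running.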
As an update: this is clearly related to the SSH keep-alive.
It happens when performing the first backup of a high volume of data, or a restore.
In both cases, an SSH connection is established that has to stay open until the end. This happened on a cluster of 10 nodes with more than 1 TB of data.
Is there a recommended approach for keeping the Medusa SSH connection alive without constantly adjusting the SSH keep-alive interval settings on the node from which the command is executed?
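One server-side option (an assumption on my part — it depends on your sshd and on the parallel-ssh client answering the probes, so treat it as a sketch rather than a confirmed fix) is to have the Cassandra nodes' sshd send the keep-alive probes itself, so an idle-but-healthy connection is not dropped by sshd or an intermediate firewall during a long upload:

```shell
# /etc/ssh/sshd_config on each Cassandra node (example values, not a recommendation)
# Probe the client every 60 s; only give up after 60 unanswered probes (~1 h of silence).
ClientAliveInterval 60
ClientAliveCountMax 60
```

Reload sshd afterwards (e.g. systemctl reload sshd). This moves the keep-alive responsibility to the target nodes instead of the node the command is run from.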
lopesjp changed the title from "SSH Error when running first backup-cluster" to "SSH Timeout Error on first backup-cluster or restore" on Jan 3, 2024.