We have been trying Medusa on a Cassandra cluster with 10 nodes and around 1 TB of data.
To perform the first backup of the cluster, we are running medusa backup-cluster --backup-name <name> --mode differential.
These backups go to an S3 bucket that does not contain any backups yet, so I assume that, although the mode is differential, the first run is essentially a full backup?
Nonetheless, the curious thing we found was that on the last nodes the backup was incomplete.
We ran the command on node 01 at 17:28 and it ran until 19:15. Checking the logs from the process, we noticed an error:
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
Traceback (most recent call last):
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
backup.execute(cql_session_provider)
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 150, in execute
self._upload_backup()
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 172, in _upload_backup
hosts_variables={})
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 83, in pssh_run
display_output(error)
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 26, in display_output
for line in host_out.stdout:
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/base/single.py", line 246, in read_output_buffer
for line in output_buffer:
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 301, in _read_output
_reader.get(timeout=timeout)
File "src/gevent/greenlet.py", line 805, in gevent._gevent_cgreenlet.Greenlet.get
File "src/gevent/greenlet.py", line 373, in gevent._gevent_cgreenlet.Greenlet._raise_exception
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/gevent/_compat.py", line 66, in reraise
raise value.with_traceback(tb)
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 324, in _read_output_to_buffer
size, data = channel.read_nonblocking(is_stderr=is_stderr)
File "ssh/channel.pyx", line 189, in ssh.channel.Channel.read_nonblocking
File "ssh/utils.pyx", line 94, in ssh.utils.handle_error_codes
ssh.exceptions.SSHError: (-1, b'Socket error: disconnected')
This error causes the cluster backup to crash and triggers the clean-up of the snapshots:
[2023-11-27 19:15:38,898] ERROR: Job executing "mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential" ran and finished with errors on following nodes...
[2023-11-27 19:15:38,898] DEBUG: Waiting for stdout reader
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
...
[2023-11-27 19:15:38,907] ERROR: Something went wrong! Attempting to clean snapshots and exit.
...
[2023-11-27 19:15:54,756] INFO: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u <user> -pw <password> clearsnapshot -t medusa-test-all-cluster-drk" ran and finished Successfully on all nodes.
[2023-11-27 19:15:54,757] INFO: All nodes successfully cleared their snapshot.
Meanwhile, on one of the nodes whose backup was incomplete, the logs show that the upload was still ongoing, and that it stopped because the snapshot was no longer there:
[2023-11-27 19:15:11,897] DEBUG: https://s3-eu-west-1.amazonaws.com:443 "PUT /sbk-medusa-prd/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/mc-455-big-TOC.txt HTTP/1.1" 200 0
...
[2023-11-27 19:20:21,347] ERROR: Error occurred during backup: awscli cp failed. Max attempts exceeded. Check /tmp/awscli_50571839-2d02-4c4d-8c52-afdad2e56732.output for more informations.
...
The user-provided path /var/lib/cassandra/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/snapshots/medusa-test-all-cluster-drk/mc-6305-big-Data.db does not exist.
I assume this happens because backing up all the data on all nodes takes a long time, and the SSH connection times out.
But what is this SSH connection?
Is it the first node that opens this connection to all the other nodes to execute this command: mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential?
Is the connection opened at the beginning and kept open until all nodes have finished the backup?
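Judging from the traceback (pssh_run in medusa/orchestration.py and the pssh client frames below it), the coordinating node uses the parallel-ssh library: it opens one SSH connection per node, starts the remote medusa-wrapper ... backup-node command on each, and keeps every connection open while reading output until all nodes finish. A minimal local simulation of that pattern, with subprocess standing in for SSH (node names and commands here are hypothetical, not Medusa's actual internals):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_node(node, command):
    # In Medusa this would be a long-lived SSH channel held open by
    # parallel-ssh for the whole duration of the remote backup-node run.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return node, result.returncode, result.stdout.strip()

def orchestrate(nodes, command):
    # One "connection" per node, all opened up front and all kept until
    # every node finishes -- mirroring the pssh_run pattern in the traceback.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = list(pool.map(lambda n: run_on_node(n, command), nodes))
    failed = [node for node, rc, _ in results if rc != 0]
    if failed:
        # Medusa reacts to any failure by clearing snapshots cluster-wide,
        # which is what cut off the still-uploading node in the logs above.
        raise RuntimeError(f"backup failed on: {failed}")
    return results

for node, rc, out in orchestrate(["node01", "node02"], "echo backup-node done"):
    print(node, rc, out)
```

If any of those per-node channels drops (the Socket error: disconnected above), the coordinator treats the whole job as failed even though the remote uploads are still running.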
As an update: this is clearly related to the SSH keep-alive.
It happens when performing the first backup of a high volume of data, or a restore.
In both cases, an SSH connection is established that has to stay open until the end. This happened on a cluster of 10 nodes with more than 1 TB of data.
Is there a recommended approach for keeping the Medusa SSH connection alive without constantly adjusting the SSH keep-alive interval settings on the node from which the command is executed?
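One server-side option (an assumption on my part — it depends on your sshd and on the parallel-ssh client answering the probes, so treat it as a sketch rather than a confirmed fix) is to have the Cassandra nodes' sshd send the keep-alive probes itself, so an idle-but-healthy connection is not dropped by sshd or an intermediate firewall during a long upload:

```shell
# /etc/ssh/sshd_config on each Cassandra node (example values, not a recommendation)
# Probe the client every 60 s; only give up after 60 unanswered probes (~1 h of silence).
ClientAliveInterval 60
ClientAliveCountMax 60
```

Reload sshd afterwards (e.g. systemctl reload sshd). This moves the keep-alive responsibility to the target nodes instead of the node the command is run from.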
lopesjp changed the title from "SSH Error when running first backup-cluster" to "SSH Timeout Error on first backup-cluster or restore" on Jan 3, 2024.