Hardening relay rpc error handling #207

sameh-farouk · 2024-12-26T14:52:35Z

2024-12-26T12:00:15.068Z INFO  [rmb::events] started chain events listener
439x			
2024-12-26T12:00:13.594Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2024-12-26T12:00:13.516Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Connection timeout exceeded: 10s
2024-12-26T12:00:05.394Z ERROR [rmb::events] error listening to events Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required
2024-12-26T12:00:05.394Z INFO  [rmb::events] started chain events listener
8x			
2024-12-26T12:00:03.467Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
    failed to connect to substrate using the provided urls. Retrying in 2s...
Caused by:
2024-12-26T12:00:03.393Z WARN  [rmb_relay] Listener failed on attempt 2: failed to listen to chain events
2x			
2024-12-26T12:00:03.393Z ERROR [rmb::events] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2024-12-26T12:00:03.393Z ERROR [rmb::events] error listening to events Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required
2024-12-26T12:00:03.393Z INFO  [rmb::events] started chain events listener
542x			
2024-12-26T12:00:03.233Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)

From the logs, it seems the relay is retrying too quickly about 500 times per second.

More importantly, we need to investigate whether relays can end up in a state where they are not fully started or unable to handle WS messages, yet still accept connections.

tokio::spawn(async move {
        let max_retries = 9; // max wait is 2^9 = 512 seconds ( 5 minutes )
        let mut attempt = 0;
        let mut backoff = Duration::from_secs(1);
        let mut got_hit = false;

        loop {
            match l
                .listen(&mut got_hit)
                .await
                .context("failed to listen to chain events")
            {
                Ok(_) => break,
                Err(e) => {
                    if got_hit {
                        log::warn!("Listener got a hit, but failed to listen to chain events before no attempts will be reset");
                        got_hit = false;
                        attempt = 0;
                        backoff = Duration::from_secs(1);
                    }
                    attempt += 1;
                    if attempt > max_retries {
                        log::error!("Listener failed after {} attempts: {:?}", attempt - 1, e);
                        let _ = tx.send(());
                        break;
                    }
                    log::warn!(
                        "Listener failed on attempt {}: {:?}. Retrying in {:?}...",
                        attempt,
                        e,
                        backoff
                    );
                    tokio::time::sleep(backoff).await;
                    backoff *= 2;
                }
            }
        }
    });

    r.start(&args.listen).await.unwrap();

I believe the issue here is that the relay starts accepting messages even if it can't connect to the RPC node. After 5 minutes of receiving messages without successfully querying twin relays, it exits and is likely restarted by compose restart policy. This cycle continues, trapping messages and confusing the nodes, keeping accepting messages leading them to believe the messages will eventually be delivered.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hardening relay rpc error handling #207

Hardening relay rpc error handling #207

sameh-farouk commented Dec 26, 2024 •

edited

Loading

Hardening relay rpc error handling #207

Hardening relay rpc error handling #207

Comments

sameh-farouk commented Dec 26, 2024 • edited Loading

sameh-farouk commented Dec 26, 2024 •

edited

Loading