Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardening relay rpc error handling #207

Open
sameh-farouk opened this issue Dec 26, 2024 · 0 comments
Open

Hardening relay rpc error handling #207

sameh-farouk opened this issue Dec 26, 2024 · 0 comments

Comments

@sameh-farouk
Copy link
Member

sameh-farouk commented Dec 26, 2024

2024-12-26T12:00:15.068Z INFO  [rmb::events] started chain events listener
439x			
2024-12-26T12:00:13.594Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2024-12-26T12:00:13.516Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Connection timeout exceeded: 10s
2024-12-26T12:00:05.394Z ERROR [rmb::events] error listening to events Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required
2024-12-26T12:00:05.394Z INFO  [rmb::events] started chain events listener
8x			
2024-12-26T12:00:03.467Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
    failed to connect to substrate using the provided urls. Retrying in 2s...
Caused by:
2024-12-26T12:00:03.393Z WARN  [rmb_relay] Listener failed on attempt 2: failed to listen to chain events
2x			
2024-12-26T12:00:03.393Z ERROR [rmb::events] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2024-12-26T12:00:03.393Z ERROR [rmb::events] error listening to events Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required
2024-12-26T12:00:03.393Z INFO  [rmb::events] started chain events listener
542x			
2024-12-26T12:00:03.233Z ERROR [rmb::twin::substrate] failed to create substrate client with url "ws://tfchain-public-node:9944": Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)

From the logs, it seems the relay is retrying too quickly about 500 times per second.

More importantly, we need to investigate whether relays can end up in a state where they are not fully started or unable to handle WS messages, yet still accept connections.

tokio::spawn(async move {
        let max_retries = 9; // max wait is 2^9 = 512 seconds ( 5 minutes )
        let mut attempt = 0;
        let mut backoff = Duration::from_secs(1);
        let mut got_hit = false;

        loop {
            match l
                .listen(&mut got_hit)
                .await
                .context("failed to listen to chain events")
            {
                Ok(_) => break,
                Err(e) => {
                    if got_hit {
                        log::warn!("Listener got a hit, but failed to listen to chain events before no attempts will be reset");
                        got_hit = false;
                        attempt = 0;
                        backoff = Duration::from_secs(1);
                    }
                    attempt += 1;
                    if attempt > max_retries {
                        log::error!("Listener failed after {} attempts: {:?}", attempt - 1, e);
                        let _ = tx.send(());
                        break;
                    }
                    log::warn!(
                        "Listener failed on attempt {}: {:?}. Retrying in {:?}...",
                        attempt,
                        e,
                        backoff
                    );
                    tokio::time::sleep(backoff).await;
                    backoff *= 2;
                }
            }
        }
    });

    r.start(&args.listen).await.unwrap();

I believe the issue here is that the relay starts accepting messages even if it can't connect to the RPC node. After 5 minutes of receiving messages without successfully querying twin relays, it exits and is likely restarted by compose restart policy. This cycle continues, trapping messages and confusing the nodes, keeping accepting messages leading them to believe the messages will eventually be delivered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant