
rewrite for better detection of connection loss #179

Merged: 5 commits merged into main from rewrite-socket on Dec 12, 2023

Conversation

@muhamadazmy (Member) commented Dec 11, 2023

  • Mainly, this rewrites the socket layer to avoid blocking forever, by making sure timeouts are handled in the main loop.
  • Run cargo update to make sure we are using the latest crates.

let (ws, _) = match tokio_tungstenite::connect_async(&u).await {
    Ok(v) => v,
    Err(err) => {
-       log::trace!(
+       log::error!(
Member

I prefer to set this to debug, trace, or warn, since the new RMB release is built with redundancy in mind. That would hide these continuous connection attempts by default while keeping them visible at a more verbose log level; RMB should still run fine because there are other established relay connections (unless all of them are failing).
You still get the connection status from the info level: connecting, connected, and disconnected.

Spamming the logs with errors while the peer is fully functional is a bit annoying and confusing. warn would fit better here for connection and disconnection errors.

But I can understand your point if you would like to leave it as it is.

Member Author

I think it's better to show connection instability by default. This can give an early sign that a certain relay is in bad condition, and gives us a chance to fix it or drop it from the relay list.

If it's hidden at debug level (not everyone runs with debug flags), we can end up in a situation where a relay is failing (for any reason) and we have no indication that the peer is struggling to get a stable connection.

// weird, why would we receive a non-message
log::error!("socket closed!");
handler.abort();
return
Member

q: I'm not following what is supposed to happen here. Should we instead reconnect?

Member Author

This is a way to terminate the connection. recv() can only return None if the sender part of the mpsc channel is dropped, which can only happen if the Socket structure we hold is dropped. If that happens, it's totally fine to drop the entire connection and return.
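A minimal sketch of that pattern, with illustrative names (write_loop, rx, handler are placeholders, not the PR's actual identifiers): once every Sender clone of a tokio mpsc channel is dropped, recv() yields None, and that is treated as the signal to tear the connection down.

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

async fn write_loop(mut rx: mpsc::Receiver<Vec<u8>>, handler: JoinHandle<()>) {
    loop {
        match rx.recv().await {
            Some(msg) => {
                // forward `msg` to the websocket here
                let _ = msg;
            }
            None => {
                // every Sender (i.e. the owning Socket) was dropped, so nothing
                // will ever be sent again: stop the read task and exit
                log::error!("socket closed!");
                handler.abort();
                return;
            }
        }
    }
}
```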

if close.send(err).await.is_err() {
    log::error!("failed to notify of socket connection loss");
}
return;
Member

q: What is supposed to happen after returning here?

Member Author

This means the read routine failed to read the next message. We first notify the main loop that the read has failed (by sending the error on the close channel).

That send should never fail unless the receiver has exited completely, but in either case the read routine needs to exit.

The main loop will then reconnect, start a new read routine, and wait for new messages.
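A rough sketch of that flow under stated assumptions (the names close_tx, read_routine, and main_loop are illustrative, not the PR's actual identifiers): the read routine reports its error and exits, and the main loop reconnects when it sees the notification.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

async fn main_loop(url: String) {
    loop {
        // a fresh close channel per connection attempt; the real code would
        // (re)connect the websocket here before spawning the read routine
        let (close_tx, mut close_rx) = mpsc::channel::<anyhow::Error>(1);
        let reader = tokio::spawn(read_routine(url.clone(), close_tx));

        // block until the read routine reports a failure, then reconnect
        if let Some(err) = close_rx.recv().await {
            log::warn!("connection lost: {err:#}, reconnecting");
        }
        reader.abort();
        tokio::time::sleep(Duration::from_secs(2)).await; // illustrative backoff
    }
}

async fn read_routine(url: String, close: mpsc::Sender<anyhow::Error>) {
    // ... read messages until an error occurs ...
    let err = anyhow::anyhow!("read failed on {url}");
    if close.send(err).await.is_err() {
        log::error!("failed to notify of socket connection loss");
    }
    // whether or not the notification was delivered, the read routine exits here
}
```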

// receive timeout (on upstream message)
// we can then send a ping to keep the connection alive
log::debug!("sending a ping");
Message::Ping(Vec::default())
Member

q: Why are we only sending a Ping on Err? Is that because this is handled by the lib?

Member Author

We are not sending a ping on error; we are sending a ping on timeout. Wrapping recv() in timeout() wraps the return value of recv() in another Result. That outer result is an Err if the timeout fired before recv() completed, which means there has been some time (20 seconds) with no activity, and we can then send a ping.

In either case, if no message is received within a window of 40 seconds (2 pings), we assume the connection is dead.
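A minimal sketch of that wrapping, assuming tokio and tokio-tungstenite; the 20-second interval comes from the comment above, while the function and channel names are placeholders:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;
use tokio_tungstenite::tungstenite::Message;

const PING_EVERY: Duration = Duration::from_secs(20);

async fn next_outgoing(rx: &mut mpsc::Receiver<Message>) -> Option<Message> {
    match timeout(PING_EVERY, rx.recv()).await {
        // recv() completed in time with an upstream message
        Ok(Some(msg)) => Some(msg),
        // channel closed: the owning Socket was dropped, terminate
        Ok(None) => None,
        // the timeout fired before recv(): the connection has been idle,
        // so send a ping to keep it alive and probe for liveness
        Err(_) => {
            log::debug!("sending a ping");
            Some(Message::Ping(Vec::default()))
        }
    }
}
```

The dead-connection check (no traffic for roughly two ping intervals, i.e. 40 seconds) would then live in the caller, which compares the last-seen timestamp against that window.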

Member

I still haven't wrapped my head around it.
Should we then store the timestamp any time we receive a message (instead of only on receiving a pong), so we only ping on an idle connection (when we are not receiving messages from the relay)?

});

receiver
fn timestamp() -> u64 {
Member

Can we store an Instant instance instead?

  • You won't need to unwrap here.
  • It is more reliable.
  • You can call the elapsed method on it directly to know whether you have passed the timeout threshold.

Member Author

No, because I need to use an AtomicU64, since the value is modified and read from two different routines. If I had to use an Instant, I would have to use a Mutex, which is heavier for this use case.
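A sketch of the trade-off being described, with illustrative names (last_pong, now_secs) rather than the PR's actual identifiers: the pong handler stores a plain seconds value, the other routine reads it lock-free, whereas an Instant cannot be stored atomically and would need a Mutex.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

// seconds since the UNIX epoch (see the discussion of the unwrap below)
fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

#[tokio::main]
async fn main() {
    let last_pong = Arc::new(AtomicU64::new(now_secs()));

    // read routine: record the time whenever a pong arrives
    let writer = Arc::clone(&last_pong);
    tokio::spawn(async move {
        writer.store(now_secs(), Ordering::Relaxed);
    });

    // main loop: a lock-free load tells us how long the connection has been silent
    if now_secs().saturating_sub(last_pong.load(Ordering::Relaxed)) > 40 {
        log::warn!("no pong for more than 40 seconds, assuming the connection is dead");
    }
}
```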

Member

I understand that, but I'm not sure performance will be a practical issue here. The thing is, I previously thought this method was unlikely to fail, until I saw it randomly failing and panicking; I remember I switched to Instant for that reason.

Should we at least do some error handling / logging here to spot this when it happens?

Member Author

If you check the docs, duration_since can only fail if the given time is AFTER the instant it is called on (now, in this case), which is impossible since I use the EPOCH.

If suddenly the EPOCH were after now, I would rather crash.
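For reference, a sketch of what such a helper could look like (the PR's actual body may differ slightly); duration_since errors only when its argument is later than the SystemTime it is called on, so with UNIX_EPOCH it can only fail if the system clock is set before 1970:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

fn timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // can only fail if the system clock is set before the UNIX epoch;
        // crashing loudly here is preferred over silently using a bogus time
        .expect("system clock is set before the UNIX epoch")
        .as_secs()
}
```

Using expect rather than a bare unwrap keeps the crash-on-bad-clock behavior while making the failure easier to spot in logs, which partly addresses the logging concern raised above.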

sameh-farouk previously approved these changes Dec 12, 2023
@muhamadazmy muhamadazmy merged commit 15b8be8 into main Dec 12, 2023
1 check passed
@muhamadazmy muhamadazmy deleted the rewrite-socket branch December 12, 2023 15:28