
rewrite for better detection of connection loss #179

Merged: 5 commits merged into main from rewrite-socket on Dec 12, 2023

Conversation

@muhamadazmy (Member) commented Dec 11, 2023

  • Mainly, this rewrites the socket layer to avoid blocking forever, by making sure timeouts are handled in the main loop.
  • Run cargo update to make sure we are using the latest crates.

let (ws, _) = match tokio_tungstenite::connect_async(&u).await {
    Ok(v) => v,
    Err(err) => {
-       log::trace!(
+       log::error!(
Member

I prefer to set this to debug, trace, or warn, since the new RMB release is built with redundancy in mind. That would hide these continuous connection attempts by default while keeping them visible at a more verbose log level; RMB should still run fine because there are other established relay connections (unless all of them are failing).
You still get the connection status from the info level: connecting, connected, and disconnected.

Spamming the logs with errors while the peer is fully functional is a bit annoying and confusing. warn would fit better here for connection and disconnection errors.

But I can understand your point if you would like to leave it as it is.

Member Author

I think it's better to show connection instability by default. This can give an early sign that a certain relay is in bad condition, and gives us a chance to fix it or drop it from the relay list.

If it's hidden at debug level (not everyone runs with debug flags), we can end up in a situation where a relay is failing (for any reason) and we have no indication that the peer is struggling to get a stable connection.

// weird, why would we receive a non-message
log::error!("socket closed!");
handler.abort();
return
Member

q: I'm not following what is supposed to happen here. Should we instead reconnect?

Member Author

This is a way to terminate the connection. recv() can only return None if the sender part of the mpsc channel is dropped, which can only happen if the Socket structure we hold is dropped. If that happens, it's totally fine to drop the entire connection and return.
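A minimal sketch of that pattern, with illustrative names (write_loop, rx, handler are placeholders, not the PR's actual identifiers): once every Sender clone of a tokio mpsc channel is dropped, recv() yields None, and that is treated as the signal to tear the connection down.

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

async fn write_loop(mut rx: mpsc::Receiver<Vec<u8>>, handler: JoinHandle<()>) {
    loop {
        match rx.recv().await {
            Some(msg) => {
                // forward `msg` to the websocket here
                let _ = msg;
            }
            None => {
                // every Sender (i.e. the owning Socket) was dropped, so nothing
                // will ever be sent again: stop the read task and exit
                log::error!("socket closed!");
                handler.abort();
                return;
            }
        }
    }
}
```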

if close.send(err).await.is_err() {
    log::error!("failed to notify of socket connection loss");
}
return;
Member

q: What is supposed to happen after returning here?

Member Author

This means the read routine failed to read the next message. We first notify the main loop that the read has failed (by sending the error on the close channel).

That send should never fail unless the receiver has exited completely, but in either case the read routine needs to exit.

The main loop will then reconnect, start a new read routine, and wait for new messages.
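A rough sketch of that flow under stated assumptions (the names close_tx, read_routine, and main_loop are illustrative, not the PR's actual identifiers): the read routine reports its error and exits, and the main loop reconnects when it sees the notification.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

async fn main_loop(url: String) {
    loop {
        // a fresh close channel per connection attempt; the real code would
        // (re)connect the websocket here before spawning the read routine
        let (close_tx, mut close_rx) = mpsc::channel::<anyhow::Error>(1);
        let reader = tokio::spawn(read_routine(url.clone(), close_tx));

        // block until the read routine reports a failure, then reconnect
        if let Some(err) = close_rx.recv().await {
            log::warn!("connection lost: {err:#}, reconnecting");
        }
        reader.abort();
        tokio::time::sleep(Duration::from_secs(2)).await; // illustrative backoff
    }
}

async fn read_routine(url: String, close: mpsc::Sender<anyhow::Error>) {
    // ... read messages until an error occurs ...
    let err = anyhow::anyhow!("read failed on {url}");
    if close.send(err).await.is_err() {
        log::error!("failed to notify of socket connection loss");
    }
    // whether or not the notification was delivered, the read routine exits here
}
```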

// receive timeout (on upstream message)
// we can then send a ping to keep the connection alive
log::debug!("sending a ping");
Message::Ping(Vec::default())
Member

q: Why are we only sending a Ping on Err? Is that because this is handled by the lib?

Member Author

We are not sending a ping on error; we are sending a ping on timeout. Wrapping recv() in timeout() wraps the return value of recv() in another Result. That outer result is an Err if the timeout fired before recv() completed, which means there has been some time (20 seconds) with no activity, and we can then send a ping.

In either case, if no message is received within a window of 40 seconds (2 pings), we assume the connection is dead.
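A minimal sketch of that wrapping, assuming tokio and tokio-tungstenite; the 20-second interval comes from the comment above, while the function and channel names are placeholders:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;
use tokio_tungstenite::tungstenite::Message;

const PING_EVERY: Duration = Duration::from_secs(20);

async fn next_outgoing(rx: &mut mpsc::Receiver<Message>) -> Option<Message> {
    match timeout(PING_EVERY, rx.recv()).await {
        // recv() completed in time with an upstream message
        Ok(Some(msg)) => Some(msg),
        // channel closed: the owning Socket was dropped, terminate
        Ok(None) => None,
        // the timeout fired before recv(): the connection has been idle,
        // so send a ping to keep it alive and probe for liveness
        Err(_) => {
            log::debug!("sending a ping");
            Some(Message::Ping(Vec::default()))
        }
    }
}
```

The dead-connection check (no traffic for roughly two ping intervals, i.e. 40 seconds) would then live in the caller, which compares the last-seen timestamp against that window.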

Member

I still haven't wrapped my head around it.
Should we then store the timestamp any time we receive a message (instead of only on receiving a pong), so we only ping on an idle connection (when we are not receiving messages from the relay)?

});

receiver
fn timestamp() -> u64 {
Member

Can we store an Instant instance instead?

  • You won't need to unwrap here.
  • It is more reliable.
  • You can call the elapsed method on it directly to know whether you have passed the timeout threshold.

Member Author

No, because I need to use an AtomicU64, since the value is modified and read from two different routines. If I had to use an Instant, I would have to use a Mutex, which is heavier for this use case.
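A sketch of the trade-off being described, with illustrative names (last_pong, now_secs) rather than the PR's actual identifiers: the pong handler stores a plain seconds value, the other routine reads it lock-free, whereas an Instant cannot be stored atomically and would need a Mutex.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

// seconds since the UNIX epoch (see the discussion of the unwrap below)
fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

#[tokio::main]
async fn main() {
    let last_pong = Arc::new(AtomicU64::new(now_secs()));

    // read routine: record the time whenever a pong arrives
    let writer = Arc::clone(&last_pong);
    tokio::spawn(async move {
        writer.store(now_secs(), Ordering::Relaxed);
    });

    // main loop: a lock-free load tells us how long the connection has been silent
    if now_secs().saturating_sub(last_pong.load(Ordering::Relaxed)) > 40 {
        log::warn!("no pong for more than 40 seconds, assuming the connection is dead");
    }
}
```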

Member

I understand that, but I'm not sure performance will be a practical issue here. The thing is, I previously thought this method was unlikely to fail, until I saw it randomly failing and panicking; I remember I switched to Instant for that reason.

Should we at least do some error handling / logging here to spot this when it happens?

Member Author

If you check the docs, duration_since can only fail if the given time is AFTER the instant it is called on (now, in this case), which is impossible since I use the EPOCH.

If suddenly the EPOCH were after now, I would rather crash.
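For reference, a sketch of what such a helper could look like (the PR's actual body may differ slightly); duration_since errors only when its argument is later than the SystemTime it is called on, so with UNIX_EPOCH it can only fail if the system clock is set before 1970:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

fn timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // can only fail if the system clock is set before the UNIX epoch;
        // crashing loudly here is preferred over silently using a bogus time
        .expect("system clock is set before the UNIX epoch")
        .as_secs()
}
```

Using expect rather than a bare unwrap keeps the crash-on-bad-clock behavior while making the failure easier to spot in logs, which partly addresses the logging concern raised above.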

sameh-farouk previously approved these changes Dec 12, 2023
@muhamadazmy muhamadazmy merged commit 15b8be8 into main Dec 12, 2023
1 check passed
@muhamadazmy muhamadazmy deleted the rewrite-socket branch December 12, 2023 15:28