rewrite for better detection of connection loss #179
Conversation
let (ws, _) = match tokio_tungstenite::connect_async(&u).await {
    Ok(v) => v,
    Err(err) => {
-       log::trace!(
+       log::error!(
I prefer to set this to debug, trace, or warn, since the new rmb release is built with redundancy in mind. That would hide these continuous connection attempts by default while still making them visible at a more verbose log level, and RMB should keep running fine as long as other relay connections are established (unless all of them are failing).
You still get the connection status from the info level: connecting, connected, and disconnected.
Spamming the logs with errors while the peer is completely functioning is a bit annoying / confusing. warn would fit better here for connection and disconnection errors.
But I can understand your point if you would like to leave it as it is.
I think it's better to show connection instability by default. It can give an early sign that a certain relay is in bad condition, and gives us a chance to either fix it or drop it from the relay list.
If it's hidden at debug level (not everyone runs with debug flags), we can end up in a situation where a relay is failing (for any reason) and we have no indication that the peer is struggling to get a stable connection.
// weird why would we receive a non message
log::error!("socket closed!");
handler.abort();
return
q: I'm not following what should happen here. Should we instead reconnect?
This is a way to terminate the connection. This can only be None if the sender part of the mpsc channel is dropped, which can only happen if the Socket structure that we have is dropped. If that happens it's totally fine to drop the entire connection and return.
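A minimal sketch of that shutdown path, assuming a tokio mpsc channel whose sender half lives inside the Socket structure (names here are illustrative, not the actual rmb types):

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Hypothetical writer routine: forwards outgoing messages until the channel
// yields None, which can only happen once every Sender (held by the Socket)
// has been dropped.
async fn writer_loop(mut rx: mpsc::Receiver<Vec<u8>>, handler: JoinHandle<()>) {
    while let Some(msg) = rx.recv().await {
        // forward `msg` over the websocket here (omitted)
        let _ = msg;
    }
    // All senders are gone, i.e. the Socket was dropped: tear down the
    // companion reader task and let this routine return as well.
    log::error!("socket closed!");
    handler.abort();
}
```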
if close.send(err).await.is_err() {
    log::error!("failed to notify of socket connection loss");
}
return;
q: what is supposed to happen after returning here?
This means the read routine is failing to read the next message. We first notify the main loop that the read has failed (by sending the error on the close channel).
That send should never fail unless the receiving side (the main loop) has exited completely, but in either case the read routine needs to exit.
The main loop will then reconnect, start a new read routine, and wait for new messages.
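Roughly, the shape of that read routine looks like this (a sketch assuming `close` is a tokio mpsc sender whose receiving end is awaited by the main loop; stream and error types come from tokio-tungstenite):

```rust
use futures_util::StreamExt;
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::{Error, Message};

// Hypothetical read routine: pull frames until a read error occurs, then
// report the error on the close channel and exit. The main loop waits on the
// other end of `close`, reconnects, and spawns a fresh read routine.
async fn read_loop<S>(mut reader: S, close: mpsc::Sender<Error>)
where
    S: futures_util::Stream<Item = Result<Message, Error>> + Unpin,
{
    while let Some(frame) = reader.next().await {
        match frame {
            Ok(_msg) => { /* hand the message to the peer logic (omitted) */ }
            Err(err) => {
                // If this send fails, the main loop is already gone and there
                // is nothing left to notify; either way we stop reading.
                if close.send(err).await.is_err() {
                    log::error!("failed to notify of socket connection loss");
                }
                return;
            }
        }
    }
}
```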
// receive timeout (on upstream message)
// we can then send a ping to keep the connection alive
log::debug!("sending a ping");
Message::Ping(Vec::default())
q: why are we only sending a Ping on err? Is this handled by the lib?
We are not sending a ping on error, we are sending a ping on timeout. Wrapping the recv() in a timeout() wraps the return of recv() in another Result. That result is an error if the timeout fired before recv() produced anything, which means it has been some time (20 seconds) with no activity, and we can then send a ping.
In either case, if no message is received within a window of 40 seconds (2 pings), we assume the connection is dead.
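For reference, a minimal sketch of that timeout-then-ping flow using tokio::time::timeout (the channel and constant names are made up for the example):

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

const PING_INTERVAL: Duration = Duration::from_secs(20);

// Hypothetical helper: wait for the next upstream message; if nothing arrives
// within PING_INTERVAL, produce a Ping to keep the connection alive instead.
async fn next_outgoing(rx: &mut mpsc::Receiver<Message>) -> Option<Message> {
    match tokio::time::timeout(PING_INTERVAL, rx.recv()).await {
        // The inner Option is recv()'s own result (None = channel closed).
        Ok(maybe_msg) => maybe_msg,
        // The outer Err means the timeout fired before recv() yielded
        // anything, i.e. ~20 seconds of inactivity, so send a keep-alive.
        Err(_elapsed) => {
            log::debug!("sending a ping");
            Some(Message::Ping(Vec::default()))
        }
    }
}
```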
Still didn't wrap my head around it.
Should we then store the ts any time we receive a message (instead of only on receiving a pong), so we only ping on an idle connection (when we are not receiving messages from the relay)?
});

receiver

fn timestamp() -> u64 {
Can we store an Instant instance instead?
- you won't need to unwrap here
- it is more reliable
- you can call the elapsed method on it directly to know if you passed the timeout threshold
No, because I need to use an AtomicU64, since the value is modified/read from 2 different routines. If I had to use an Instant I would have to use a Mutex, which is heavier for this use case.
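A sketch of that AtomicU64 pattern, with a plain seconds-since-epoch timestamp shared between the two routines (the 40-second threshold mirrors the two-missed-pings rule discussed above; everything else is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

fn timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs()
}

fn main() {
    // Shared between the read routine (which writes the value) and the
    // keep-alive routine (which reads it), no Mutex required.
    let last_pong = Arc::new(AtomicU64::new(timestamp()));

    // Read routine: record the time whenever a pong arrives.
    last_pong.store(timestamp(), Ordering::Relaxed);

    // Keep-alive routine: if nothing was seen for ~40s (2 pings), give up.
    let idle = timestamp().saturating_sub(last_pong.load(Ordering::Relaxed));
    if idle > 40 {
        log::error!("no pong for {}s, assuming the connection is dead", idle);
    }
}
```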
I understand this, but I'm not sure performance would practically be an issue here.
The thing is, I previously thought this method was unlikely to fail, until I saw it randomly failing and panicking. I remember I switched to Instant for that reason.
Should we at least do some error handling / logging here to spot this when it happens?
If you check the docs, duration_since can only fail if the given time is AFTER the time instance (now, in this case), which is impossible since I use the EPOCH.
If suddenly the EPOCH is after now, I would rather crash.
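If a clearer panic message is still wanted for that (practically unreachable) case, one option is swapping the unwrap for an expect, something like:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

fn timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // Can only fail if the system clock reports a time before the unix
        // epoch; crashing loudly is preferable to running with a bogus clock.
        .expect("system clock is set before the unix epoch")
        .as_secs()
}
```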
cargo update
to make sure we are using the latest crates