-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade libp2p from 0.52.4 to 0.54.1 #6248
base: master
Are you sure you want to change the base?
Conversation
Looks like something called |
1d112cd
to
7903b11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing scary indeed. Thanks a lot for upgrading libp2p for us!
// Populate kad with both the legacy and the new protocol names. | ||
// Remove the legacy protocol: | ||
// https://github.com/paritytech/polkadot-sdk/issues/504 | ||
let kademlia_protocols = if let Some(legacy_protocol) = kademlia_legacy_protocol { | ||
vec![kademlia_protocol.clone(), legacy_protocol] | ||
} else { | ||
vec![kademlia_protocol.clone()] | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! We should finally do this)
substrate/client/network/src/protocol/notifications/behaviour.rs
Outdated
Show resolved
Hide resolved
/tip small |
Only members of paritytech/tip-bot-approvers have permission to request the creation of the tip referendum from the bot. However, you can create the tip referendum yourself using Polkassembly or PolkadotJS Apps. |
Looks like in your CI |
Looking at the test, it doesn't seem to do anything time consuming. And a longer timeout was not needed with old libp2p? |
It runs quickly on its own, but when running all network tests at once it suddenly takes much longer and always seems to finish later than most tests. Didn't debug why very deeply though, primarily because as comment states it is not really applicable to how it is actually used in Substrate, just test-specific behavior. Pumping to 5 minutes fixed the test and shouldn't be particularly flaky. Not sure about older version, didn't run tests in a loop to try and reproduce, but keep-alive behavior has certainly changed. |
# Conflicts: # Cargo.lock
I remember with previous |
Yes, that was the plan — I was going to do a versi burn-in once it is available after syncing refactoring testing. Last time the issues with libp2p upgrade popped a week after running the nodes, so we need to run a burn-in for at least a week or so. Unfortunately, the branch-off for |
I'll probably be backporting it then, so any feedback that can be collected until then would be helpful regardless |
return Poll::Ready(ToSwarm::RemoveListener { id }), | ||
Poll::Ready(event) => { | ||
return Poll::Ready(event.map_in(Either::Left).map_out(|_| { | ||
unreachable!("`GenerateEvent` is handled in a branch above; qed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Im a bit afraid we might be missing events in cases where they are generated a few hours / days after starting the node. Could we instead make this an error log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion is statically correct. event
is non-exhaustive, so they added .map_in()
and .map_out()
methods to exhaustively process it.
In this case .map_out()
's argument corresponds to ev
from ToSwarm::GenerateEvent(ev)
above and since we did match on ToSwarm::GenerateEvent(ev)
already, there is no way .map_out()
will ever be called, hence unreachable
.
|
||
config.set_replication_factor(kademlia_replication_factor); | ||
// Populate kad with both the legacy and the new protocol names. | ||
// Remove the legacy protocol: | ||
// https://github.com/paritytech/polkadot-sdk/issues/504 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should wait a bit and scrape the network. I believe everything should be fine, but let's double check first.
I guess we can merge #6109 before this one as it also handles legacy_protocol removal, but shouldn't matter too much.
I'll have a look this week and investigate versions, generally I think its safe for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I specifically used kademlia_legacy_protocol: _,
above instead of full removal of the property to reduce the inevitable conflict with #6109 whichever ends up being first.
SwarmConfig::with_executor(TokioExecutor(runtime)) | ||
// This is taken care of by notification protocols in non-test environment | ||
// It is very slow in test environment for some reason, hence larger timeout | ||
.with_idle_connection_timeout(Duration::from_secs(300)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dq: Was this working ok in the past?
It might indicate some issues with the debug build in terms of degrading libp2p performance? I don't believe we changed CI machines recently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked differently due to timeouts as described in libp2p/rust-libp2p#4306
I did not investigate deeply why this particular test takes so long to run with others sometimes (it runs quickly on its own though), someone should definitely look into it at some point. However this test only tests a single behavior in isolation, while the actual Substrate network notification protocol keeps connection alive, see substrate/client/network/src/protocol/notifications/handler.rs
.
)))) => "sync-notifications-clogged", | ||
Some(ConnectionError::Handler(Either::Left(Either::Left( | ||
Either::Right(Either::Left(_)), | ||
)))) => "ping-timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we don't propagate the ping-timeout
and sync-clogged
labels anymore, are they merged into actively-closed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect they were merged into Io
, ConnectionError::Handler
variant doesn't exist upstream anymore
@@ -29,28 +29,21 @@ use libp2p::{ | |||
}; | |||
use std::{sync::Arc, time::Duration}; | |||
|
|||
// TODO: Create a wrapper similar to upstream `BandwidthTransport` that tracks sent/received bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this blocked by prometheus-client
crate update? Would be good to have an issue in substrate to not forget about this, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, this tracker is ultimately used to display in/out bandwidth in the informer, which will be needed regardless of prometheus library used. Though we will be able to avoid this for metrics purposes with prometheus library upgrade indeed.
transport::build_transport( | ||
local_identity.clone().into(), | ||
config_mem, | ||
network_config.yamux_window_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should deprecate this CLI flag with the next litep2p release
dq: @nazar-pc Did you observe a significant improvement in terms of performance when switching to the autotuning impl? I know the synthetic benchmarks look solid, but for litep2p this did not translate into other improvements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC we saw some improvements, but nothing groundbreaking. Although our DSN networking stack works substantially differently from Substrate to draw any conclusions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As always @nazar-pc thanks for the outstanding work here! 🙏
Before merging this I would like to sort out the legacy KAD protocol deprecation, as it might possibly affect the discovery of nodes that have not updated for a while (cc #6109)
It would be good to have it running in our test stack versi or versi-net and have a collected report in terms of logs and a few dashboards (mostly interested in performance side and connection stability -- I remember that 0.52.4 had a few regressions) 🙏
As part of our testing and performance comparison we start 2 nodes side by side, it might not be ideal but I'm wondering if this will break the testing environment:
@nazar-pc we'll just need to provide a different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nazar-pc we'll just need to provide a different --listen-addr port for libp2p for the installed protocols and everything will work as before? (ie we can have libp2p and litep2p side by side or even 2 libp2p nodes without interferences)
Yes, exactly. We basically no longer have fallback to a random listening port if 2 libp2p instances are claiming the same one.
return Poll::Ready(ToSwarm::RemoveListener { id }), | ||
Poll::Ready(event) => { | ||
return Poll::Ready(event.map_in(Either::Left).map_out(|_| { | ||
unreachable!("`GenerateEvent` is handled in a branch above; qed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion is statically correct. event
is non-exhaustive, so they added .map_in()
and .map_out()
methods to exhaustively process it.
In this case .map_out()
's argument corresponds to ev
from ToSwarm::GenerateEvent(ev)
above and since we did match on ToSwarm::GenerateEvent(ev)
already, there is no way .map_out()
will ever be called, hence unreachable
.
|
||
config.set_replication_factor(kademlia_replication_factor); | ||
// Populate kad with both the legacy and the new protocol names. | ||
// Remove the legacy protocol: | ||
// https://github.com/paritytech/polkadot-sdk/issues/504 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I specifically used kademlia_legacy_protocol: _,
above instead of full removal of the property to reduce the inevitable conflict with #6109 whichever ends up being first.
SwarmConfig::with_executor(TokioExecutor(runtime)) | ||
// This is taken care of by notification protocols in non-test environment | ||
// It is very slow in test environment for some reason, hence larger timeout | ||
.with_idle_connection_timeout(Duration::from_secs(300)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked differently due to timeouts as described in libp2p/rust-libp2p#4306
I did not investigate deeply why this particular test takes so long to run with others sometimes (it runs quickly on its own though), someone should definitely look into it at some point. However this test only tests a single behavior in isolation, while the actual Substrate network notification protocol keeps connection alive, see substrate/client/network/src/protocol/notifications/handler.rs
.
transport::build_transport( | ||
local_identity.clone().into(), | ||
config_mem, | ||
network_config.yamux_window_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC we saw some improvements, but nothing groundbreaking. Although our DSN networking stack works substantially differently from Substrate to draw any conclusions here.
)))) => "sync-notifications-clogged", | ||
Some(ConnectionError::Handler(Either::Left(Either::Left( | ||
Either::Right(Either::Left(_)), | ||
)))) => "ping-timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect they were merged into Io
, ConnectionError::Handler
variant doesn't exist upstream anymore
@@ -29,28 +29,21 @@ use libp2p::{ | |||
}; | |||
use std::{sync::Arc, time::Duration}; | |||
|
|||
// TODO: Create a wrapper similar to upstream `BandwidthTransport` that tracks sent/received bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, this tracker is ultimately used to display in/out bandwidth in the informer, which will be needed regardless of prometheus library used. Though we will be able to avoid this for metrics purposes with prometheus library upgrade indeed.
# Conflicts: # Cargo.lock # substrate/client/authority-discovery/src/worker/tests.rs # substrate/client/network/src/event.rs # substrate/client/network/src/litep2p/discovery.rs # substrate/client/network/src/litep2p/service.rs # substrate/client/network/src/service.rs # substrate/client/network/src/service/traits.rs
Merged |
Thanks @nazar-pc . Versi burn-in revealed spamming by messages
|
The peer count (in)stability with libp2p upgraded is the same as in the network of nodes built from master commit merged into this PR, so it's "not libp2p fault". |
/cmd prdoc --help |
Command help:
|
Description
Fixes #5996
https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.53.0
https://github.com/libp2p/rust-libp2p/blob/master/CHANGELOG.md
Integration
Nothing special is needed, just note that
yamux_window_size
is no longer applicable to libp2p (litep2p seems to still have it though).Review Notes
There are a few simplifications and improvements done in libp2p 0.53 regarding swarm interface, I'll list a few key/applicable here.
libp2p/rust-libp2p#4788 removed
write_length_prefixed
function, so I inlined its code instead.libp2p/rust-libp2p#4120 introduced new
libp2p::SwarmBuilder
instead of now deprecatedlibp2p::swarm::SwarmBuilder
, the transition is straightforward and quite ergonomic (can be seen in tests).libp2p/rust-libp2p#4581 is the most annoying change I have seen that basically makes many enums
#[non_exhaustive]
. I mapped some, but those that couldn't be mapped I dealt with by printing log messages once they are hit (the best solution I could come up with, at least with stable Rust).libp2p/rust-libp2p#4306 makes connection close as soon as there are no handler using it, so I had to replace
KeepAlive::Until
with an explicit future that flips internal boolean after timeout, achieving the old behavior, though it should ideally be removed completely at some point.yamux_window_size
is no longer used by libp2p thanks to libp2p/rust-libp2p#4970 and generally Yamux should have a higher performance now.I have resolved and cleaned up all deprecations related to libp2p except
BandwidthSinks
. Libp2p deprecated it (though it is still present in 0.54.1, which is why I didn't handle it just yet). Ideally Substrate would finally switch to the official Prometheus client, in which case we'd get metrics for free. Otherwise a bit of code will need to be copy-pasted to maintain current behavior withBandwidthSinks
gone, which I left a TODO about.The biggest change in 0.54.0 is libp2p/rust-libp2p#4568 that changed transport APIs and enabled unconditional potential port reuse, which can lead to very confusing errors if running two Substrate nodes on the same machine without changing listening port explicitly.
Overall nothing scary here, but testing is always appreciated.
Checklist
T
required)Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd