Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

network/sync: Panic on unexpected generic response protocol #6581

Closed
lexnv opened this issue Nov 21, 2024 · 4 comments · Fixed by #6603
Closed

network/sync: Panic on unexpected generic response protocol #6581

lexnv opened this issue Nov 21, 2024 · 4 comments · Fixed by #6603
Assignees
Labels
I2-bug The node fails to follow expected behavior.

Comments

@lexnv
Copy link
Contributor

lexnv commented Nov 21, 2024

Kusama validator is panics on:

warn!(
target: LOG_TARGET,
"Unexpected generic response protocol {protocol_name}, strategy key \
{key:?}",
);
debug_assert!(false);
}

From the log line it looks like the sync engine no longer takes into account the legacy request-response protocol name.

Related PR:

Version deployed: version 1.16.1-ca8beaed148

Logs

Grafana link.

2024-11-21 13:37:36.937  WARN tokio-runtime-worker sync: Unexpected generic response protocol /ksmcc3/sync/2, strategy key StrategyKey("ChainSync")
====================
Version: 1.16.1-ca8beaed148
   0: sp_panic_handler::panic_hook
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:170:18
      sp_panic_handler::set::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:63:12
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/alloc/src/boxed.rs:2084:9
      std::panicking::rust_panic_with_hook
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:808:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:667:13
   3: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/sys/backtrace.rs:168:18
   4: rust_begin_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:665:5
   5: core::panicking::panic_fmt
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:74:14
   6: core::panicking::panic
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:148:5
   7: <sc_network_sync::strategy::chain_sync::ChainSync<B,Client> as sc_network_sync::strategy::SyncingStrategy<B>>::on_generic_response
   8: <sc_network_sync::strategy::polkadot::PolkadotSyncingStrategy<B,Client> as sc_network_sync::strategy::SyncingStrategy<B>>::on_generic_response
   9: sc_network_sync::engine::SyncingEngine<B,Client>::process_response_event
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/engine.rs:978:5
  10: sc_network_sync::engine::SyncingEngine<B,Client>::run::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/engine.rs:551:6
  11: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/service/src/task_manager/prometheus_future.rs:61:3
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:297:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once

cc @paritytech/networking

@lexnv lexnv added the I2-bug The node fails to follow expected behavior. label Nov 21, 2024
@lexnv lexnv added this to Networking Nov 21, 2024
@lexnv
Copy link
Contributor Author

lexnv commented Nov 21, 2024

Warp request protocols and state requests protocols are initializing their handler supporting 2 protocol names:

generate_protocol_name(genesis_hash, fork_id).into(),
std::iter::once(generate_legacy_protocol_name(protocol_id).into()).collect(),

They are supporting both /{}/{}/sync/2 (new genesis-based format) and /{}/sync/2 (legacy format).

We are only checking the new genesis-based format in chain sync

@dmitry-markin
Copy link
Contributor

We are only checking the new genesis-based format in chain sync

This is fine, as the main protocol name is always used by substrate to refer to the protocol. Legacy protocol name is only used on the wire.

@lexnv
Copy link
Contributor Author

lexnv commented Nov 22, 2024

Oki doki, thanks for the info 🙏

I believe in that case this is the proper fix:

Looking over the litep2p code, we always provide the negotiated protocol (regardless if it is fallback or main):

let handle1 =  ConfigBuilder::new(ProtocolName::from("/protocol/1/improved"))
            .with_max_size(1024usize)
            .with_fallback_names(vec![ProtocolName::from("/protocol/1")])
            .build();

let handle2 = handle2) = RequestResponseConfig::new(
        ProtocolName::from("/protocol/1") ...;
        
handle1
        .send_request(peer2, vec![1, 3, 3, 7], DialOptions::Reject);

handle2.next().await.unwrap(),
        RequestResponseEvent::RequestReceived {
            peer: peer1,
            fallback: None, ...):


handle2.send_response(request_id, vec![1, 3, 3, 8]);
    assert_eq!(
        handle1.next().await.unwrap(),
        RequestResponseEvent::ResponseReceived {
            peer: peer2,
            request_id,
            response: vec![1, 3, 3, 8],
            fallback: Some(ProtocolName::from("/protocol/1")),
        }
    );

Step 1: Initialization

  • handle1 initiated with main /protocol/1/improved and fallback /protocol/1
  • handle2 with main /protocol/1

Step 2: handle 1 -> handle 2 request

  • handle1 sends request to /protocol/1
  • handle2 sends response

Step3: handle 1 receives message for fallback

  • handle1 contains fallback /protocol/1

Then into substrate, we parse the received requests as follows:

Some(RequestResponseEvent::ResponseReceived { peer, request_id, fallback, response }) => {
self.on_inbound_response(peer, request_id, fallback, response);
},

And we forward the protocol fallback if any:

fallback.map_or_else(|| self.protocol.clone(), Into::into),

@dmitry-markin
Copy link
Contributor

This is fine, as the main protocol name is always used by substrate to refer to the protocol. Legacy protocol name is only used on the wire.

It turned out this is only the case for libp2p network backend. The state of things in libp2p/litep2p network backends is explained in detail in this comment: #6603 (comment)

github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
Request responses are initialized with a main protocol name, and
optional protocol names as a fallback.

Running litep2p in kusama as a validator has surfaced a `debug_asserts`
coming from the sync component:

https://github.com/paritytech/polkadot-sdk/blob/3906c578c96d97a8a099a4bdac4685acbe375a7c/substrate/client/network/sync/src/strategy/chain_sync.rs#L640-L646

The issue is that we initiate a request-response over the main protocol
name `/genesis/sync/2` but receive a response over the legacy procotol
`ksm/sync/2`. This behavior is correct because litep2p propagates to the
higher levels the protocol that responded.

In contrast, libp2p provides the main protocol name regardless of
negotiating a legacy protocol.

Because of this, higher level components assume that only the main
protocol name will respond.
To not break this assumption, this PR alings litep2p shim layer with the
libp2p behavior.


Closes: #6581

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
@github-project-automation github-project-automation bot moved this to Blocked ⛔️ in Networking Nov 27, 2024
github-actions bot pushed a commit that referenced this issue Nov 27, 2024
Request responses are initialized with a main protocol name, and
optional protocol names as a fallback.

Running litep2p in kusama as a validator has surfaced a `debug_asserts`
coming from the sync component:

https://github.com/paritytech/polkadot-sdk/blob/3906c578c96d97a8a099a4bdac4685acbe375a7c/substrate/client/network/sync/src/strategy/chain_sync.rs#L640-L646

The issue is that we initiate a request-response over the main protocol
name `/genesis/sync/2` but receive a response over the legacy procotol
`ksm/sync/2`. This behavior is correct because litep2p propagates to the
higher levels the protocol that responded.

In contrast, libp2p provides the main protocol name regardless of
negotiating a legacy protocol.

Because of this, higher level components assume that only the main
protocol name will respond.
To not break this assumption, this PR alings litep2p shim layer with the
libp2p behavior.

Closes: #6581

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
(cherry picked from commit 2a0b268)
github-actions bot pushed a commit that referenced this issue Nov 27, 2024
Request responses are initialized with a main protocol name, and
optional protocol names as a fallback.

Running litep2p in kusama as a validator has surfaced a `debug_asserts`
coming from the sync component:

https://github.com/paritytech/polkadot-sdk/blob/3906c578c96d97a8a099a4bdac4685acbe375a7c/substrate/client/network/sync/src/strategy/chain_sync.rs#L640-L646

The issue is that we initiate a request-response over the main protocol
name `/genesis/sync/2` but receive a response over the legacy procotol
`ksm/sync/2`. This behavior is correct because litep2p propagates to the
higher levels the protocol that responded.

In contrast, libp2p provides the main protocol name regardless of
negotiating a legacy protocol.

Because of this, higher level components assume that only the main
protocol name will respond.
To not break this assumption, this PR alings litep2p shim layer with the
libp2p behavior.

Closes: #6581

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
(cherry picked from commit 2a0b268)
github-actions bot pushed a commit that referenced this issue Nov 27, 2024
Request responses are initialized with a main protocol name, and
optional protocol names as a fallback.

Running litep2p in kusama as a validator has surfaced a `debug_asserts`
coming from the sync component:

https://github.com/paritytech/polkadot-sdk/blob/3906c578c96d97a8a099a4bdac4685acbe375a7c/substrate/client/network/sync/src/strategy/chain_sync.rs#L640-L646

The issue is that we initiate a request-response over the main protocol
name `/genesis/sync/2` but receive a response over the legacy procotol
`ksm/sync/2`. This behavior is correct because litep2p propagates to the
higher levels the protocol that responded.

In contrast, libp2p provides the main protocol name regardless of
negotiating a legacy protocol.

Because of this, higher level components assume that only the main
protocol name will respond.
To not break this assumption, this PR alings litep2p shim layer with the
libp2p behavior.

Closes: #6581

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
(cherry picked from commit 2a0b268)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior.
Projects
Status: Blocked ⛔️
Development

Successfully merging a pull request may close this issue.

2 participants