
Keepalive with Fibonacci Backoff, fault tolerant + randomised + concurrent orchestrator signing logic and more #18

Merged: 6 commits into sigma0-dev:main on Jan 27, 2024

Conversation

@RedaOps (Contributor) commented on Jan 25, 2024

Hi! This is a pretty interesting project. Read through the code and decided to help implement some stuff :)

Issues

Implemented Features

  • generate-committee used to fail if the provided directory didn't exist; it now creates it by default
  • Keepalive between the orchestrator and MPC nodes, with Fibonacci backoff for reconnection attempts
  • Updated RPC request log levels from info to debug. IMHO, the info level shouldn't be spammed with low-level request messages
  • Made the orchestrator signing logic concurrent, randomized (maybe this will be replaced with a better selection algorithm in the future?) and fault tolerant
  • Added a status method to the orchestrator JSON-RPC API which returns the alive/dead MPC nodes (identified by key identifier rather than address, since addresses could be private to the orchestrator)
  • Updated the installation instructions: it was not very clear that the circom and snarkjs binaries are required, so I reworded the README to make this clearer :)

Let's get into the exciting details!

MemberStatus

A new MemberStatus enum classifies the state of an MPC member inside the orchestrator (a rough sketch of the enum follows the list):

  • Online - The node is online and responsive
  • Disconnected - The node is unresponsive, but reconnection will be attempted with Fibonacci delays
  • Offline - The node is unresponsive and reconnection will not be attempted anymore. This happens either because we retried too many times while in the Disconnected state, or because the node generated a round 1 signing commitment but didn't manage to complete round 2, meaning it is very unreliable and may be running a different version than the other nodes.
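
For reference, here is a minimal sketch of what the enum might look like. The Disconnected payload (next-retry unix timestamp plus retry counter) is an assumption based on the review discussion further down; the real definition may differ.

pub enum MemberStatus {
    /// The node is online and responsive.
    Online,
    /// The node is unresponsive; carries the unix timestamp of the next
    /// reconnection attempt and the number of retries so far.
    Disconnected((u64, u64)),
    /// The node is unresponsive and will not be contacted again.
    Offline,
}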

The status of each node is held inside a RwLock<MemberStatusState> structure, which is passed into the jsonrpsee context.

status jsonrpc method

Pretty self-explanatory; this is what the status method returns:

curl -X POST http://127.0.0.1:8890 -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":"test","method":"status","params":[11]}'

{"jsonrpc":"2.0","result":{"online_members":["0000000000000000000000000000000000000000000000000000000000000001"],"offline_members":["0000000000000000000000000000000000000000000000000000000000000002","0000000000000000000000000000000000000000000000000000000000000003"]},"id":"test"}

Fibonacci backoff keepalive

When the orchestrator starts, it pings the newly implemented ping endpoint of each MPC member to determine whether it is alive, and it keeps querying that endpoint every KEEPALIVE_WAIT_SECONDS seconds (defined in constants.rs). It only pings Online or Disconnected nodes and updates their status accordingly. The Fibonacci backoff means that if a node doesn't respond (i.e. it is in the Disconnected state), the orchestrator retries after 5, 8, 13, 21, etc. seconds until it reaches KEEPALIVE_MAX_RETRIES retries, at which point it classifies the node as Offline.
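
To illustrate the schedule, here is a minimal sketch of the delay computation; fibonacci_delay_secs is a hypothetical helper name, not the actual code.

fn fibonacci_delay_secs(retry: u64) -> u64 {
    // retry 0 -> 5s, 1 -> 8s, 2 -> 13s, 3 -> 21s, 4 -> 34s, ...
    let (mut a, mut b) = (3u64, 5u64);
    for _ in 0..retry {
        let next = a + b;
        a = b;
        b = next;
    }
    b
}

The resulting waits (5, 8, 13, 21, 34, ... seconds) line up with the delays visible in the keepalive log excerpt in Test 2 below.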

New concurrent fault-tolerant signing logic

This is how the new logic works, and it is pretty fault tolerant (a rough sketch of the retry loop follows the list):

  1. The orchestrator generates a list of Online nodes and shuffles them.
  2. If the number of available nodes is below the required threshold, the orchestrator returns a not enough available signers error.
  3. The orchestrator picks the first n nodes required to meet the threshold.
  4. It then does the first signing round concurrently.
  5. When checking the results, if any node has not returned a commitment, it classifies that node as Disconnected and jumps back to step 1. This way, it will try again with a new set of nodes.
  6. Otherwise, it then does the second signing round concurrently.
  7. When checking the results, if any node has not returned a signature, it classifies that node as Offline and jumps back to step 1.
  8. Otherwise, it generates the transaction and returns Ok.
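
To make the flow concrete, here is a compact, synchronous sketch of the retry loop with stubbed helpers. All names here (Member, round_1, round_2, sign) are illustrative rather than the actual orchestrator API, and the real implementation runs the rounds concurrently over JSON-RPC.

use rand::seq::SliceRandom;

#[derive(Clone)]
struct Member { id: u64 }

// Stubbed rounds; the real code sends requests to each MPC node.
fn round_1(_m: &Member) -> Result<String, String> { Ok("commitment".into()) }
fn round_2(_m: &Member) -> Result<String, String> { Ok("signature".into()) }

fn sign(mut online: Vec<Member>, threshold: usize) -> Result<Vec<String>, String> {
    'retry: loop {
        // Steps 1-3: shuffle the Online members and take a threshold-sized subset.
        online.shuffle(&mut rand::thread_rng());
        if online.len() < threshold {
            return Err("not enough available signers".into());
        }
        let signers: Vec<Member> = online[..threshold].to_vec();

        // Steps 4-5: round 1; a node that fails is dropped (Disconnected) and we retry.
        for m in &signers {
            if round_1(m).is_err() {
                online.retain(|x| x.id != m.id);
                continue 'retry;
            }
        }

        // Steps 6-7: round 2; a node that fails is dropped (Offline) and we retry.
        let mut signatures = Vec::new();
        for m in &signers {
            match round_2(m) {
                Ok(sig) => signatures.push(sig),
                Err(_) => {
                    online.retain(|x| x.id != m.id);
                    continue 'retry;
                }
            }
        }

        // Step 8: build the transaction from the signatures and return Ok.
        return Ok(signatures);
    }
}

Note how every retry re-checks the threshold against the shrunken candidate set, which is exactly what produces the "not enough available signers" error seen in Tests 1 and 2 below.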

Testing

I have tested every new feature, but I highly encourage you to test it out more!

To test the fault tolerance, I created some new committee keys and spun up my own orchestrator and MPC nodes with a new ZKBITCOIN address. These are the tests I performed and their results:

Test 1

I tried "using" a zkapp while only 1/3 MPC nodes were online. I have received the following error, which is expected:

Error: error while sending request to orchestrator

Caused by:
    0: bob request failed
    1: RPC error response: RpcError { code: -32001, message: "error while unlocking funds", data: Some(RawValue("the request didn't validate: not enough available signers")) }

Test 2

I spun up 2/3 nodes, but with a catch: one of the running nodes panics when asked for round 2:

async fn round_2_signing(
    params: Params<'static>,
    context: Arc<NodeState>,
) -> RpcResult<Round2Response> {
    panic!("Panic for fun"); // injected panic for testing
    // get commitments from params
    let round2request: [Round2Request; 1] = params.parse()?;
    let round2request = &round2request[0];
    info!("received request: {:?}", round2request);
    // ...

The orchestrator didn't crash and simply returned the same error as in test 1. This is because it ran the signing logic, the faulty node failed round 2 and was classified as Offline, and on the retry the threshold check only saw 1/3 available nodes, so it returned the error.

[2024-01-25T01:33:07Z WARN  zkbitcoin::committee::orchestrator] Round 2 error with http://127.0.0.1:8891, marking as offline and retrying from round 1: error sending request for url (http://127.0.0.1:8891/): connection closed before message completed
[2024-01-25T01:33:09Z WARN  zkbitcoin::committee::orchestrator] http://127.0.0.1:8893 is offline. Re-trying in 34 seconds... (4/10)

Test 3 - the most important test

I also ran test 2 but with 3/3 nodes running, where only one of them crashes when asked for a round 1 commitment. What happened is that it actually did randomly select the faulty node, but when it didn't receive a commitment from it, it classified that node as Disconnected, then tried again with the other 2 nodes and successfully generated a signature and submitted the spend transaction. This is great fault tolerance!

Test 4

I spun up 2/3 nodes with no catches. This should work in this scenario, and it did.

Here is my ZKBITCOIN address: https://mempool.space/testnet/address/tb1przdyzca6zxlykmas4tvdum8qtac3m0ppsfm9p8akfckqkwnw07xs2u9cwf
There are 4 transactions: 2 deploys, one spend from test 3, and one spend from test 4.

If you have something to add or any suggestions on what I implemented I would love to hear them!

@mimoo (Contributor) commented on Jan 27, 2024

Looking through the PR now! This is amazing already so a big thank-you again for this PR :D

also for testing crashes and things like that, do you know https://crates.io/crates/fail? Maybe it would be a good idea to create some tests with it (with #20)

@RedaOps (Contributor, Author) commented on Jan 27, 2024

Yeah, that's definitely something we should look into. I can do some research and open another PR once I find a good solution. Then we can even include the e2e and the normal tests in the CI pipeline.

@mimoo (Contributor) left a review:

beautiful!

-async fn is_alive(_params: Params<'static>, _context: Arc<NodeState>) -> String {
-    "hello".to_string()
+async fn is_alive(params: Params<'static>, _context: Arc<NodeState>) -> RpcResult<u64> {
+    Ok(params.parse::<[u64; 1]>()?[0].clone())
mimoo (Contributor):

is this actually best practice in order to ensure it's not a cached response or smthg?

RedaOps (Author):

Yup, a lot of APIs use the current timestamp in ping keepalive requests, so yeah it can be considered a standard. In this case it's more of a sanity check, but just in case the node has a weird caching reverse proxy set up, this is good to have.

.iter()
.map(|(_, member)| async {
    let rpc_ctx =
        RpcCtx::new(Some("2.0"), None, Some(member.address.clone()), None, None);
mimoo (Contributor):

note that default timeout atm is 10s, which is quite a lot :D

warn!("Round 1 error with {}, marking as disconnected and retrying round 1: {rpc_error}", member.address);
let mut ms_w = member_status.write().unwrap();
ms_w.mark_as_disconnected(member_id);
continue 'retry;
mimoo (Contributor):

we actually don't have to restart the whole process at this point, we can also ask more people (including the disconnected peer perhaps), but I think this is good enough and simpler!

RedaOps (Author):

You're right! We can definitely do that. Right now it's fine, because even with something like 5-10 MPC nodes it will be fast enough, but once there are more nodes, repeating the requests that already succeeded becomes redundant.

};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let resp: Round1Response = response.result()?;
mimoo (Contributor):

this deserialization issue will prevent the process from continuing. I think it'd be nice to treat these as a disconnected member as well (perhaps with a different log msg)

RedaOps (Author):

You are correct! Also, maybe we can use custom structs for the response, both here and for the other MPC node responses.

};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let round2_response: Round2Response = response.result()?;
mimoo (Contributor):

same issue here, I think we should retry instead of returning an error


// return the signed transaction
return Ok(BobResponse {
    unlocked_tx: transaction,
mimoo (Contributor):

note that we could return the witness instead and have bob include it in the signature by themselves (it would be a smaller response)

@@ -36,12 +40,235 @@ use super::node::{Round2Request, Round2Response};
// Orchestration logic
//

type RpcOrchestratorContext = Arc<(Orchestrator, Arc<RwLock<MemberStatusState>>)>;
mimoo (Contributor):

maybe cleaner if Orchestrator has a field with this RwLock inside

RedaOps (Author):

Tried to do that but ran into issues with jsonrpsee's context automatically wrapping the passed struct in an Arc. I tried to avoid double Arcs, but it wasn't really possible since I already needed an Arc of the RwLock before passing it to the RpcModule, and the RpcModule has no function to fetch a clone of that Arc. But hey, we end up with a double Arc anyway, so we can definitely do what you suggested.

// to wait for a read lock every time an rpc handler needs to access the config.
// This way, handlers will only wait for read locks when they need the status of the members.
pub struct MemberStatusState {
    pub key_to_addr: HashMap<frost_secp256k1_tr::Identifier, String>,
mimoo (Contributor):

note that now we have this information in two different places, it already exists in Orchestrator.committee_cfg.members

RedaOps (Author):

Yup. It would probably be best to move the status logic into the Orchestrator struct so it would always have access to that map, but I figured this was better for separating out this auxiliary functionality. The map is not going to grow large anyway.

.key_to_addr
.iter()
.map(|(key, addr)| {
    let status = r_lock.status.get(key).unwrap().clone();
mimoo (Contributor):

maybe cleaner:

let matcher = |s| matches!(s, MemberStatus::Online) || matches!(s, MemberStatus::Disconnected((r, _)) if r < unixtime());
r_lock.status.values().filter(matcher).collect_vec()

RedaOps (Author):

We can filter first using the cleaner matches! macro, but we would still need the map for the next part of the code.

let member_status = state_w.status.get_mut(key).unwrap();
let last_retries = match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    _ => 0,
mimoo (Contributor):

note that it is possible that the node was marked as Offline in the meantime (due to a failure to respond in round 2 of a signature), so here we might want to have

match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    MemberStatus::Online => 0,
    MemberStatus::Offline => continue,
}

@mimoo mimoo merged commit 19405dc into sigma0-dev:main Jan 27, 2024
RedaOps added a commit to RedaOps/zkbitcoin that referenced this pull request Jan 27, 2024
RedaOps added a commit to RedaOps/zkbitcoin that referenced this pull request Jan 27, 2024
Successfully merging this pull request may close these issues:

  • randomize and "concurrencize" the flow
  • ping & status