Make TSS startup checks more forgiving #934

Merged
merged 11 commits into from
Jul 16, 2024
15 changes: 15 additions & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions crates/threshold-signature-server/Cargo.toml
@@ -23,6 +23,7 @@ reqwest-eventsource="0.6"
serde_derive ="1.0.147"
synedrion ={ git="https://github.com/entropyxyz/synedrion", rev="25373111cbb01e1a25d8a5c5bb8f4652c725b3f1" }
strum ="0.26.2"
backoff ={ version="0.4.0", features=["tokio"] }

# Async
futures="0.3"
88 changes: 75 additions & 13 deletions crates/threshold-signature-server/src/helpers/launch.rs
@@ -241,21 +241,14 @@ pub struct StartupArgs {
pub mnemonic_file: Option<PathBuf>,
}

pub async fn has_mnemonic(kv: &KvManager) -> (bool, String) {
pub async fn has_mnemonic(kv: &KvManager) -> bool {
let exists = kv.kv().exists(FORBIDDEN_KEY_MNEMONIC).await.expect("issue querying DB");
let mut account_id = "".to_string();

if exists {
tracing::debug!("Existing mnemonic found in keystore.");
let mnemonic = kv.kv().get(FORBIDDEN_KEYS[0]).await.expect("Issue getting mnemonic");
let pair = <sr25519::Pair as Pair>::from_phrase(
&String::from_utf8(mnemonic).expect("Issue converting mnemonic to string"),
None,
)
.expect("Issue converting mnemonic to pair");
account_id = AccountId32::new(pair.0.public().into()).to_ss58check();
}

(exists, account_id)
exists
}

pub fn development_mnemonic(validator_name: &Option<ValidatorName>) -> bip39::Mnemonic {
@@ -275,8 +268,8 @@ pub fn development_mnemonic(validator_name: &Option<ValidatorName>) -> bip39::Mn
.expect("Unable to parse given mnemonic.")
}

pub async fn setup_mnemonic(kv: &KvManager, mnemonic: bip39::Mnemonic) -> String {
if has_mnemonic(kv).await.0 {
pub async fn setup_mnemonic(kv: &KvManager, mnemonic: bip39::Mnemonic) {
if has_mnemonic(kv).await {
tracing::warn!("Deleting account related keys from KVDB.");

kv.kv()
@@ -341,7 +334,16 @@ pub async fn setup_mnemonic(kv: &KvManager, mnemonic: bip39::Mnemonic) -> String
fs::write(".entropy/account_id", format!("{id}")).expect("Failed to write account_id file");

tracing::debug!("Starting process with account ID: `{id}`");
id.to_ss58check()
}

pub async fn threshold_account_id(kv: &KvManager) -> String {
let mnemonic = kv.kv().get(FORBIDDEN_KEY_MNEMONIC).await.expect("Issue getting mnemonic");
let pair = <sr25519::Pair as Pair>::from_phrase(
&String::from_utf8(mnemonic).expect("Issue converting mnemonic to string"),
None,
)
.expect("Issue converting mnemonic to pair");
AccountId32::new(pair.0.public().into()).to_ss58check()
}

pub async fn setup_latest_block_number(kv: &KvManager) -> Result<(), KvError> {
@@ -392,3 +394,63 @@ pub async fn setup_only(kv: &KvManager) {

println!("{}", output);
}

pub async fn check_node_prerequisites(url: &str, account_id: &str) {
use crate::chain_api::{get_api, get_rpc};

let connect_to_substrate_node = || async {
tracing::info!("Attempting to establish connection to Substrate node at `{}`", url);

let api = get_api(url).await.map_err(|_| {
Err::<(), String>("Unable to connect to Substrate chain API".to_string())
})?;

let rpc = get_rpc(url)
.await
.map_err(|_| Err("Unable to connect to Substrate chain RPC".to_string()))?;

Ok((api, rpc))
};

// Note: By default this will wait 15 minutes before it stops retry attempts.
let backoff = backoff::ExponentialBackoff::default();
match backoff::future::retry(backoff, connect_to_substrate_node).await {
Ok((api, rpc)) => {
tracing::info!("Successfully connected to Substrate node!");

tracing::info!("Checking balance of threshold server AccountId `{}`", &account_id);
let balance_query = crate::validator::api::check_balance_for_fees(
&api,
&rpc,
account_id.to_string(),
entropy_shared::MIN_BALANCE,
)
.await
.map_err(|_| Err::<bool, String>("Failed to get balance of account.".to_string()));

match balance_query {
Ok(has_minimum_balance) => {
if has_minimum_balance {
tracing::info!(
"The account `{}` has enough funds for submitting extrinsics.",
&account_id
)
} else {
tracing::warn!(
Member

I really think we should crash the node here. To me it is the same theory as a compile error vs. a runtime error: what this does is allow validators to set up a node that will fail eventually.

Collaborator Author

Disagree. An operator could see the warning on startup and do something about it later on. I could be swayed here though.

Maybe we defer the decision to @vitropy? tl;dr for you Vi: do we allow a TSS to start up without a funded TSS account?

Contributor

I guess this would depend on the failure mode: if the TSS cannot do anything meaningful without a funded account, it might be wise to emit an error and bail. On the other hand, this makes the problem an operator's problem, and if I'm any blueprint, operators are probably not going to understand how to fix such an error. If the TSS account can be funded after startup, then I'd rather log the error and continue to start up, because at least that way the Docker container stays running and the TSS can communicate with the network, even if it can't initiate transactions on its own. I imagine that when it gets funded later, the TSS will simply come alive healthily, without requiring the operator to restart the container.

Collaborator Author

Thanks for the feedback @vitropy! If the TSS started without any funds and then tried to participate in operations (e.g. registration or signing), it would error out with something like: "Insufficient funds for operation".

However, as soon as it was funded it would operate normally.

The point about not having to restart Docker containers is a good one, and to me that makes a clear case for not crashing the process straight away.

@JesseAbram do you have a strong opinion against this? If not, I say we merge this as is and revisit later if needed.

Member

I do, as we personally have made this exact mistake before and it was pretty hard to diagnose. Instead of letting an operator run a node that would fail during a critical process (registering), we should force them to restart the Docker container.

Member

If they can't figure out how to fund an account, they should not be a validator. Hard stop.

Contributor

> if they can't figure out how to fund an account they should not be a validator hard stop

Would a TSS account that had no funds interfere with or cause issues for other validators on the network or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Also,

> we personally have made this exact mistake before and it was pretty hard to diagnose.

Why was it hard to diagnose?

Member

> > if they can't figure out how to fund an account they should not be a validator hard stop
>
> Would a TSS account that had no funds interfere with or cause issues for other validators on the network or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Yes. In the case where they are in the signing group, they would not be able to signal to the chain that they have completed the DKG, which would cause the whole registering process to fail (this affects the user and other validators).

> Also,
>
> > we personally have made this exact mistake before and it was pretty hard to diagnose.
>
> Why was it hard to diagnose?

This was before we set up Loki, so I had to read all the individual logs. However, if it were not our validator but some other person's, it would be extremely challenging without having their logs.

Contributor

@vitropy vitropy Jul 15, 2024

> in the case where they are in the signing group they would not be able to signal that they have completed the dkg to the chain and cause the whole registering process to fail (affects user and other validators)

Perhaps my lack of domain expertise is showing but:

  • as I understand it, signing groups are going away in favor of the new tofino spec's "signing committee" subset from the available TSS pool, no?
  • if a TSS server is running with an unfunded account, shouldn't that inherently make it ineligible for participating in the signing committee, a decision that should happen at the core protocol level during actual runtime, and not rely on the presupposition that any running TSS container was correctly launched with a funded key in the first place?

> in the case of it not being our validator and some random other persons it would be extremely challenging without having their logs

This strikes me as a problem for me to have and for you to not worry about. I do appreciate that you are trying to prevent an ops issue for less experienced/careful system administrators by "failing during the equivalent of compile time" in a way, but I don't think this is going to play out the way you might be thinking. Besides, if a sysop who isn't on my team is running a node and can't access their own logs, then I agree there's a more fundamental skills issue at play, and I genuinely would just encourage you as core devs not to worry about that edge case. 🙏🏻

Collaborator Author

My 2 cents to close this out:

With the logging we have it should not be hard for an operator to find out that their
node does not have sufficient funds at process startup and address this without needing
to muck around with their infrastructure again.

Having Docker containers crash to me seems like the wrong way to tell them of this, and I
don't want to have to couple process startup with having a funded account.

The implications of running a node with a non-funded account are not problematic on a full
network. It's been problematic for us during genesis specifically because there's been no
redundancy, i.e. we've needed every machine we've spun up to work. On a live network this
isn't the case: some machines can be down and the network can continue to operate.

If we find that operators are unable to figure this out on their own, or if the
implications of machines running without funds increase, we can revisit this.

"The account `{}` does not meet the minimum balance of `{}`",
&account_id,
entropy_shared::MIN_BALANCE,
)
}
},
Err(_) => {
tracing::warn!("Unable to query the account balance of `{}`", &account_id)
},
}
},
Err(_err) => {
tracing::error!("Unable to establish connection with Substrate node at `{}`", url);
panic!("Unable to establish connection with Substrate node.");
},
}
}
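
For readers unfamiliar with the retry pattern used in `check_node_prerequisites`, here is a minimal, std-only sketch of exponential backoff. It is not the `backoff` crate's implementation: `BackoffPolicy`, `retry_with_backoff`, and the injected `sleep` callback are hypothetical names for illustration, the loop is synchronous rather than async, and it omits the jitter that real backoff implementations typically add. It only shows the shape of the idea: retry a fallible operation, wait longer after each failure, and give up once a total wait budget is spent.

```rust
use std::time::Duration;

/// Settings for the retry loop. Field names are hypothetical and are not
/// taken from the `backoff` crate.
#[derive(Clone, Copy, Debug)]
pub struct BackoffPolicy {
    /// Wait after the first failure.
    pub initial_interval: Duration,
    /// Factor applied to the wait after every failure.
    pub multiplier: u32,
    /// Upper bound for a single wait.
    pub max_interval: Duration,
    /// Upper bound for the sum of all waits.
    pub max_elapsed: Duration,
}

/// Calls `operation` until it succeeds or the wait budget is spent.
///
/// `sleep` is injected so callers decide how to wait (for example
/// `std::thread::sleep`) and tests can avoid real delays. The budget is
/// tracked as the sum of requested waits, not wall-clock time.
pub fn retry_with_backoff<T, E>(
    policy: BackoffPolicy,
    mut operation: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut waited = Duration::ZERO;
    let mut interval = policy.initial_interval;

    loop {
        match operation() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up if the next wait would exceed the total budget.
                if waited + interval > policy.max_elapsed {
                    return Err(err);
                }
                sleep(interval);
                waited += interval;
                // Grow the wait, but never beyond the per-wait cap.
                interval = (interval * policy.multiplier).min(policy.max_interval);
            }
        }
    }
}
```

In the PR itself the budget comes from `backoff::ExponentialBackoff::default()`, which the code comment above puts at 15 minutes; once it is spent, the `Err` arm logs the failure and panics.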
23 changes: 9 additions & 14 deletions crates/threshold-signature-server/src/main.rs
@@ -17,15 +17,12 @@ use std::{net::SocketAddr, str::FromStr};

use clap::Parser;

use entropy_shared::MIN_BALANCE;
use entropy_tss::{
app,
chain_api::{get_api, get_rpc},
launch::{
development_mnemonic, load_kv_store, setup_latest_block_number, setup_mnemonic, setup_only,
Configuration, StartupArgs, ValidatorName,
},
validator::api::check_balance_for_fees,
AppState,
};

@@ -91,17 +88,16 @@ async fn main() {
})
});

let account_id = if let Some(mnemonic) = user_mnemonic {
if let Some(mnemonic) = user_mnemonic {
setup_mnemonic(&kv_store, mnemonic).await
} else if cfg!(test) || validator_name.is_some() {
setup_mnemonic(&kv_store, development_mnemonic(&validator_name)).await
} else {
let (has_mnemonic, account_id) = entropy_tss::launch::has_mnemonic(&kv_store).await;
let has_mnemonic = entropy_tss::launch::has_mnemonic(&kv_store).await;
assert!(
has_mnemonic,
"No mnemonic provided. Please provide one or use a development account."
);
account_id
};

setup_latest_block_number(&kv_store).await.expect("Issue setting up Latest Block Number");
@@ -112,14 +108,13 @@ async fn main() {
if args.setup_only {
setup_only(&kv_store).await;
} else {
let api = get_api(&app_state.configuration.endpoint).await.expect("Error getting api");
let rpc = get_rpc(&app_state.configuration.endpoint).await.expect("Error getting rpc");
let has_fee_balance = check_balance_for_fees(&api, &rpc, account_id.clone(), MIN_BALANCE)
.await
.expect("Error in check balance");
if !has_fee_balance {
panic!("threshold account needs balance: {:?}", account_id);
}
let account_id = entropy_tss::launch::threshold_account_id(&kv_store).await;
entropy_tss::launch::check_node_prerequisites(
&app_state.configuration.endpoint,
&account_id,
)
.await;

let listener = tokio::net::TcpListener::bind(&addr)
.await
.expect("Unable to bind to given server address.");
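
The net effect of this change on startup can be read as a small decision table: an unreachable Substrate node is still fatal, while a failed balance query or an underfunded account now only produces a warning. The sketch below restates that policy in isolation. `StartupCheck`, `evaluate_startup`, and `should_keep_running` are hypothetical names that do not exist in the codebase, `u128` is an assumed balance type, and treating a balance exactly equal to the minimum as sufficient is also an assumption.

```rust
/// Outcome of the startup checks. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum StartupCheck {
    /// Node reachable and account funded.
    Ready,
    /// Something is off, but the process keeps running.
    Warn(String),
    /// The process cannot continue.
    Fatal(String),
}

/// Maps the observed facts to an outcome.
///
/// `balance` is `None` when the balance query itself failed.
pub fn evaluate_startup(
    connected: bool,
    balance: Option<u128>,
    min_balance: u128,
) -> StartupCheck {
    if !connected {
        return StartupCheck::Fatal("unable to connect to Substrate node".to_string());
    }

    match balance {
        None => StartupCheck::Warn("unable to query the account balance".to_string()),
        // Assumption: a balance equal to the minimum counts as sufficient.
        Some(found) if found < min_balance => StartupCheck::Warn(format!(
            "balance {found} does not meet the minimum of {min_balance}"
        )),
        Some(_) => StartupCheck::Ready,
    }
}

/// Only a fatal outcome stops the process; warnings are logged and ignored.
pub fn should_keep_running(check: &StartupCheck) -> bool {
    !matches!(check, StartupCheck::Fatal(_))
}
```

Before this PR the underfunded case panicked in `main`; the review thread above is the discussion of whether it should remain fatal.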
16 changes: 12 additions & 4 deletions crates/threshold-signature-server/src/user/tests.rs
@@ -111,8 +111,9 @@ use crate::{
get_signer,
helpers::{
launch::{
development_mnemonic, load_kv_store, setup_mnemonic, Configuration, ValidatorName,
DEFAULT_BOB_MNEMONIC, DEFAULT_CHARLIE_MNEMONIC, DEFAULT_ENDPOINT, DEFAULT_MNEMONIC,
development_mnemonic, load_kv_store, setup_mnemonic, threshold_account_id,
Configuration, ValidatorName, DEFAULT_BOB_MNEMONIC, DEFAULT_CHARLIE_MNEMONIC,
DEFAULT_ENDPOINT, DEFAULT_MNEMONIC,
},
signing::Hasher,
substrate::{query_chain, submit_transaction},
@@ -143,9 +144,16 @@ async fn test_get_signer_does_not_throw_err() {
initialize_test_logger().await;
clean_tests();

let pair = <sr25519::Pair as Pair>::from_phrase(crate::helpers::launch::DEFAULT_MNEMONIC, None)
.expect("Issue converting mnemonic to pair");
let expected_account_id = AccountId32::new(pair.0.public().into()).to_ss58check();

let kv_store = load_kv_store(&None, None).await;
let account = setup_mnemonic(&kv_store, development_mnemonic(&None)).await;
assert_eq!(account, "5DACCJgQV6sHoYUKfTGEimddFxe16NJXgkzHZ3RC9QCBShMH");
setup_mnemonic(&kv_store, development_mnemonic(&None)).await;
development_mnemonic(&None).to_string();
let account = threshold_account_id(&kv_store).await;

assert_eq!(account, expected_account_id);
get_signer(&kv_store).await.unwrap();
clean_tests();
}