
Make TSS startup checks more forgiving #934

Merged · 11 commits merged into master from hc/exponential-backoff-at-startup on Jul 16, 2024

Conversation

HCastano (Collaborator)

Right now the TSS panics and crashes if it's not able to establish a connection with a
Substrate node or if the TSS account doesn't meet a minimum balance.

This PR makes it so that the TSS server retries establishing a connection for ~15 mins
before calling it quits. It also no longer crashes if the TSS account doesn't have a
minimum balance; instead, it prints a warning and continues.
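
The sketch below is only an illustration of the behaviour described above, not the code in this PR: `connect_with_backoff`, `check_minimum_balance`, and the exact backoff schedule are assumptions, and it presumes the `tracing` crate that the diff below also uses for its warning.

```rust
use std::{thread, time::Duration};

/// Retry `connect` with exponential backoff, giving up after roughly 15 minutes.
/// Only this connection check can ultimately fail startup.
fn connect_with_backoff<T, E: std::fmt::Display>(
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let budget = Duration::from_secs(15 * 60);
    let mut waited = Duration::ZERO;
    let mut delay = Duration::from_secs(1);

    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(err) if waited < budget => {
                tracing::warn!("Connection attempt failed ({}), retrying in {:?}", err, delay);
                thread::sleep(delay);
                waited += delay;
                delay = (delay * 2).min(Duration::from_secs(60)); // cap individual waits
            }
            // Retry budget exhausted: give up and let the caller exit.
            Err(err) => return Err(err),
        }
    }
}

/// Balance check that warns instead of panicking, so the process keeps running.
fn check_minimum_balance(account_id: &str, balance: u128, minimum: u128) {
    if balance < minimum {
        tracing::warn!(
            "TSS account {} is below the minimum balance; operations will fail until it is funded",
            account_id
        );
    }
}

fn main() {
    let mut attempts = 0;
    let connection = connect_with_backoff(|| {
        attempts += 1;
        if attempts < 3 { Err("node not reachable yet") } else { Ok("connected") }
    });
    assert_eq!(connection, Ok("connected"));

    // Emits a warning event (visible once a tracing subscriber is installed) and returns.
    check_minimum_balance("5HypotheticalTssAccount", 0, 1_000_000);
}
```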

@HCastano HCastano requested a review from JesseAbram July 12, 2024 18:52

The inline review thread below is attached to this part of the diff:

```rust
        &account_id
    )
} else {
    tracing::warn!(
```

JesseAbram (Member)

I really think we should crash the node here. To me it's the same theory as a compile error vs. a runtime error: what this does is allow validators to set up a node that will fail eventually.

HCastano (Collaborator, Author)

Disagree. An operator could see the warning on startup and do something about it later on. I could be swayed here though.

Maybe we defer the decision to @vitropy? tl;dr for you Vi: do we allow a TSS to start up without a funded TSS account?

vitropy (Contributor)

I guess this would depend on the failure mode: if the TSS cannot do anything meaningful without a funded account, it might be wise to emit an error and bail. On the other hand, this makes the problem an operator's problem, and if I'm any indication, operators are probably not going to understand how to fix such an error. If the TSS account can be funded after startup, then I'd rather log the error and continue to start up, because at least that way the Docker container stays running and the TSS can communicate with the network, even if it can't initiate transactions on its own. I imagine that when it gets funded later, the TSS will simply come alive healthily, without requiring the operator to restart the container.

HCastano (Collaborator, Author)

Thanks for the feedback @vitropy! What would happen if the TSS started without any funds is that, if it tried to participate in operations (e.g. registration or signing), it would error out with something like "Insufficient funds for operation".

However, as soon as it was funded it would operate normally.

The point about not having to restart Docker containers is a good one, and to me that makes a clear case for not crashing the process straight away.

@JesseAbram do you have a strong opinion against this? If not, I say we merge this as-is and we can revisit later if needed.
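
To make the failure mode HCastano describes concrete, here is a hypothetical sketch (`TssError`, `ensure_funded`, and `register` are invented for this illustration, not the real API): the funds check happens per operation, so an unfunded account makes individual calls fail while the process stays up, and the same call succeeds once the account is funded.

```rust
/// Illustrative error type; the real server has its own error handling.
#[derive(Debug)]
enum TssError {
    InsufficientFunds,
}

/// Per-operation funds check: surfaces as an "Insufficient funds" style error
/// to the caller, while the server process itself keeps running.
fn ensure_funded(free_balance: u128, fee_estimate: u128) -> Result<(), TssError> {
    if free_balance < fee_estimate {
        return Err(TssError::InsufficientFunds);
    }
    Ok(())
}

/// Hypothetical operation (e.g. taking part in registration) that needs funds.
fn register(free_balance: u128) -> Result<(), TssError> {
    ensure_funded(free_balance, 100)?; // made-up fee estimate
    // ... submit the relevant extrinsic / take part in the protocol here ...
    Ok(())
}

fn main() {
    assert!(register(0).is_err()); // unfunded: only this call fails, the process lives on
    assert!(register(1_000).is_ok()); // once the account is funded, the same call succeeds
}
```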

JesseAbram (Member)

I do, as we personally have made this exact mistake before and it was pretty hard to diagnose. Instead of letting an operator run a node that would fail during a critical process (registering), we should force them to restart the Docker container.

JesseAbram (Member)

If they can't figure out how to fund an account, they should not be a validator. Hard stop.

vitropy (Contributor)

> If they can't figure out how to fund an account, they should not be a validator. Hard stop.

Would a TSS account that had no funds interfere with or cause issues for other validators on the network, or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Also,

> we personally have made this exact mistake before and it was pretty hard to diagnose.

Why was it hard to diagnose?

JesseAbram (Member)

> If they can't figure out how to fund an account, they should not be a validator. Hard stop.

> Would a TSS account that had no funds interfere with or cause issues for other validators on the network, or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Yes. In the case where they are in the signing group, they would not be able to signal to the chain that they have completed the DKG, which would cause the whole registration process to fail (this affects the user and the other validators).

> we personally have made this exact mistake before and it was pretty hard to diagnose.

> Why was it hard to diagnose?

This was before we set up Loki, so I had to read through all the individual logs. However, in the case where it's not our own validator but some random other person's, it would be extremely challenging without having their logs.

vitropy (Contributor) · Jul 15, 2024

> In the case where they are in the signing group, they would not be able to signal to the chain that they have completed the DKG, which would cause the whole registration process to fail (this affects the user and the other validators).

Perhaps my lack of domain expertise is showing, but:

  • as I understand it, signing groups are going away in favor of the new tofino spec's "signing committee" subset drawn from the available TSS pool, no?
  • if a TSS server is running with an unfunded account, shouldn't that inherently make it ineligible for participating in the signing committee? That's a decision that should happen at the core protocol level during actual runtime, rather than relying on the presupposition that any running TSS container was correctly launched with a funded key in the first place (see the sketch after this comment).

> in the case where it's not our own validator but some random other person's, it would be extremely challenging without having their logs

This strikes me as a problem for me to have and for you not to worry about. I do appreciate that you are trying to prevent an ops issue for less experienced/careful system administrators by "failing during the equivalent of compile time", in a way, but I don't think this is going to play out the way you might be thinking. Besides, if a sysop who isn't on my team is running a node and can't access their own logs, then I agree there's a more fundamental skills issue at play, and I genuinely would just encourage you as core devs not to worry about that edge case. 🙏🏻
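
Purely to illustrate the runtime-level eligibility check suggested in the second bullet above, here is a hypothetical sketch; `Candidate`, `select_committee`, and the balance threshold are invented for this example, and nothing in this thread confirms how committee selection actually works.

```rust
/// Illustrative candidate record for committee selection.
struct Candidate {
    account_id: String,
    free_balance: u128,
}

/// Pick a signing committee, skipping any TSS account below the funding threshold.
fn select_committee(candidates: Vec<Candidate>, min_balance: u128, size: usize) -> Vec<Candidate> {
    candidates
        .into_iter()
        .filter(|c| c.free_balance >= min_balance) // unfunded accounts are simply ineligible
        .take(size)
        .collect()
}

fn main() {
    let pool = vec![
        Candidate { account_id: "funded".into(), free_balance: 500 },
        Candidate { account_id: "unfunded".into(), free_balance: 0 },
    ];
    let committee = select_committee(pool, 100, 2);
    assert_eq!(committee.len(), 1);
    for member in &committee {
        println!("selected {}", member.account_id); // only the funded candidate remains
    }
}
```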

HCastano (Collaborator, Author)

My 2 cents to close this out:

With the logging we have, it should not be hard for an operator to find out at process startup that their node does not have sufficient funds, and to address that without needing to muck around with their infrastructure again.

Having Docker containers crash seems to me like the wrong way to tell them about this, and I don't want to couple process startup with having a funded account.

The implications of running a node with a non-funded account are not problematic on a full network. It's been problematic for us during genesis specifically because there's been no redundancy, i.e. we've needed every machine we've spun up to work. On a live network this isn't the case: some machines can be down and the network can continue to operate.

If we find that operators are unable to figure this out on their own, or if the implications of machines running without funds increase, we can revisit this.

@HCastano HCastano requested a review from JesseAbram July 15, 2024 15:02
Base automatically changed from Migrate-to-Threshold to master July 16, 2024 16:03
@HCastano HCastano force-pushed the hc/exponential-backoff-at-startup branch from 47d43a8 to e64fa4d July 16, 2024 16:23
@HCastano HCastano merged commit 15fbec8 into master Jul 16, 2024
13 checks passed
@HCastano HCastano deleted the hc/exponential-backoff-at-startup branch July 16, 2024 17:09