
Make TSS startup checks more forgiving #934

Merged · 11 commits merged into master from hc/exponential-backoff-at-startup on Jul 16, 2024

Conversation

HCastano (Collaborator)

Right now the TSS panics and crashes if it's not able to establish a connection with a
Substrate node or if the TSS account doesn't meet a minimum balance.

This PR makes it so that the TSS server retries establishing a connection for ~15 mins
before calling it quits. It also no longer crashes if the TSS account doesn't have a
minimum balance; instead, it prints a warning and continues.
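
The sketch below is only an illustration of the behaviour described above, not the code in this PR: `connect_with_backoff`, `check_minimum_balance`, and the exact backoff schedule are assumptions, and it presumes the `tracing` crate that the diff below also uses for its warning.

```rust
use std::{thread, time::Duration};

/// Retry `connect` with exponential backoff, giving up after roughly 15 minutes.
/// Only this connection check can ultimately fail startup.
fn connect_with_backoff<T, E: std::fmt::Display>(
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let budget = Duration::from_secs(15 * 60);
    let mut waited = Duration::ZERO;
    let mut delay = Duration::from_secs(1);

    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(err) if waited < budget => {
                tracing::warn!("Connection attempt failed ({}), retrying in {:?}", err, delay);
                thread::sleep(delay);
                waited += delay;
                delay = (delay * 2).min(Duration::from_secs(60)); // cap individual waits
            }
            // Retry budget exhausted: give up and let the caller exit.
            Err(err) => return Err(err),
        }
    }
}

/// Balance check that warns instead of panicking, so the process keeps running.
fn check_minimum_balance(account_id: &str, balance: u128, minimum: u128) {
    if balance < minimum {
        tracing::warn!(
            "TSS account {} is below the minimum balance; operations will fail until it is funded",
            account_id
        );
    }
}

fn main() {
    let mut attempts = 0;
    let connection = connect_with_backoff(|| {
        attempts += 1;
        if attempts < 3 { Err("node not reachable yet") } else { Ok("connected") }
    });
    assert_eq!(connection, Ok("connected"));

    // Emits a warning event (visible once a tracing subscriber is installed) and returns.
    check_minimum_balance("5HypotheticalTssAccount", 0, 1_000_000);
}
```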

@HCastano HCastano requested a review from JesseAbram July 12, 2024 18:52

The inline review thread below is attached to this part of the diff:

```rust
        &account_id
    )
} else {
    tracing::warn!(
```

JesseAbram (Member)

I really think we should crash the node here. To me it's the same theory as a compile error vs. a runtime error: what this does is allow validators to set up a node that will fail eventually.

HCastano (Collaborator, Author)

Disagree. An operator could see the warning on startup and do something about it later on. I could be swayed here though.

Maybe we defer the decision to @vitropy? tl;dr for you Vi: do we allow a TSS to start up without a funded TSS account?

vitropy (Contributor)

I guess this would depend on the failure mode: if the TSS cannot do anything meaningful without a funded account, it might be wise to emit an error and bail. On the other hand, this makes the problem an operator's problem, and if I'm any indication, operators are probably not going to understand how to fix such an error. If the TSS account can be funded after startup, then I'd rather log the error and continue to start up, because at least that way the Docker container stays running and the TSS can communicate with the network, even if it can't initiate transactions on its own. I imagine that when it gets funded later, the TSS will simply come alive healthily, without requiring the operator to restart the container.

HCastano (Collaborator, Author)

Thanks for the feedback @vitropy! What would happen if the TSS started without any funds is that, if it tried to participate in operations (e.g. registration or signing), it would error out with something like "Insufficient funds for operation".

However, as soon as it was funded it would operate normally.

The point about not having to restart Docker containers is a good one, and to me that makes a clear case for not crashing the process straight away.

@JesseAbram do you have a strong opinion against this? If not, I say we merge this as-is and we can revisit later if needed.
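
To make the failure mode HCastano describes concrete, here is a hypothetical sketch (`TssError`, `ensure_funded`, and `register` are invented for this illustration, not the real API): the funds check happens per operation, so an unfunded account makes individual calls fail while the process stays up, and the same call succeeds once the account is funded.

```rust
/// Illustrative error type; the real server has its own error handling.
#[derive(Debug)]
enum TssError {
    InsufficientFunds,
}

/// Per-operation funds check: surfaces as an "Insufficient funds" style error
/// to the caller, while the server process itself keeps running.
fn ensure_funded(free_balance: u128, fee_estimate: u128) -> Result<(), TssError> {
    if free_balance < fee_estimate {
        return Err(TssError::InsufficientFunds);
    }
    Ok(())
}

/// Hypothetical operation (e.g. taking part in registration) that needs funds.
fn register(free_balance: u128) -> Result<(), TssError> {
    ensure_funded(free_balance, 100)?; // made-up fee estimate
    // ... submit the relevant extrinsic / take part in the protocol here ...
    Ok(())
}

fn main() {
    assert!(register(0).is_err()); // unfunded: only this call fails, the process lives on
    assert!(register(1_000).is_ok()); // once the account is funded, the same call succeeds
}
```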

JesseAbram (Member)

I do, as we personally have made this exact mistake before and it was pretty hard to diagnose. Instead of letting an operator run a node that would fail during a critical process (registering), we should force them to restart the Docker container.

JesseAbram (Member)

If they can't figure out how to fund an account, they should not be a validator. Hard stop.

vitropy (Contributor)

> If they can't figure out how to fund an account, they should not be a validator. Hard stop.

Would a TSS account that had no funds interfere with or cause issues for other validators on the network, or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Also,

> we personally have made this exact mistake before and it was pretty hard to diagnose.

Why was it hard to diagnose?

JesseAbram (Member)

> If they can't figure out how to fund an account, they should not be a validator. Hard stop.

> Would a TSS account that had no funds interfere with or cause issues for other validators on the network, or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?

Yes. In the case where they are in the signing group, they would not be able to signal to the chain that they have completed the DKG, which would cause the whole registration process to fail (this affects the user and the other validators).

> we personally have made this exact mistake before and it was pretty hard to diagnose.

> Why was it hard to diagnose?

This was before we set up Loki, so I had to read through all the individual logs. However, in the case where it's not our own validator but some random other person's, it would be extremely challenging without having their logs.

vitropy (Contributor) · Jul 15, 2024

> In the case where they are in the signing group, they would not be able to signal to the chain that they have completed the DKG, which would cause the whole registration process to fail (this affects the user and the other validators).

Perhaps my lack of domain expertise is showing, but:

  • as I understand it, signing groups are going away in favor of the new tofino spec's "signing committee" subset drawn from the available TSS pool, no?
  • if a TSS server is running with an unfunded account, shouldn't that inherently make it ineligible for participating in the signing committee? That's a decision that should happen at the core protocol level during actual runtime, rather than relying on the presupposition that any running TSS container was correctly launched with a funded key in the first place (see the sketch after this comment).

> in the case where it's not our own validator but some random other person's, it would be extremely challenging without having their logs

This strikes me as a problem for me to have and for you not to worry about. I do appreciate that you are trying to prevent an ops issue for less experienced/careful system administrators by "failing during the equivalent of compile time", in a way, but I don't think this is going to play out the way you might be thinking. Besides, if a sysop who isn't on my team is running a node and can't access their own logs, then I agree there's a more fundamental skills issue at play, and I genuinely would just encourage you as core devs not to worry about that edge case. 🙏🏻
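
Purely to illustrate the runtime-level eligibility check suggested in the second bullet above, here is a hypothetical sketch; `Candidate`, `select_committee`, and the balance threshold are invented for this example, and nothing in this thread confirms how committee selection actually works.

```rust
/// Illustrative candidate record for committee selection.
struct Candidate {
    account_id: String,
    free_balance: u128,
}

/// Pick a signing committee, skipping any TSS account below the funding threshold.
fn select_committee(candidates: Vec<Candidate>, min_balance: u128, size: usize) -> Vec<Candidate> {
    candidates
        .into_iter()
        .filter(|c| c.free_balance >= min_balance) // unfunded accounts are simply ineligible
        .take(size)
        .collect()
}

fn main() {
    let pool = vec![
        Candidate { account_id: "funded".into(), free_balance: 500 },
        Candidate { account_id: "unfunded".into(), free_balance: 0 },
    ];
    let committee = select_committee(pool, 100, 2);
    assert_eq!(committee.len(), 1);
    for member in &committee {
        println!("selected {}", member.account_id); // only the funded candidate remains
    }
}
```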

HCastano (Collaborator, Author)

My 2 cents to close this out:

With the logging we have, it should not be hard for an operator to find out at process startup that their node does not have sufficient funds, and to address that without needing to muck around with their infrastructure again.

Having Docker containers crash seems to me like the wrong way to tell them about this, and I don't want to couple process startup with having a funded account.

The implications of running a node with a non-funded account are not problematic on a full network. It's been problematic for us during genesis specifically because there's been no redundancy, i.e. we've needed every machine we've spun up to work. On a live network this isn't the case: some machines can be down and the network can continue to operate.

If we find that operators are unable to figure this out on their own, or if the implications of machines running without funds increase, we can revisit this.

@HCastano HCastano requested a review from JesseAbram July 15, 2024 15:02
Base automatically changed from Migrate-to-Threshold to master July 16, 2024 16:03
@HCastano HCastano force-pushed the hc/exponential-backoff-at-startup branch from 47d43a8 to e64fa4d July 16, 2024 16:23
@HCastano HCastano merged commit 15fbec8 into master Jul 16, 2024
13 checks passed
@HCastano HCastano deleted the hc/exponential-backoff-at-startup branch July 16, 2024 17:09