-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make TSS startup checks more forgiving #934
Conversation
&account_id | ||
) | ||
} else { | ||
tracing::warn!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we should crash the node here, it is to me the same theory as a compile error vs a runtime error, what this does is allow validators to set up a node that will fail eventually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree. An operator could see the warning on startup and do something about it later on. I could be swayed here though.
Maybe we defer the decision to @vitropy? td;dr for you Vi, do we allow a TSS to start up without a funded TSS account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this would depend on the failure mode: if the TSS can not do anything meaningful without a funded account, it might be wise to emit an error and bail. On the other hand, this makes the problem an operator's problem, and if I'm any blueprint, operators are probably not going to understand how to fix such an error. If the TSS account can be funded after start up, then I'd rather log the error and continue to start up, because at least that way the Docker container stays running and the TSS can communicate with the network, even if it can't initiate transactions on its own. I imagine when it gets funded later, the TSS will simply come alive healthily, without requiring the operator to re-start the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @vitropy! What would happen if the TSS started without any funds is that if it tried to participate in operations (e.g registration or signing) it would error out with something like: "Insufficient funds for operation".
However, as soon as it was funded it would operate normally.
The point about not having to restart Docker containers is a good one, and to me that makes a clear case for not crashing the process straight away.
@JesseAbram do you have a strong opinion against this? If not I say we merge this as is and we can revisit later if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do, as we personally have made this exact mistake before and it was pretty hard to diagnose. Instead of letting an operator run a node that would fail during a critical process (registering) we should force them to restart the docker container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they can't figure out how to fund an account they should not be a validator hard stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they can't figure out how to fund an account they should not be a validator hard stop
Would a TSS account that had no funds interfere with or cause issues for other validators on the network or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?
Also,
we personally have made this exact mistake before and it was pretty hard to diagnose.
Why was it hard to diagnose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they can't figure out how to fund an account they should not be a validator hard stop
Would a TSS account that had no funds interfere with or cause issues for other validators on the network or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?
yes in the case where they are in the signing group they would not be able to signal that they have completed the dkg to the chain and cause the whole registering process to fail (affects user and other validators)
Also,
we personally have made this exact mistake before and it was pretty hard to diagnose.
Why was it hard to diagnose?
This was before we set up loki so I had to read all the individual logs.......however in the case of it not being our validator and some random other persons it would be extremely challenging without having their logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the case where they are in the signing group they would not be able to signal that they have completed the dkg to the chain and cause the whole registering process to fail (affects user and other validators)
Perhaps my lack of domain expertise is showing but:
- as I understand it, signing groups are going away in favor of the new tofino spec's "signing committee" subset from the available TSS pool, no?
- if a TSS server is running with an unfunded account, shouldn't that inherently make it ineligible for participating in the signing committee, a decision that should happen at the core protocol level during actual runtime, and not rely on the presupposition that any running TSS container was correctly launched with a funded key in the first place?
in the case of it not being our validator and some random other persons it would be extremely challenging without having their logs
This strikes me as a problem for me to have and for you to not worry about; I do appreciate that you are trying to prevent an ops issue for less experienced/careful system administrators by "failing during the equivalent of compile time" in a way but I don't think this is going to play out the way you might be thinking. Besides, if a sysop who isn't on my team is running a node and can't access their own logs then I agree there's a more fundamental skills issue at play and I genuinely just would encourage you as core devs not to worry about that edge case. 🙏🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents to close this out:
With the logging we have it should not be hard for an operator to find out that their
node does not have sufficient funds at process startup and address this without needing
to muck around with their infrastructure again.
Having Docker containers crash to me seems like the wrong way to tell them of this, and I
don't want to have to couple process startup with having a funded account.
The implications of running a node with a non-funded account are not problematic on a full
network. It's been problematic for us during genesis specifically because there's been no
redundancy, i.e we've needed every machine we've spun up to work. On a live network this
isn't the case, some machines can be down and the network can continue to operate.
If we find that operations are unable to figure this out on their own or if the
implications of machines running without funds increases we can revisit this.
47d43a8
to
e64fa4d
Compare
Right now the TSS panics and crashes if it's not able to establish a connection with a
Substrate node or if the TSS account doesn't meet a minimum balance.
This PR makes it so that the TSS server retries establishing a connection for ~15 mins
before calling it quits. It also doesn't crash if the TSS account doesn't have a minimum
balance, instead it prints a warning and continues.