-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make TSS startup checks more forgiving #934
Merged
Merged
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
36d811a
Add exponential backoff when connecting to Substrate node
HCastano 0e36283
Split out code for getting threshold account ID from mnemonic check
HCastano d1211b2
Make TSS balance check non-fatal
HCastano cf713a2
Move code for checking node connection and TSS balance to a helper
HCastano 3f0089b
Move account check out of `setup_mnemonic`
HCastano 395b731
Get rid of import warnings
HCastano 6693a2f
Update lockfile
HCastano 74c9c30
TaploFmt
HCastano 9f49bea
Use `threshold_account_id()` in tests
HCastano 55d57e9
Use correct account in test
HCastano e64fa4d
Remove `dbg!` statement
HCastano File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we should crash the node here, it is to me the same theory as a compile error vs a runtime error, what this does is allow validators to set up a node that will fail eventually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree. An operator could see the warning on startup and do something about it later on. I could be swayed here though.
Maybe we defer the decision to @vitropy? td;dr for you Vi, do we allow a TSS to start up without a funded TSS account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this would depend on the failure mode: if the TSS can not do anything meaningful without a funded account, it might be wise to emit an error and bail. On the other hand, this makes the problem an operator's problem, and if I'm any blueprint, operators are probably not going to understand how to fix such an error. If the TSS account can be funded after start up, then I'd rather log the error and continue to start up, because at least that way the Docker container stays running and the TSS can communicate with the network, even if it can't initiate transactions on its own. I imagine when it gets funded later, the TSS will simply come alive healthily, without requiring the operator to re-start the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @vitropy! What would happen if the TSS started without any funds is that if it tried to participate in operations (e.g registration or signing) it would error out with something like: "Insufficient funds for operation".
However, as soon as it was funded it would operate normally.
The point about not having to restart Docker containers is a good one, and to me that makes a clear case for not crashing the process straight away.
@JesseAbram do you have a strong opinion against this? If not I say we merge this as is and we can revisit later if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do, as we personally have made this exact mistake before and it was pretty hard to diagnose. Instead of letting an operator run a node that would fail during a critical process (registering) we should force them to restart the docker container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they can't figure out how to fund an account they should not be a validator hard stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a TSS account that had no funds interfere with or cause issues for other validators on the network or otherwise be able to function as a validator in the first place? If not, what does it matter to you if their container is running or not?
Also,
Why was it hard to diagnose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes in the case where they are in the signing group they would not be able to signal that they have completed the dkg to the chain and cause the whole registering process to fail (affects user and other validators)
This was before we set up loki so I had to read all the individual logs.......however in the case of it not being our validator and some random other persons it would be extremely challenging without having their logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps my lack of domain expertise is showing but:
This strikes me as a problem for me to have and for you to not worry about; I do appreciate that you are trying to prevent an ops issue for less experienced/careful system administrators by "failing during the equivalent of compile time" in a way but I don't think this is going to play out the way you might be thinking. Besides, if a sysop who isn't on my team is running a node and can't access their own logs then I agree there's a more fundamental skills issue at play and I genuinely just would encourage you as core devs not to worry about that edge case. 🙏🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents to close this out:
With the logging we have it should not be hard for an operator to find out that their
node does not have sufficient funds at process startup and address this without needing
to muck around with their infrastructure again.
Having Docker containers crash to me seems like the wrong way to tell them of this, and I
don't want to have to couple process startup with having a funded account.
The implications of running a node with a non-funded account are not problematic on a full
network. It's been problematic for us during genesis specifically because there's been no
redundancy, i.e we've needed every machine we've spun up to work. On a live network this
isn't the case, some machines can be down and the network can continue to operate.
If we find that operations are unable to figure this out on their own or if the
implications of machines running without funds increases we can revisit this.