
Goal: Validator Monitoring #443

Open
4 tasks
outerlook opened this issue Aug 6, 2024 · 17 comments

Comments

@outerlook
Contributor

outerlook commented Aug 6, 2024

Objective

Assess the participation of validators (node operators) in the network during normal operations.

Description

For example, a visual representation could be a graph showing how many blocks per day a given operator was supporting the network.

The mechanism relies on each node's signature being validated during consensus, after which the indexer exposes the signer's public key as part of the block information.

The internal CometBFT endpoint probably already exposes this: https://github.com/cometbft/cometbft/blob/v0.38.x/spec/core/data_structures.md

If so, what is the easiest path to expose this data from the kwil indexer? Should we expose a node's CometBFT API endpoint?
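
For illustration, a minimal sketch assuming direct access to a node's CometBFT RPC on the default port (26657); the RPC URL and height are assumptions, not a decided design:

```python
# Sketch: list which validators signed a given block, read straight from a
# node's CometBFT RPC /commit endpoint. URL and height are assumptions.
import requests

RPC = "http://localhost:26657"

def block_signers(height: int) -> list[str]:
    """Addresses of validators whose signature is present in the block's commit."""
    resp = requests.get(f"{RPC}/commit", params={"height": height}).json()
    sigs = resp["result"]["signed_header"]["commit"]["signatures"]
    # block_id_flag 2 = committed; 1 = absent; 3 = voted nil
    return [s["validator_address"] for s in sigs if s["block_id_flag"] == 2]

print(block_signers(1000))
```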

To Do

Define the scope of this goal:

  • just providing an API to get the data
  • creating the metrics consumption pipeline
  • visualizing the data

Problems

Blocked By

Instrumentation

Validator monitoring

@outerlook changed the title from "Validator Monitoring" to "Goal: Validator Monitoring" on Aug 6, 2024
@outerlook
Contributor Author

Hey, @zolotokrylin, can you also verify and include this goal with the correct priority on the roadmap?

Please share if you think we need more information to evaluate the business value for it correctly 🙏

@rsoury

rsoury commented Aug 7, 2024

@brennanjl - Can you confirm that CometBFT block data is available via the Indexer?

@brennanjl
Collaborator

> For example, a visual representation could be a graph showing how many blocks per day a given operator was supporting the network.

> The mechanism relies on each node's signature being validated during consensus, after which the indexer exposes the signer's public key as part of the block information.

I'm actually not sure what this means. It is mostly presumed that a node operator is supporting the network 100% of the time; if at any point 1/3 or more of the validating power is offline, the network will halt. Are you simply looking to track how long a certain validator has been a validator?

> Can you confirm that CometBFT block data is available via the Indexer?

The full block data is not (it can be read from a node directly), but indexed block metadata, such as the proposer, can be queried from the indexer.

@outerlook
Contributor Author

> Are you simply looking to track how long a certain validator has been a validator?

If I'm correctly aligned, the scenario is that there will soon be 12 node operators running TSN. Even if all of them are registered as validators, if 2 of them (less than 1/3) stay disconnected for days in a month or are otherwise inconsistent, we should have an easy way to track it.

@rsoury

rsoury commented Aug 8, 2024

@brennanjl -

> I'm actually not sure what this means

@outerlook basically clarified it above.

We want to index the CometBFT blocks to determine which validators are participating in each block.
The validating-power threshold only counts public keys that are registered as validators.
We want to determine which public keys and signatures appear in each block from CometBFT and check whether they match our count of partner node operators.
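
A rough sketch of that check for a single block, under the same assumption of direct CometBFT RPC access as above (the /validators endpoint gives the expected set, /commit gives who actually signed):

```python
# Sketch: for one height, compare the registered validator set against the
# commit signatures to see which validators did not sign. RPC URL is assumed.
import requests

RPC = "http://localhost:26657"

def participation(height: int) -> tuple[set[str], set[str]]:
    vals = requests.get(f"{RPC}/validators",
                        params={"height": height, "per_page": 100}).json()
    expected = {v["address"] for v in vals["result"]["validators"]}

    commit = requests.get(f"{RPC}/commit", params={"height": height}).json()
    sigs = commit["result"]["signed_header"]["commit"]["signatures"]
    signed = {s["validator_address"] for s in sigs if s["block_id_flag"] == 2}

    return signed, expected - signed  # (who signed, who missed)
```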

> this can be read from a node directly

In this case, we could create (or use an existing) CometBFT indexer, correct?

@rsoury mentioned this issue Aug 14, 2024
@rsoury

rsoury commented Aug 15, 2024

Confirmed by @brennanjl

The indexer does not support this.
We can make it index this quite easily.

How high a priority is this? @zolotokrylin to determine the priority, especially as it relates to #415.

@zolotokrylin
Contributor

@rsoury, if you are not busy with anything else, please define the Spec document for this Goal.
While Raffael and Mic are working on the other goals, this goal can be spec'd.

@rsoury

rsoury commented Aug 16, 2024

@zolotokrylin - The Spec for this has been merged with #415, and is already established.

The idea of observability, whether for internal or external analysis, is essentially a single goal.
This specific issue covers a single data source for that overarching Reliability goal.

@zolotokrylin
Contributor

@rsoury, could you please remove (or merge into the Specs doc if still relevant) everything from the description of this task and attach the relevant Spec file here?
Is there a clear separation between this and that goal in the spec doc?

@rsoury

rsoury commented Aug 17, 2024

Yes, it's referenced under the Validator Monitoring heading: https://docs.google.com/document/d/1-yxCyunqLhIHqLGJrIqScqRduo_lB3Ee6LGyeyY4B3A/edit#heading=h.bjrhx35jayz0

It's distinguished quite clearly: where blockchain consensus data is used as a source for observability and reliability, the spec covers what this issue calls Validator Monitoring.

@markholdex
Collaborator

markholdex commented Sep 17, 2024

@outerlook is this goal a duplicate of

@zolotokrylin
Contributor

@markholdex no. This Goal is about understanding validators' performance.

@markholdex
Collaborator

> @markholdex no. This Goal is about understanding validators' performance.

@zolotokrylin but in the Reliability goal, there are problems and specs around performance and penalties for validators that perform poorly. So it's confusing me, or maybe I'm missing something.

@outerlook
Contributor Author

outerlook commented Sep 19, 2024

@markholdex, along the way, per #443 (comment) I see it was merged in the process. I previously saw #415 as an individual-level reliability issue (are our nodes operating well? are they contributing to the network?) and this goal as network-level monitoring (which nodes aren't contributing?).

They are closely related and overlap in some ways. We could:

  • split the validator monitoring part of Goal: 99.9% TSN Reliability #415 back into this issue, but that would go against the initial decision to merge them (which I believe had a reason)
  • close this goal as a duplicate, merging any remaining aspects into the other
  • keep this goal as network-level monitoring, leaving Goal: 99.9% TSN Reliability #415 focused on our own nodes' data, and simpler

@zolotokrylin
Contributor

@markholdex, feel free to optimise the naming if you need it.

@markholdex
Collaborator

@outerlook I believe that:

  • within Goal: 99.9% TSN Reliability #415 you will gather information on which nodes are not contributing and about their operation. Right?
  • Then we should only keep this goal if there is anything extra we would like to do for the analysis of validator performance. I don't see anything for now and am considering closing this Goal as a duplicate.

@outerlook
Contributor Author

outerlook commented Sep 30, 2024

@markholdex

> you will gather information on which nodes are not contributing

Partially. I initially thought it would be a simpler step for the #415 goal to have each node emit its own data about contribution (already available).

That goal, to stay simpler, would answer "how is my node contributing to the network?" and have alarms for it, since it's our own responsibility to maintain it and to know when we messed up. That seems simpler (almost free) compared with the next question:

"How are all nodes contributing to the network?" is what I had in mind for this (#443) goal. It needs a little more setup because it requires indexer-like behavior: collecting blocks, getting the list of nodes that were supposed to contribute, and emitting a metric for each one indicating whether it contributed.

Maybe this will get easier or less relevant after #415.

But again, this is how I understood it, and it made sense. I'm OK with re-evaluating the need after #415 -- if it's really easy to assess other nodes' contributions within those tasks, I'd gladly do that to avoid more effort here.

Another point of view: would #443 already cover what we're asking about validation in #415? Yes, but #443 is harder, while #415 just needs what CometBFT's already-available metrics provide.
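
For reference, those built-in metrics come from CometBFT's instrumentation settings in config.toml; a sketch with the documented values (exact metric names, e.g. consensus_validator_missed_blocks, should be verified against the running CometBFT version):

```toml
# config.toml (CometBFT) — enable the built-in Prometheus metrics.
# prometheus defaults to false; listen address and namespace are the defaults.
[instrumentation]
prometheus = true                  # expose node metrics for scraping
prometheus_listen_addr = ":26660"  # scrape endpoint exposed by the node
namespace = "cometbft"             # metric prefix, e.g. cometbft_consensus_*
```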
