Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/finalized project monitoring #77

Merged
merged 9 commits into from
May 8, 2024

Conversation

Seth-Schmidt
Copy link

#76

Fixes #

Checklist

  • My branch is up-to-date with upstream/develop branch.
  • Everything works and tested for Python 3.8.0 and above.
  • I ran pre-commit checks against my changes.
  • I've written tests against my changes and all the current present tests are passing.

Current behaviour

Master nodes do not currently have any project state monitoring. Node operators must manually check dashboards or make calls to the protocol state contract in order to check the state of the snapshots.

New expected behaviour

The core-api /health endpoint will periodically make calls to the protocol state contract when called from the docker health check. It compares the last finalized epoch for a random sample of projects against the current epoch id, and it will then report any projects that have been unfinalized for more than 50 epochs to Slack.

Change logs

Added

  • snapshotter_last_finalized_check function to data_utils.py. Used to gather project state from the contract.
  • UnfinalizedProject data model.

Changed

  • Core-api /health endpoint to report any unfinalized projects from the sample retrieved by snapshotter_last_finalized_check.
  • Added UNFINALIZED_PROJECT to the SnapshotterReportState data model.

Deployment Instructions

Will need to provide a Slack reporting url in order to make use of this feature if it was not already provided.

@Seth-Schmidt Seth-Schmidt requested a review from a team April 10, 2024 23:09
@Seth-Schmidt Seth-Schmidt changed the base branch from main to nms_master April 17, 2024 15:05
Copy link
Member

@anomit anomit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks an alright approach to me.

The healthcheck is called at a high frequency and snapshotter_last_finalized_check() takes a random sample of 5 projects and checks their finalized status on the protocol state. Probabilistically, the high frequency checks coupled with random sampling should detect issues practically soon enough.

@SwaroopH SwaroopH requested a review from xadahiya April 19, 2024 05:09
Copy link
Member

@xadahiya xadahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good! Let's just finalize one small thing then I can approve.

snapshotter/core_api.py Show resolved Hide resolved
)

if unfinalized_projects:
for unfinalized_project in unfinalized_projects:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good! My only concern is that if the node is down for some reason, since core_api is independent of that, it will keep blasting a lot of failure notifications to Slack, hitting the rate limit. Let's limit the number of failure notifications we send out per minute.

@anomit, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes memoize this to have a check on the outflow of notifications

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a check to see if an unfinalized project has already been reported to limit the number of notifications in the case where projects cannot finalize. The endpoint will only report a maximum of 5 projects per node every 10 minutes, and if the network goes down, they will only report as many projects that have been assigned to each node.

cc701ff

Copy link
Member

@xadahiya xadahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@xadahiya xadahiya merged commit d301004 into nms_master May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants