Goal: 99.9% TSN Reliability #415

Closed
27 tasks done
outerlook opened this issue Jul 22, 2024 · 27 comments

Comments

@outerlook
Contributor

outerlook commented Jul 22, 2024

Spec


This Goal follows after these:

Issues

Instrumentation

Monitoring Service Selection

Monitoring Service Setup

Data Routing to Monitoring Service

Analysis and Visualization

Quality Control

Observability

Monitoring Configuration (Post Service Selection)

These tasks should be addressed after selecting the monitoring service.

@outerlook
Contributor Author

Please consider this initial input a draft. Feel free to discuss, add, or remove items from the list.

Q:

  • Should this goal cover only TSN, or also the Truflation data provider (e.g., tests for contract deployments)?
  • Is it necessary to convert it into a spec document?

@truflation/team-tsn-core

@zolotokrylin
Contributor

Let's execute on it right after we finish:

@zolotokrylin changed the title from "Goal: Enhance TSN Reliability" to "Goal: 99.9% TSN Reliability" on Jul 22, 2024

@MicBun
Contributor

MicBun commented Jul 22, 2024

I do think some kind of periodic stress test is needed. It might be time-consuming and slow, but it helps us discover issues that might not surface during regular tests.

@zolotokrylin
Contributor

zolotokrylin commented Jul 24, 2024

Quite a few tools (open source and paid) can help with stress testing.
Once we have:

in place, we will work on this (#415) Goal.

@zolotokrylin
Contributor

@rsoury, could you please help to prepare and finalize the specs document here?

@outerlook
Contributor Author

outerlook commented Aug 5, 2024

@zolotokrylin @MicBun @rsoury

> I do think some kind of periodic stress test is needed. It might be time-consuming and slow, but it helps us discover issues that might not surface during regular tests.

How do you feel about moving stress tests and benchmarks into a separate, more specific goal? Take this as a suggestion: although they also matter for the system's reliability, the architecture and planning of this kind of testing may be complex enough to deserve its own set of objectives.

Then we keep this goal focused on simple logic + observability, and the other one would answer: "How much pressure can our system support, and how does it behave under it?"

@rsoury

rsoury commented Aug 6, 2024

I believe a separate goal is best suited for this.

The kicker here is that we can create observability and monitoring on TSN, but it will initially cover only the nodes that we operate at Truflation.

A separate challenge entirely is requesting these observability metrics via a pipeline from our Node Operator partners.

@markholdex
Collaborator

@rsoury where are we with this goal? Is the description sufficient or still subject to debate? Should we move it to Google Docs to allow everyone to comment?

@rsoury

rsoury commented Aug 14, 2024

@markholdex - Yes, I believe a Google Doc would be good for this.

Our focus is currently on the higher-priority #393 and #438.
I believe that #443 will come before this too.

We may need to re-adjust the backlog to accommodate this pipeline - @zolotokrylin

Once the doc is ready for drafting, I'll begin the spec for this.

@zolotokrylin
Contributor

@rsoury you can start the doc :) no problem if you are its creator.

@rsoury

rsoury commented Aug 15, 2024

@markholdex - Spec has been added to this issue: #415 (comment)
It's important we establish what level of coverage we want to attain first before we attempt to stipulate how to integrate the observability framework.
In essence, we'll be replicating the efforts of Kwil, as per #383, but within the TSN node directly. We can also integrate APM-related technology into the servers/containers that run the TSN Nodes.
@outerlook - please complete the coverage section of the spec. We should then have a follow up call regarding the APMs to use, etc.

Please note: I've essentially integrated #443 into this Issue -- such that #443 will become a "Problem: Cannot observe Network Operator participation in Blockchain Consensus Data"

@outerlook
Contributor Author

outerlook commented Sep 6, 2024

Hey team, I added some suggestions to be accepted (just so I know they're acknowledged), along with some comments.

I also created a task list for it, and I can update it whenever it drifts from the current state there.

See Tasks

Issues

Instrumentation

  • Problem: Host metrics collection not implemented
  • Problem: Host logs collection not set up
  • Problem: Application metrics not instrumented with OpenTelemetry
  • Problem: Application logs not collected from Docker containers
  • Problem: Validator network statistics not collected from CometBFT Prometheus metrics
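
As a rough illustration of the last item (and of the validator participation tracking listed further down), here is a minimal sketch of reading validator statistics from a Prometheus text scrape. The sample scrape, the metric names, the chain ID, and the helper functions are assumptions made for this example, not actual TSN output or code, and the parser assumes each sample line has no trailing timestamp.

```python
from typing import Dict

# Sample scrape in the Prometheus text exposition format.
# Metric names and values are illustrative assumptions, not real TSN output.
SAMPLE_SCRAPE = """\
# HELP cometbft_consensus_height Height of the chain.
# TYPE cometbft_consensus_height gauge
cometbft_consensus_height{chain_id="example-chain"} 1200
cometbft_consensus_validators{chain_id="example-chain"} 4
cometbft_consensus_missing_validators{chain_id="example-chain"} 1
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map metric name -> value, ignoring labels, comments, and blank lines."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token (no timestamp assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


def participation_ratio(metrics: Dict[str, float]) -> float:
    """Share of validators that are not reported as missing."""
    total = metrics["cometbft_consensus_validators"]
    missing = metrics["cometbft_consensus_missing_validators"]
    return (total - missing) / total if total else 0.0


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE_SCRAPE)
    print(f"participation: {participation_ratio(parsed):.2%}")
```

In practice a collector such as Vector or an OpenTelemetry pipeline would likely scrape and forward these metrics; the sketch only shows what the derived participation figure could look like.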

Analysis and Visualization

  • Problem: Monitoring service account not set up

Validator Monitoring

  • Problem: Validator participation tracking not implemented

Quality Control

  • Problem: Automated tests for system contract not implemented
  • Problem: Automated tests for primitive and composed shared logic not created
  • Problem: Automated tests for primitive stream contract not developed
  • Problem: Automated tests for composed stream contract not implemented

Observability

  • Problem: Deployment success signals not added to TSN node startup scripts
  • Problem: Deployment success signals not added to Indexer startup script
  • Problem: Deployment success signals not added to KGW startup script
  • Problem: Monitoring service for metric collection and analysis not chosen
  • Problem: TSN Node, KGW, and Indexer Vectors' destination not configured
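
One possible shape for the deployment success signals in the startup scripts above is sketched below: poll a readiness check a few times, then emit a single structured log line that a log collector could match on. The function names, the log line format, and the service name are hypothetical and chosen only for illustration; the readiness check itself is passed in, since the thread does not say how each service reports readiness.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, pausing between failed tries."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def deployment_signal(service: str, ready: bool) -> str:
    """Build one log line that a log pipeline could alert on (format is hypothetical)."""
    status = "success" if ready else "failure"
    return f"deployment_status service={service} status={status}"


if __name__ == "__main__":
    # Placeholder check that always succeeds; a real script would probe the service.
    ready = wait_until_ready(lambda: True)
    print(deployment_signal("tsn-node", ready))
```

Injecting `check` and `sleep` keeps the logic testable without a running node.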

These should be refined after we choose a monitoring service.

  • Problem: Status page for TSN Nodes, KGW, and Indexer not created
  • Problem: Alarms for instance resources not configured
  • Problem: Alarms for status page downtimes not set up
  • Problem: Error log alarms not implemented

@zolotokrylin
Contributor

Please provide commenting access to the document

@rsoury

rsoury commented Sep 9, 2024

@zolotokrylin - Done

@outerlook
Contributor Author

outerlook commented Sep 11, 2024

Hey team, the tasks here are only partially clear. The missing parts, which are harder to plan, concern selecting the metrics and visualization vendor. I suggested Datadog in the spec because it fits the interface and we can probably get one free year with it. Maybe pursuing this trial will be part of the tasks here (if accepted).

But may we start the coding parts with the tasks available?

I think it will be easier to integrate and select vendors with the real data we collect here (vs. the planned data) -- we already defined in the spec that it should interface with Vector / OpenTelemetry.

@markholdex
Collaborator

@outerlook let me know when you begin working on this goal. Thx!

@outerlook
Contributor Author

Hey @markholdex, I'm finally starting this one!

@markholdex
Collaborator

@outerlook do you have an ETA in mind for the completion of the Goal?

@outerlook
Contributor Author

@markholdex

ETA is Oct 10 (next Thursday).

  • It's a bit pessimistic; I think I can complete it sooner, but I'm allowing for unknowns in some tasks.
  • CometBFT metrics will become available in the next Kwil version. Some of these tasks might be postponed until after that release (not worth solving them if the metrics will already be there). However, if it's easy, I'll create synthetic metrics to cover the more important cases, such as consensus health checks.
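
A synthetic consensus health check of the kind mentioned above could, for instance, watch whether the block height keeps advancing. The sketch below is only one way to do it, under the assumption that something periodically records (timestamp, block height) pairs; the function name, the sample format, and the 60-second stall threshold are all hypothetical.

```python
from typing import List, Tuple


def consensus_is_healthy(
    samples: List[Tuple[float, int]],
    now: float,
    max_stall_seconds: float = 60.0,
) -> bool:
    """Return True if the block height advanced recently enough.

    `samples` holds (unix_time, block_height) pairs sorted by time ascending.
    """
    if len(samples) < 2:
        return False  # Not enough data to tell whether the chain is moving.
    last_advance = None
    for (_, prev_height), (ts, height) in zip(samples, samples[1:]):
        if height > prev_height:
            last_advance = ts  # Remember the latest time the height increased.
    return last_advance is not None and (now - last_advance) <= max_stall_seconds


if __name__ == "__main__":
    observed = [(0.0, 10), (30.0, 11), (60.0, 11), (90.0, 11)]
    print(consensus_is_healthy(observed, now=90.0))   # height last advanced at t=30
    print(consensus_is_healthy(observed, now=120.0))  # stalled for longer than 60 s
```

The boolean result could be exported as a 0/1 gauge so that the same alerting path used for other metrics also covers consensus stalls.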

@outerlook
Contributor Author

I'll put the final touches on this on hold to help with https://github.com/truflation/tsn-data-provider/issues/255, as it seems more urgent (it is blocking the ingestor team).

However, our tasks here are almost as complete as they can be. The only one missing is #595. Solving it would let our other internal teams know whether our systems are down (and which ones) without needing a seat on Grafana.

The other pending issues become more relevant after Kwil's metrics implementation (next version).

I also have a support ticket open with Grafana Cloud to resolve an excess-metrics issue that happened during exploration, but that should be fine too.

Dashboard:
[image]

Alarms:
[image]

A friendly reminder not to leave SSH security so lax:
[image]

@rsoury

rsoury commented Oct 8, 2024

Side note: The benefit of using Grafana Cloud is that nearly everything "managed" by this third party can be brought in-house and operated there.

@markholdex
Collaborator

markholdex commented Oct 8, 2024

@outerlook thanks for letting us know. I believe you can wrap up your remaining work on:

Let us know the ETA for its completion. Thx!


while @rsoury is still writing specs for the new ingestor:

The remaining Kwil-dependent task will be isolated in a separate goal.

@outerlook
Contributor Author

outerlook commented Oct 8, 2024

Hey @markholdex, it's done now! Here's the public service status:
https://truflation.grafana.net/public-dashboards/6fe3021962bb4fe1a4aebf5baddecab6

Note that I focused only on whether services are up or not, not on making it pretty; the initial intention is to show it internally in case it's needed.

I also locked the time picker to a one-day range for simplicity.

I'll also update this repo's README to include it. It should be done in 30 min.
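
To relate an up/down status view like this one to the 99.9% target in the goal's title, here is a small worked sketch. It assumes availability is estimated from evenly spaced up/down probe results; the function names, the one-probe-per-minute cadence, and the 30-day window are assumptions for illustration, not how the dashboard actually computes anything.

```python
from typing import Sequence


def availability(probe_results: Sequence[bool]) -> float:
    """Fraction of probes that found the service up."""
    if not probe_results:
        raise ValueError("no probe results")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


def downtime_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of downtime allowed in the window while still meeting `target`."""
    return (1.0 - target) * window_days * 24 * 60


if __name__ == "__main__":
    # One probe per minute for a day (1440 probes), with two failed probes.
    day = [True] * 1438 + [False] * 2
    print(f"availability: {availability(day):.5f}")
    print(f"meets 99.9%: {availability(day) >= 0.999}")
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(f"30-day budget: {downtime_budget_minutes(0.999, 30):.1f} min")
```

With one probe per minute, a single failed probe in a day still meets 99.9%, while two failed probes do not, which shows how tight the budget is over short windows.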

@markholdex
Collaborator

markholdex commented Oct 9, 2024

@outerlook looks cool. Great effort 💪 Please feel free to update the README file. I will move the remaining problems into:

Also, can we have access to the extensive dashboard that you shared in the screenshot above?

@markholdex
Collaborator

@outerlook I have access to the dashboard now. Please let me know when the README is updated.

@outerlook
Contributor Author

@markholdex #656 done!

@zolotokrylin
Contributor

@outerlook awesome! 👏 🏆
