Goal: 99.9% TSN Reliability #415
Comments
Please consider this initial input a draft. Feel free to discuss, add, or remove items from the list. Q:
@truflation/team-tsn-core |
Let's execute on it right after we finish: |
I do think some kind of |
Quite a few tools (open source and paid) can help with stress testing.
in place, we will work on this (#415) Goal. |
@rsoury, could you please help to prepare and finalize the specs document here? |
How do you feel about making stress tests and benchmarks part of a separate goal, more specific to them? Take this as a suggestion: although they are also important for the system's reliability, the architecture and planning of this kind of testing may be complex enough to warrant its own set of objectives. We would then keep this goal focused on simple logic + observability, and the other would be about answering: "how much pressure can our system support, and how does it behave?" |
I believe a separate goal is best suited. The kicker here is that we can create observability and monitoring on TSN, but it will initially cover only the Nodes that we operate at Truflation. Requesting these observability metrics via a pipeline from our Node Operator partners is a separate challenge entirely. |
@rsoury where are we with this goal? Is the description sufficient or still subject to debate? Should we move it to Google Docs to allow everyone to comment? |
@markholdex - Yes, I believe a Google Doc would be good for this. Our focus is currently on the higher-priority #393 and #438. We may need to re-adjust the backlog to accommodate this pipeline - @zolotokrylin Once the doc is ready for me to start drafting on, I'll begin the Spec for this. |
@rsoury you can start the doc :) no problem if you will be the creator of it. |
@markholdex - Spec has been added to this issue: #415 (comment) Please note: I've essentially integrated #443 into this Issue -- such that #443 will become a "Problem: Cannot observe Network Operator participation in Blockchain Consensus Data" |
Hey team, I added some suggestions to be accepted (just so I know they're acknowledged) and some comments. I also created a task list for it, but I can change it as soon as it drifts from the current state there. See Tasks
Issues
Instrumentation
Analysis and Visualization
Validator Monitoring
Quality Control
Observability
These should be refined after we choose a monitoring service.
|
Please provide commenting access to the document |
@zolotokrylin - Done |
Hey team, the tasks here are only partially clear. What's missing, and harder to plan, is the selection of the metrics and visualization vendor. I suggested DataDog in the specs because it fits the interface and we can probably get one free year with it. Maybe pursuing this trial will be part of the tasks here (if accepted). But can we start the coding with the tasks already available? I think it will be easier to integrate and select vendors with the real data we collect here (versus the planned data) -- we already defined in the specs that it should interface with Vector / OpenTelemetry |
@outerlook let me know when you begin working on this goal. Thx! |
Hey @markholdex, I'm finally starting this one! |
@outerlook do you have an ETA in mind for the completion of the Goal? |
ETA is Oct 10 (next Thursday)
|
I'll put the final touches on this on hold to help with https://github.com/truflation/tsn-data-provider/issues/255, as it seems more urgent (it is blocking the ingestor team). However, our tasks here are almost as complete as they can be. The only one missing is #595. The benefit of solving it is that our other internal teams can know whether our systems are down (and which ones) without having a seat on Grafana. The other pending issues become more relevant after Kwil's metrics implementation (next version). I also have a support ticket open with Grafana Cloud to resolve an excess metric that occurred during exploration, but that should be fine too. |
Side note: The benefit of using Grafana Cloud is that nearly everything "managed" by this third party can be brought in-house and operated by us. |
@outerlook thanks for letting us know. I believe you can wrap up your remaining work on: Let us know the ETA for its completion. Thx! while @rsoury is still writing specs for the new ingestor: The remaining Kwil-dependent task will be isolated in a separate goal. |
Hey @markholdex, it's done now! Here's the public service status: Note that I focused only on whether each service is up or not, and not on making it pretty; the initial intention is to show it internally in case it's needed. I also locked the time picker to a one-day window for simplicity. I'll also update this repo's readme to include it. It should be done in 30 min |
@outerlook looks cool. Great effort 💪 Please feel free to update the readme file. I will move the remaining problems into: Also, can we have access to the extensive dashboard that you shared in the screenshot above? |
@outerlook got access to the dashboard now. Please let me know when Readme is updated. |
@markholdex #656 done! |
@outerlook awesome! 👏 🏆 |
Spec
This Goal follows after these:
Issues
Instrumentation
Monitoring Service Selection
Monitoring Service Setup
Data Routing to Monitoring Service
Analysis and Visualization
Quality Control
Observability
Monitoring Configuration (Post Service Selection)
These tasks should be addressed after selecting the monitoring service.