How do we occasionally test deployed instances and report errors? #52

Open
2byrds opened this issue Aug 19, 2024 · 3 comments

@2byrds
Collaborator

2byrds commented Aug 19, 2024

For instance, our witnesses, API, and verifier are deployed to dev and test. But do we test them daily/automatically to determine whether they are healthy?

@ronakseth96
Collaborator

We have implemented most of these things and are in the final step of setting up email alerts.

Service Health Checks:
Most of these services are currently set up with health checks that monitor their operational status. The checks probe each service at 5-second intervals to verify that it is functioning as expected. If a service becomes unhealthy, Copilot triggers an automatic restart to minimize downtime and restore service.
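
To address the "daily/automatically" part of the question, the same kind of check could also be run from outside the cluster on a schedule. Here is a rough sketch of such a probe. The endpoint URLs and the `/health` path are placeholders (the real dev/test URLs are not given in this thread), and the function names are made up for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints; replace with the real dev/test health URLs.
ENDPOINTS = {
    "witness": "https://witness.example.org/health",
    "api": "https://api.example.org/health",
    "verifier": "https://verifier.example.org/health",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def find_unhealthy(endpoints, fetch=http_status):
    """Return {name: status} for every endpoint that did not answer 200."""
    failures = {}
    for name, url in endpoints.items():
        status = fetch(url)
        if status != 200:
            failures[name] = status
    return failures
```

Calling `find_unhealthy(ENDPOINTS)` from a daily scheduled job and failing the job when the result is non-empty would give a simple pass/fail signal. The `fetch` parameter is there so the logic can be tested without network access.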

Autoscaling setup:
The test witness service is now configured with autoscaling, allowing it to scale dynamically within a set range of tasks, currently 1 to 2. The triggers are based on CPU and memory usage thresholds, so the service scales up automatically under increased load and scales down when the load decreases.
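
To make the scaling behaviour concrete, here is a simplified sketch of the decision. Only the 1 to 2 task range comes from the setup described above; the 70% and 30% thresholds are assumptions for illustration, and the real AWS autoscaler computes capacity differently from this step-style rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    min_tasks: int = 1            # task range described above
    max_tasks: int = 2
    scale_out_pct: float = 70.0   # assumed threshold, not from this thread
    scale_in_pct: float = 30.0    # assumed threshold, not from this thread


def desired_tasks(current, cpu_pct, mem_pct, policy=ScalingPolicy()):
    """Return the task count after one scaling evaluation."""
    # Scale on whichever resource is under more pressure.
    load = max(cpu_pct, mem_pct)
    if load >= policy.scale_out_pct:
        target = current + 1
    elif load <= policy.scale_in_pct:
        target = current - 1
    else:
        target = current
    # Never leave the configured task range.
    return max(policy.min_tasks, min(policy.max_tasks, target))
```

For example, `desired_tasks(1, cpu_pct=85, mem_pct=40)` returns 2, and `desired_tasks(2, cpu_pct=10, mem_pct=20)` returns 1.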

CloudWatch Monitoring/Alarms:
Besides health checks and autoscaling, we are using AWS CloudWatch to monitor key performance metrics such as CPU and memory usage. A CloudWatch dashboard has been set up for the test witness service, and alarms are configured to trigger when certain thresholds are crossed, which will help us manage performance.
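
As an illustration of what one such alarm could look like, here is a sketch that builds the arguments for boto3's CloudWatch `put_metric_alarm` call. It assumes the services run on ECS (so the `AWS/ECS` namespace with `ClusterName`/`ServiceName` dimensions applies). The alarm name, the 80% threshold, and the evaluation window are placeholders, not the values we actually use.

```python
def cpu_alarm_params(cluster, service, threshold_pct=80.0, topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    params = {
        "AlarmName": f"{service}-cpu-high",   # hypothetical naming scheme
        "Namespace": "AWS/ECS",               # assumes an ECS service
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,                          # seconds per datapoint
        "EvaluationPeriods": 3,                # 3 consecutive breaches
        "Threshold": threshold_pct,            # assumed threshold
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if topic_arn:
        # Notify this SNS topic when the alarm goes into ALARM state.
        params["AlarmActions"] = [topic_arn]
    return params


def create_alarm(params):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

A memory alarm would look the same with `MetricName` set to `MemoryUtilization`.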


Automated Alerts:
The final step is setting up automated alerts that notify us via email when an alarm is activated. This will allow us to identify and address potential service disruptions or performance issues.
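
One common way to do this is an SNS topic with an email subscription. A minimal sketch is below, assuming a boto3 SNS client is passed in; the topic name and email address are placeholders.

```python
def subscribe_email(sns, topic_name, email):
    """Create (or reuse) an SNS topic and subscribe an email address to it.

    `sns` is expected to be a boto3 SNS client, e.g. boto3.client("sns").
    """
    # create_topic returns the existing topic if the name is already taken.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    # The recipient must confirm the subscription from the email they receive.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn
```

The returned topic ARN would then go into the `AlarmActions` list of each CloudWatch alarm so that a state change to ALARM sends the email.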

@2byrds
Collaborator Author

2byrds commented Aug 27, 2024

@ronakseth96 thank you for the synopsis! Can you create the necessary follow-on issues and make sure they are in the reg-pilot project?

@ronakseth96
Collaborator

Updates on service autoscaling, monitoring, and alerts:

  1. Autoscaling setup:
Based on recent evaluations, autoscaling has also been implemented for the verification and api services in the dev domain. The setup scales each service between 1 and 2 tasks, triggered by predefined CPU and memory usage thresholds. After a review found no issues, the same setup was extended to the test domain.

  2. CloudWatch monitoring/alarms:
A dedicated CloudWatch dashboard named reg-pilot has been set up for both services. It shows metrics on memory usage, CPU utilization, and filesystem storage. Continuous monitoring of these metrics will help us fine-tune resource capacity planning and optimize performance.
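
For reference, a dashboard like this can be described as a JSON body of metric widgets. The sketch below builds one CPU and one memory widget per service, assuming ECS metrics in the `AWS/ECS` namespace. The cluster name, service names, and region are placeholders, and the filesystem storage widget is left out because its metric name depends on how the services are set up.

```python
import json


def dashboard_body(cluster, services, region):
    """Return a CloudWatch dashboard body (JSON string) for the services."""
    widgets = []
    for row, service in enumerate(services):
        for col, metric in enumerate(("CPUUtilization", "MemoryUtilization")):
            widgets.append({
                "type": "metric",
                # The dashboard grid is 24 units wide: two widgets per row.
                "x": col * 12,
                "y": row * 6,
                "width": 12,
                "height": 6,
                "properties": {
                    "title": f"{service} {metric}",
                    "region": region,
                    "stat": "Average",
                    "period": 300,
                    "metrics": [[
                        "AWS/ECS", metric,
                        "ClusterName", cluster,
                        "ServiceName", service,
                    ]],
                },
            })
    return json.dumps({"widgets": widgets})
```

The resulting string can be passed as `DashboardBody` to boto3's CloudWatch `put_dashboard` call, with `DashboardName="reg-pilot"`.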


  3. Automated alerts setup:
Manual alerts have been temporarily configured for the witness service while automated email alerts are in progress. These alerts will notify us via email of any performance issues.
