How do we occasionally test deployed instances and report errors? #52

2byrds · 2024-08-19T20:53:23Z

For instance, our witnesses, api, and verifier are deployed to dev and test. But do we test them daily/automatically to determine whether they are healthy?

ronakseth96 · 2024-08-26T21:03:41Z

We have implemented most of these things and are in the final step of setting up email alerts.

Service Health Checks: Most of these services are currently set up with health checks that monitor their operational status. The checks probe each service at 5-second intervals to verify it is functioning as expected. If a service becomes unhealthy, Copilot triggers an automatic restart to minimize downtime and restore service.
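As a rough illustration of the check-then-restart decision described above (not the actual Copilot configuration), here is a minimal Python sketch. The 5-second interval comes from the setup described here; the retry count, the timeout, and the function names `http_probe` and `is_unhealthy` are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error statuses count as a failed probe.
        return False


def is_unhealthy(
    probe: Callable[[], bool],
    retries: int = 3,          # assumed: consecutive failures before restarting
    interval: float = 5.0,     # 5-second interval, as in the health check setup
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if `retries` consecutive probes fail."""
    for attempt in range(retries):
        if probe():
            return False
        if attempt < retries - 1:
            sleep(interval)
    return True
```

A caller would pass something like `lambda: http_probe(health_url)` as `probe`, where `health_url` is whatever health endpoint the service exposes, and request a restart when `is_unhealthy` returns `True`.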

Autoscaling setup: The test witness service is now configured with autoscaling, allowing it to scale dynamically within a set range of tasks, currently 1 to 2. The triggers are based on CPU and memory usage thresholds, so the service scales up automatically during increased load and scales down when the load decreases.
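The scaling decision can be sketched as a small pure function. This is only an illustration of the logic, not how the AWS autoscaling policy is implemented internally. The 1 to 2 task range matches the current setup; the 70% and 30% thresholds and the name `desired_task_count` are placeholders, since the real threshold values are not listed here.

```python
def desired_task_count(
    current: int,
    cpu_pct: float,
    mem_pct: float,
    *,
    min_tasks: int = 1,          # lower bound of the configured range
    max_tasks: int = 2,          # upper bound of the configured range
    scale_out_at: float = 70.0,  # assumed threshold, not the real value
    scale_in_at: float = 30.0,   # assumed threshold, not the real value
) -> int:
    """Return the task count after one scaling evaluation."""
    if cpu_pct >= scale_out_at or mem_pct >= scale_out_at:
        target = current + 1     # either metric running hot: add a task
    elif cpu_pct <= scale_in_at and mem_pct <= scale_in_at:
        target = current - 1     # both metrics idle: remove a task
    else:
        target = current         # in between: leave the count alone
    # Clamp to the configured range so the service never leaves 1..2 tasks.
    return max(min_tasks, min(max_tasks, target))
```

Scaling out on either metric but scaling in only when both are low is a deliberately conservative choice for the sketch: it avoids removing a task while memory is still under pressure.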

CloudWatch Monitoring/Alarms: Besides health checks and autoscaling, we are using AWS CloudWatch to monitor key performance metrics such as CPU and memory usage. A CloudWatch dashboard has been set up for the test witness service, and alarms are configured to trigger when certain thresholds are crossed, which will help us manage performance.
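To make the alarm behaviour concrete, here is a simplified Python sketch of threshold evaluation. The state names `OK`, `ALARM`, and `INSUFFICIENT_DATA` are the ones CloudWatch uses, but the rule below (every one of the last `periods` datapoints must exceed the threshold) is a simplification, and the default of 3 periods and the function name `alarm_state` are assumptions for the example.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float, periods: int = 3) -> str:
    """Evaluate a metric series against a threshold.

    Returns "INSUFFICIENT_DATA" when fewer than `periods` datapoints exist,
    "ALARM" when the most recent `periods` datapoints all exceed the
    threshold, and "OK" otherwise.
    """
    if len(datapoints) < periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

Requiring several consecutive breaches, rather than a single one, keeps a short CPU spike from raising an alarm.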

Automated Alerts: The final step is setting up automated alerts that notify us via email when an alarm is activated. This will allow us to identify and address potential service disruptions or performance issues.
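As an illustration of what such an alert email might contain, here is a minimal sketch that builds the message with Python's standard `email` library. It only constructs the message and does not send it. The subject format, the function name `build_alert_email`, and all parameter values are hypothetical; the real alerts would be delivered by whatever notification mechanism the alarms end up wired to.

```python
from email.message import EmailMessage


def build_alert_email(
    service: str,
    metric: str,
    value: float,
    threshold: float,
    sender: str,
    recipient: str,
) -> EmailMessage:
    """Build (but do not send) an alert email for a triggered alarm."""
    msg = EmailMessage()
    msg["Subject"] = (
        f"[ALARM] {service}: {metric} at {value:.1f}% (threshold {threshold:.1f}%)"
    )
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"The alarm for {metric} on {service} has been activated.\n"
        f"Current value: {value:.1f}%\n"
        f"Threshold: {threshold:.1f}%\n"
    )
    return msg
```

Putting the service name, metric, and threshold in the subject line means the alert can be triaged from the inbox without opening the message.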

2byrds · 2024-08-27T12:27:47Z

@ronakseth96 thank you for the synopsis! Can you create the necessary follow-on issues and make sure they are in the reg-pilot project?

ronakseth96 · 2024-09-19T12:49:35Z

Updates on service autoscaling, monitoring, and alerts:

Autoscaling setup: Based on the recent evaluations, the autoscaling configuration has also been implemented for the verification and api services in the dev domain. This setup enables dynamic scaling between 1 and 2 tasks and is triggered by predefined CPU and memory usage thresholds. Following a thorough review with no issues, the same setup was also extended to the test domain.
CloudWatch monitoring/alarms: A dedicated CloudWatch dashboard named reg-pilot has been established for both services. This dashboard provides in-depth metrics on memory usage, CPU utilization, and filesystem storage. Continuous monitoring here should help us fine-tune resource capacity planning and optimize performance.
Automated alerts setup: Manual alerts have been temporarily configured for the witness service while automated email alerts are in progress. These alerts will notify us via email of any performance issues.

2byrds added this to the reg-pilot alpha completion milestone Aug 19, 2024

2byrds assigned ronakseth96 Aug 19, 2024
