Outage root cause and remediation plan #36

robcxyz opened this issue Jul 30, 2022 · 0 comments

robcxyz commented Jul 30, 2022

On 7/29, the transformer in one region went down for about 1.5 hours and did not produce blocks. At the same time, the tracker API in that region also went down. Additionally, the RPC nodes were struggling when hit externally, though the extractor was still producing blocks through an internal load balancer. The RPC nodes came back before the API did. A backfill was also running in that region and had slowed to a crawl prior to the incident.

To recap the potential causes that were considered:

  • Kafka
    • The consumer offset was still increasing, so this was likely not the cause; a sketch of one way to check the lag follows this list. The API nodes also went down, which would not have been a Kafka issue.
  • External LB
    • Probably not the cause, as the transformer does not interact with it.
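
For reference, here is a minimal sketch of one way to confirm that a consumer group's committed offsets are advancing relative to the brokers' end offsets (i.e. the lag). It assumes the `kafka-python` client; the broker address and group id are hypothetical placeholders, not values from this deployment.

```python
from kafka import KafkaAdminClient, KafkaConsumer

# Hypothetical broker address and consumer group; not values from this deployment.
BOOTSTRAP = "kafka.internal:9092"
GROUP_ID = "transformer"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Offsets the group has committed, keyed by TopicPartition.
committed = admin.list_consumer_group_offsets(GROUP_ID)

# Latest offsets on the brokers for the same partitions.
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] committed={meta.offset} end={end_offsets[tp]} lag={lag}")
```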

Nothing in the logs was indicative of a failure. Actual downtime was about 15 minutes; it took roughly 10 minutes after the initial alarm to turn the DNS records over to the other region.

The only thing that would actually prevent downtime from this in the future is hooking up a health check that Cloudflare reacts to in order to turn off the failing zone. Also, if two zones had been serving traffic there likely would have been no downtime, though at the time only one zone was live while the other was doing backfill syncs.
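
Below is a minimal sketch of what that could look like, assuming a simple health endpoint on the primary region's API and a direct DNS record update through the Cloudflare API (rather than Cloudflare's built-in load-balancer monitors). The health URL, zone/record IDs, hostname, and standby IP are hypothetical placeholders.

```python
import sys

import requests

# All values below are hypothetical placeholders, not taken from this deployment.
CF_API_TOKEN = "changeme"                           # API token with DNS edit permission
ZONE_ID = "0123456789abcdef"                        # zone containing the API hostname
RECORD_ID = "fedcba9876543210"                      # A record currently pointing at the primary region
API_HEALTH_URL = "https://api.example.com/health"   # health endpoint exposed by the tracker API
STANDBY_IP = "203.0.113.10"                         # address of the standby region's load balancer


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the primary region's API answers its health check."""
    try:
        return requests.get(API_HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def fail_over_dns() -> None:
    """Repoint the DNS record at the standby region via the Cloudflare API."""
    url = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}"
    payload = {
        "type": "A",
        "name": "api.example.com",  # hypothetical hostname
        "content": STANDBY_IP,
        "ttl": 60,                  # short TTL so the change takes effect quickly
        "proxied": True,
    }
    headers = {"Authorization": f"Bearer {CF_API_TOKEN}"}
    requests.put(url, json=payload, headers=headers, timeout=10).raise_for_status()


if __name__ == "__main__":
    if primary_is_healthy():
        sys.exit(0)
    fail_over_dns()
    sys.exit(1)  # non-zero exit so the scheduler/alerting can flag the failover
```

Run on a short interval from cron or a systemd timer (or wired into the existing alerting), something like this would take the manual DNS turnover out of the loop; Cloudflare's built-in Load Balancing monitors and origin pools would accomplish the same thing without a custom script.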
