Outage root cause and remediation plan #36

robcxyz opened this issue Jul 30, 2022 · 0 comments

robcxyz commented Jul 30, 2022

On 7/29, the transformer in one region went down for about 1.5 hours and did not produce blocks. At the same time, the tracker API in that region also went down. Additionally, the RPC nodes were struggling when hit externally, though the extractor was still producing blocks through an internal load balancer. The RPC nodes came back before the API did. A backfill was also running in that region and had slowed to a crawl prior to the incident.

To recap the potential causes that were considered:

  • Kafka
    • The consumer offset was still increasing, so this was likely not the cause; a sketch of one way to check the lag follows this list. The API nodes also went down, which would not have been a Kafka issue.
  • External LB
    • Probably not the cause, as the transformer does not interact with it.
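
For reference, here is a minimal sketch of one way to confirm that a consumer group's committed offsets are advancing relative to the brokers' end offsets (i.e. the lag). It assumes the `kafka-python` client; the broker address and group id are hypothetical placeholders, not values from this deployment.

```python
from kafka import KafkaAdminClient, KafkaConsumer

# Hypothetical broker address and consumer group; not values from this deployment.
BOOTSTRAP = "kafka.internal:9092"
GROUP_ID = "transformer"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Offsets the group has committed, keyed by TopicPartition.
committed = admin.list_consumer_group_offsets(GROUP_ID)

# Latest offsets on the brokers for the same partitions.
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] committed={meta.offset} end={end_offsets[tp]} lag={lag}")
```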

Nothing in the logs was indicative of a failure. Actual downtime was about 15 minutes; it took roughly 10 minutes after the initial alarm to turn the DNS records over to the other region.

The only thing that would actually prevent downtime from this in the future is hooking up a health check that Cloudflare reacts to in order to turn off the failing zone. Also, if two zones had been serving traffic there likely would have been no downtime, though at the time only one zone was live while the other was doing backfill syncs.
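
Below is a minimal sketch of what that could look like, assuming a simple health endpoint on the primary region's API and a direct DNS record update through the Cloudflare API (rather than Cloudflare's built-in load-balancer monitors). The health URL, zone/record IDs, hostname, and standby IP are hypothetical placeholders.

```python
import sys

import requests

# All values below are hypothetical placeholders, not taken from this deployment.
CF_API_TOKEN = "changeme"                           # API token with DNS edit permission
ZONE_ID = "0123456789abcdef"                        # zone containing the API hostname
RECORD_ID = "fedcba9876543210"                      # A record currently pointing at the primary region
API_HEALTH_URL = "https://api.example.com/health"   # health endpoint exposed by the tracker API
STANDBY_IP = "203.0.113.10"                         # address of the standby region's load balancer


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the primary region's API answers its health check."""
    try:
        return requests.get(API_HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def fail_over_dns() -> None:
    """Repoint the DNS record at the standby region via the Cloudflare API."""
    url = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}"
    payload = {
        "type": "A",
        "name": "api.example.com",  # hypothetical hostname
        "content": STANDBY_IP,
        "ttl": 60,                  # short TTL so the change takes effect quickly
        "proxied": True,
    }
    headers = {"Authorization": f"Bearer {CF_API_TOKEN}"}
    requests.put(url, json=payload, headers=headers, timeout=10).raise_for_status()


if __name__ == "__main__":
    if primary_is_healthy():
        sys.exit(0)
    fail_over_dns()
    sys.exit(1)  # non-zero exit so the scheduler/alerting can flag the failover
```

Run on a short interval from cron or a systemd timer (or wired into the existing alerting), something like this would take the manual DNS turnover out of the loop; Cloudflare's built-in Load Balancing monitors and origin pools would accomplish the same thing without a custom script.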
