On 7/29, the transformer went down for about 1.5 hours in one region and did not produce blocks. At the same time, the tracker API also went down in that region. Additionally, the RPC nodes were struggling when hit externally, though the extractor was still producing blocks through an internal load balancer. The RPC nodes came back before the API did. A backfill was also running in that region and, prior to the incident, had slowed to a crawl.
To recap some of the potential issues:
Kafka
The consumer offset was still increasing, so it was likely not this (see the lag-check sketch after this list). The API nodes also went down, which would not have been a Kafka issue.
External LB
Probably not this, as the transformer does not interact with the external load balancer.
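For reference, a minimal sketch of how the "offset was increasing" check can be verified with kafka-python; the broker address, topic, and consumer group names here are placeholders, not our actual configuration:

```python
# Minimal sketch: compare a consumer group's committed offsets against the
# latest broker offsets to confirm the group is still making progress.
# Broker, topic, and group names below are placeholders.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "kafka:9092"   # placeholder broker address
TOPIC = "blocks"           # placeholder topic
GROUP = "transformer"      # placeholder consumer group

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP,
    group_id=GROUP,
    enable_auto_commit=False,
)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

end_offsets = consumer.end_offsets(partitions)      # latest offset per partition
for tp in partitions:
    committed = consumer.committed(tp)              # last committed offset for the group
    lag = end_offsets[tp] - (committed or 0)
    print(f"partition {tp.partition}: committed={committed} end={end_offsets[tp]} lag={lag}")

consumer.close()
```

Running something like this during the incident is what distinguishes "the group has stalled" from "offsets are still advancing", which is what pointed away from Kafka here.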
Nothing in the logs was indicative of a failure. Actual downtime was about 15 minutes, and roughly 10 minutes of that was the time between the initial alarm and turning the DNS records over to the other region.
The only thing that would actually prevent downtime from this in the future is hooking up a health check that Cloudflare reacts to by taking the zone out of rotation (a sketch of such an endpoint follows below). If two zones had been serving traffic, there likely would have been no downtime, but at the time only one zone was live while the other was doing backfill syncs.
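A minimal sketch of what such a health endpoint could look like, again in Python. The port, the freshness threshold, and the way `last_block_time` is tracked are assumptions for illustration; the Cloudflare side would be a load-balancer health monitor pointed at this path.

```python
# Minimal sketch of a health endpoint Cloudflare could poll to decide whether
# to keep this region's zone in rotation. The freshness threshold and how
# last_block_time is updated are assumptions, not the actual implementation.
import time
from flask import Flask, jsonify

app = Flask(__name__)

# In the real service this would be updated whenever the transformer emits a block.
last_block_time = time.time()

MAX_BLOCK_AGE_SECONDS = 120  # assumed threshold; tune to the expected block cadence

@app.route("/healthz")
def healthz():
    age = time.time() - last_block_time
    healthy = age < MAX_BLOCK_AGE_SECONDS
    status = 200 if healthy else 503
    return jsonify({"healthy": healthy, "seconds_since_last_block": age}), status

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

With a Cloudflare health monitor hitting `/healthz`, a 503 from this region would pull it out of rotation automatically instead of waiting on a manual DNS turnover.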