Add alerts, integrated with Slack and OpsGenie, that trigger when the ingest rate slows down and the provider lag grows. We already have an alert for the ingest rate stopping for more than an hour, but it is not catching these gap-in-ingest issues.
We should look at existing alternative leading indicators to alert on. Namely:

- Probelab providers, which check lookup success for CIDs within 5 minutes of their publication.
- The lag value reported for providers on the /provider backends. In both recent incidents, the NFT.Storage lag on the /provider backends grew consistently. The lag for this particular provider should typically remain below 20.
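The lag indicator above could be turned into an alert rule along these lines. This is a minimal sketch, not the actual alerting config: the threshold of 20 comes from the typical lag mentioned above, while the sample window is a placeholder assumption to avoid paging on a single spike.

```python
LAG_THRESHOLD = 20  # typical lag for this provider stays below this value

def should_alert(lag_samples, window=6):
    """Alert only when the last `window` lag samples all exceed the
    threshold, so one transient spike does not trigger a page."""
    if len(lag_samples) < window:
        return False
    return all(lag > LAG_THRESHOLD for lag in lag_samples[-window:])
```

In a real deployment this logic would typically live in the alerting system itself (e.g. a rule with a sustained-duration condition) rather than in application code.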
Added additional alerts from metrics collected by the telemetry service. The Probelab data probably does not apply anymore.
The telemetry service can poll the head advertisement from NFT.Storage, extract some multihashes from it, and then look those multihashes up. An alert can be generated if the multihashes still cannot be looked up after some amount of time. Alternatively, the NFT.Storage provider distance can be tracked, and an alert generated if the distance grows too large.
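The first variant above can be sketched as follows. Everything here is an assumption for illustration: `lookup` stands in for querying the indexer for a multihash, `publish_times` maps each multihash from the head advertisement to when the telemetry service first saw it, and the 30-minute timeout is a placeholder for the unspecified "some amount of time".

```python
import time

LOOKUP_TIMEOUT = 30 * 60  # seconds; placeholder for the alerting grace period

def overdue_multihashes(publish_times, lookup, now=None):
    """Return multihashes that still cannot be looked up after the timeout.

    publish_times: dict mapping multihash -> unix time it was first seen
    lookup: callable returning True if the multihash resolves on the indexer
    """
    now = now if now is not None else time.time()
    overdue = []
    for mh, seen_at in publish_times.items():
        if now - seen_at > LOOKUP_TIMEOUT and not lookup(mh):
            overdue.append(mh)
    return overdue
```

A non-empty result would drive the alert; the distance-based variant would instead compare the provider's last-processed advertisement against the head and alert when the gap exceeds a threshold.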