Date: 03/10/2021
Authors: ocervello
Status: Complete, action items in progress
Summary: openfoodfacts.net down after a Docker storage driver configuration change. Proxmox containers `off-net`, `robotoff-net`, `robotoff-dev`, `mongo-dev`, and `monitoring` are unreachable.
Impact: Integration tests failing on openfoodfacts-dart (example), pre-prod environment (openfoodfacts.net) down.
Root Causes: Cascading failure, probably due to the Docker storage driver configuration change from `vfs` to `fuse-overlayfs`, and probably an incompatibility between LXC, Docker, and fuse-overlayfs that caused containers to crash and become unreachable over SSH. The exact root cause is still unknown, as some containers using `fuse-overlayfs` have not crashed.
Trigger: Unknown. First outage happened 3 days after the Docker storage driver change.
Resolution:
- Short term: revert the Docker storage driver configuration from `fuse-overlayfs` to `vfs` (see the sketch below).
- Long term: run the Docker containers in a QEMU VM instead.
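For reference, the short-term revert is a one-line change in Docker's daemon configuration. A minimal sketch, assuming the storage driver was set via `/etc/docker/daemon.json` (the exact mechanism used on the CTs is not recorded in this report):

```sh
# Hypothetical revert on one affected CT; assumes the storage driver was
# configured in /etc/docker/daemon.json rather than via daemon flags.
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "vfs"
}
EOF
systemctl restart docker
docker info --format '{{.Driver}}'   # should print: vfs
```

Note that each storage driver keeps its data in its own subdirectory of /var/lib/docker, so images and containers created under the previous driver become invisible after the switch and services must be re-deployed, which matches the re-deployment work described below.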
Detection: message on Slack #infrastructure channel + openfoodfacts-dart integration tests failing with timeouts.
Action Item | Type | Owner | Status
---|---|---|---
Revert Docker storage driver to `vfs` | mitigate | olivier | DONE
Snapshot off-net CT and start a new CT from the snapshot (see the sketch below) | mitigate | charles | FAILED
Create a vanilla CT with storage driver `vfs` and re-deploy openfoodfacts.net on it + ZFS mounts + NGINX config change | mitigate | charles, olivier, stephane, christian | IN PROGRESS
Open a ticket on the Proxmox forums to investigate the crash | process | charles | TODO
Run all crashed Docker containers on a QEMU VM for stability + ZFS mounts + NGINX configuration | prevent | olivier, charles, stephane, christian | #62
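For context, the failed snapshot-and-clone mitigation corresponds to standard Proxmox `pct` operations; a hedged sketch with placeholder VMIDs (the real container IDs are not listed in this report):

```sh
# Hypothetical reconstruction of the attempted mitigation.
# 101 = broken off-net CT, 201 = new CT; both VMIDs are placeholders.
pct snapshot 101 pre-crash                 # snapshot the broken CT
pct clone 101 201 --snapname pre-crash \
    --hostname off-net-clone               # start a new CT from the snapshot
pct start 201
```

In this incident the clone did not succeed, which is consistent with the observation below that even Proxmox administration tools (clones, snapshots) stop working under the suspected failure mode.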
What went well:
- Community and staff were alerted quickly that openfoodfacts.net was down
- Community and staff worked together to solve the issues
What went wrong:
- No explicit alert message was sent to the `productopener-alerts` Slack channel → we need integration tests on the openfoodfacts-server repository
- Too many CTs were brought down simultaneously; the storage engine change should have been made on only one host first, with a waiting period before rolling it out further
- Proxmox container cloning failed, increasing the ETTR (Estimated Time To Repair)
- Proxmox container failed to reboot, increasing the ETTR
- Too much noise on `productopener-alerts`, so failed deployments were missed
- Sysadmins were not aware of all the impacts of off-net downtime
- No single point to track the investigation and resolution (e.g. GitHub issue)
Where we got lucky:
- The outage did not bring down production, as production is still running on the Free machines
- Automated deployments allowed us to re-deploy openfoodfacts.net pretty fast
- The right people were available.
Lessons learned:
- Assuming the root cause is correct: Proxmox LXC + Docker + ZFS + the fuse-overlayfs storage driver can trigger severe issues where even Proxmox administration tools (clones, snapshots, etc.) do not work
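For anyone auditing this setup: running Docker inside a Proxmox LXC container normally requires the nesting feature, and the fuse-overlayfs driver additionally needs FUSE exposed to the CT. A sketch of the presumed CT configuration (the VMID is a placeholder; the actual feature flags used are not recorded here):

```sh
# Presumed CT features for Docker-in-LXC with fuse-overlayfs (VMID 101
# is a placeholder). nesting=1 allows running containers inside the CT;
# fuse=1 exposes /dev/fuse, which the fuse-overlayfs driver requires.
pct set 101 --features nesting=1,fuse=1
pct reboot 101
```

This LXC + FUSE + Docker stack is exactly what the suspected root cause points at, which is why the long-term action item moves the Docker workloads to a QEMU VM.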
Timeline:
- 15:46 ROOT CAUSE — Docker storage driver switched from `vfs` to `fuse-overlayfs` on all CTs with Docker deployments
- 02:17 OUTAGE BEGINS — Automated message on #infrastructure-alerts Slack channel about timeouts when trying to access world.openfoodfacts.net
- 19:16 OUTAGE BEGINS — Manual message by contributor on #infrastructure Slack channel about timeouts when trying to access world.openfoodfacts.net
- 9:23 Message on #infrastructure Slack channel that multiple containers are unresponsive.
- 14:36 OUTAGE MITIGATED, deployed openfoodfacts.net on a new machine. Mounts are still missing on disk.
- 14:45 Decision taken to switch the Docker containers to a QEMU VM.
- 15:36 Creation of a QEMU VM with 128 GB RAM, 8 cores, and a 196 GB drive (see the sketch after this timeline).
- 09:00 Starting to manually deploy the openfoodfacts-server, robotoff, robotoff-ann, and monitoring containers on the QEMU VM
- 09:30 The Open Food Facts server is deployed on the QEMU VM
- 10:20 Robotoff deployment is blocked by a CPU flag issue (the `avx` flag is needed by the TensorFlow library; see the sketch after this timeline for a common fix)
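The VM creation and a common fix for the AVX issue map to standard `qm` commands; a hedged sketch with placeholder VMID, storage, and bridge names (the actual options used are not recorded in this report):

```sh
# Hypothetical creation of the QEMU VM from the 15:36 timeline entry.
# VMID 200, storage "local-zfs", and bridge "vmbr0" are placeholders.
qm create 200 --name docker-host --memory 131072 --cores 8 \
    --net0 virtio,bridge=vmbr0 --scsi0 local-zfs:196

# For the Robotoff issue at 10:20: CPU type "host" passes the physical
# CPU's flags (including avx, which TensorFlow needs) through to the guest.
qm set 200 --cpu host
```

Setting `--cpu host` trades live-migration flexibility for full CPU feature pass-through, a common choice when a guest workload needs instruction-set extensions such as AVX.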
Follow-ups:
- Document a clear, realistic “acceptable downtime” for each CT/VM/machine we manage (using the existing spreadsheet).
- Document the main owner and his/her co-owner (?) for each machine, i.e. the people able to restore a service within the “acceptable downtime” and who own this responsibility.
- Decide how we document the infrastructure (not well decided yet).
- Is it possible to publish only real alerts in #infrastructure-alerts? E.g., only publish alerts if the machine is down for more than 15 minutes; most of the alerts seem to be false positives (see the sketch after this list).
- Define a process to resolve future incidents (e.g. should we systematically file a GitHub issue for each incident?)
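On the alert-noise question above: most monitoring systems can delay an alert until a condition has held for a set duration. A hedged sketch, assuming a Prometheus-style stack (the actual monitoring tool behind #infrastructure-alerts is not named in this report):

```sh
# Hypothetical Prometheus alerting rule: fire only if the target has been
# down continuously for 15 minutes, suppressing short-lived false positives.
# The file path, group name, and labels are placeholders.
cat > /etc/prometheus/rules/host-down.yml <<'EOF'
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} unreachable for 15 minutes"
EOF
```

The `for: 15m` clause implements exactly the "down for more than 15 minutes" threshold suggested above.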