Date: 03/10/2021
Authors: ocervello
Status: Complete, action items in progress
Summary: openfoodfacts.net down after a Docker storage driver configuration change. Proxmox containers `off-net`, `robotoff-net`, `robotoff-dev`, `mongo-dev`, and `monitoring` are unreachable.
Impact: Integration tests failing on openfoodfacts-dart (example), pre-prod environment (openfoodfacts.net) down.
Root Causes: Cascading failure, probably due to the Docker storage driver configuration change from `vfs` to `fuse-overlayfs`, and probably an incompatibility between LXC, Docker, and fuse-overlayfs that caused containers to crash and become unreachable over SSH. The exact root cause is still unknown, as some containers using `fuse-overlayfs` have not crashed.
Trigger: Unknown. First outage happened 3 days after the Docker storage driver change.
Resolution:
- Short term: revert the Docker storage driver configuration from `fuse-overlayfs` to `vfs` (see the sketch below).
- Long term: run the Docker containers in a QEMU VM instead.
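For reference, the short-term revert is a one-line change in Docker's daemon configuration. A minimal sketch, assuming the storage driver was set via `/etc/docker/daemon.json` (the exact mechanism used on the CTs is not recorded in this report):

```sh
# Hypothetical revert on one affected CT; assumes the storage driver was
# configured in /etc/docker/daemon.json rather than via daemon flags.
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "vfs"
}
EOF
systemctl restart docker
docker info --format '{{.Driver}}'   # should print: vfs
```

Note that each storage driver keeps its data in its own subdirectory of /var/lib/docker, so images and containers created under the previous driver become invisible after the switch and services must be re-deployed, which matches the re-deployment work described below.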
Detection: message on Slack #infrastructure channel + openfoodfacts-dart integration tests failing with timeouts.
Action Item | Type | Owner | Status
---|---|---|---
Revert Docker storage driver to `vfs` | mitigate | olivier | DONE
Snapshot off-net CT and start a new CT from the snapshot (see the sketch below) | mitigate | charles | FAILED
Create a vanilla CT with storage driver `vfs` and re-deploy openfoodfacts.net on it + ZFS mounts + NGINX config change | mitigate | charles, olivier, stephane, christian | IN PROGRESS
Open a ticket on the Proxmox forums to investigate the crash | process | charles | TODO
Run all crashed Docker containers on a QEMU VM for stability + ZFS mounts + NGINX configuration | prevent | olivier, charles, stephane, christian | #62
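For context, the failed snapshot-and-clone mitigation corresponds to standard Proxmox `pct` operations; a hedged sketch with placeholder VMIDs (the real container IDs are not listed in this report):

```sh
# Hypothetical reconstruction of the attempted mitigation.
# 101 = broken off-net CT, 201 = new CT; both VMIDs are placeholders.
pct snapshot 101 pre-crash                 # snapshot the broken CT
pct clone 101 201 --snapname pre-crash \
    --hostname off-net-clone               # start a new CT from the snapshot
pct start 201
```

In this incident the clone did not succeed, which is consistent with the observation below that even Proxmox administration tools (clones, snapshots) stop working under the suspected failure mode.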
What went well:
- Community and staff were alerted quickly that openfoodfacts.net was down
- Community and staff worked together to solve the issues
What went wrong:
- No explicit alert message was sent to the `productopener-alerts` Slack channel → we need integration tests on the openfoodfacts-server repository
- Too many CTs were brought down simultaneously; the storage engine change should have been made on only one host first, with a waiting period before rolling it out further
- Proxmox container cloning failed, increasing the ETTR (Estimated Time To Repair)
- Proxmox container failed to reboot, increasing the ETTR
- Too much noise on `productopener-alerts`, so failed deployments were missed
- Sysadmins were not aware of all the impacts of off-net downtime
- No single point to track the investigation and resolution (e.g. GitHub issue)
Where we got lucky:
- The outage did not bring down production, as production is still running on the Free machines
- Automated deployments allowed us to re-deploy openfoodfacts.net pretty fast
- The right people were available.
Lessons learned:
- Assuming the root cause is correct: Proxmox LXC + Docker + ZFS + the fuse-overlayfs storage driver can trigger severe issues where even Proxmox administration tools (clones, snapshots, etc.) do not work
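For anyone auditing this setup: running Docker inside a Proxmox LXC container normally requires the nesting feature, and the fuse-overlayfs driver additionally needs FUSE exposed to the CT. A sketch of the presumed CT configuration (the VMID is a placeholder; the actual feature flags used are not recorded here):

```sh
# Presumed CT features for Docker-in-LXC with fuse-overlayfs (VMID 101
# is a placeholder). nesting=1 allows running containers inside the CT;
# fuse=1 exposes /dev/fuse, which the fuse-overlayfs driver requires.
pct set 101 --features nesting=1,fuse=1
pct reboot 101
```

This LXC + FUSE + Docker stack is exactly what the suspected root cause points at, which is why the long-term action item moves the Docker workloads to a QEMU VM.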
Timeline:
- 15:46 ROOT CAUSE — Docker storage driver switched from `vfs` to `fuse-overlayfs` on all CTs with Docker deployments
- 02:17 OUTAGE BEGINS — Automated message on #infrastructure-alerts Slack channel about timeouts when trying to access world.openfoodfacts.net
- 19:16 OUTAGE BEGINS — Manual message by contributor on #infrastructure Slack channel about timeouts when trying to access world.openfoodfacts.net
- 9:23 Message on #infrastructure Slack channel that multiple containers are unresponsive.
- 14:36 OUTAGE MITIGATED, deployed openfoodfacts.net on a new machine. Mounts are still missing on disk.
- 14:45 Decision taken to switch the Docker containers to a QEMU VM.
- 15:36 Creation of a QEMU VM with 128 GB RAM, 8 cores, and a 196 GB drive (see the sketch after this timeline).
- 09:00 Starting to manually deploy the openfoodfacts-server, robotoff, robotoff-ann, and monitoring containers on the QEMU VM
- 09:30 The Open Food Facts server is deployed on the QEMU VM
- 10:20 Robotoff deployment is blocked by a CPU flag issue (the `avx` flag is needed by the TensorFlow library; see the sketch after this timeline for a common fix)
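The VM creation and a common fix for the AVX issue map to standard `qm` commands; a hedged sketch with placeholder VMID, storage, and bridge names (the actual options used are not recorded in this report):

```sh
# Hypothetical creation of the QEMU VM from the 15:36 timeline entry.
# VMID 200, storage "local-zfs", and bridge "vmbr0" are placeholders.
qm create 200 --name docker-host --memory 131072 --cores 8 \
    --net0 virtio,bridge=vmbr0 --scsi0 local-zfs:196

# For the Robotoff issue at 10:20: CPU type "host" passes the physical
# CPU's flags (including avx, which TensorFlow needs) through to the guest.
qm set 200 --cpu host
```

Setting `--cpu host` trades live-migration flexibility for full CPU feature pass-through, a common choice when a guest workload needs instruction-set extensions such as AVX.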
Follow-ups:
- Document a clear, realistic “acceptable downtime” for each CT/VM/machine we manage (using the existing spreadsheet).
- Document the main owner and his/her co-owner (?) for each machine, i.e. the people able to restore a service within the “acceptable downtime” and who own this responsibility.
- Decide how we document the infrastructure (not well decided yet).
- Is it possible to publish only real alerts in #infrastructure-alerts? E.g., only publish alerts if the machine is down for more than 15 minutes; most of the alerts seem to be false positives (see the sketch after this list).
- Define a process to resolve future incidents (e.g. should we systematically file a GitHub issue for each incident?)
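On the alert-noise question above: most monitoring systems can delay an alert until a condition has held for a set duration. A hedged sketch, assuming a Prometheus-style stack (the actual monitoring tool behind #infrastructure-alerts is not named in this report):

```sh
# Hypothetical Prometheus alerting rule: fire only if the target has been
# down continuously for 15 minutes, suppressing short-lived false positives.
# The file path, group name, and labels are placeholders.
cat > /etc/prometheus/rules/host-down.yml <<'EOF'
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} unreachable for 15 minutes"
EOF
```

The `for: 15m` clause implements exactly the "down for more than 15 minutes" threshold suggested above.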