docs: ovh3 better backups
alexgarel committed Oct 31, 2024
1 parent 3e91ffd commit 92855ee
Showing 3 changed files with 135 additions and 17 deletions.
19 changes: 19 additions & 0 deletions docs/logs-ovh3.md
@@ -3,6 +3,25 @@
Report here the timeline of incidents and interventions on ovh3 server.
Keep things short or write a report.

## 2024-10-31 system taking 100% CPU

* Server is not accessible via SSH.
* Munin shows 100% CPU taken by the system.
* We ask for a hard reboot via the OVH console.
* After the restart, the system still uses 100% CPU.
* Top shows that arc_prune + arc_evict are using 100% CPU.
* Exploring the logs does not show any strange messages.
* `cat /proc/spl/kstat/zfs/arcstats|grep arc_meta` shows arc_meta_used < arc_meta_max (so it's ok).
* We soft reboot the server.
* It is back to normal.
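
For future reference, a quick sketch of how to check overall ARC pressure and spot the culprit threads (the `size` and `c_max` fields are the current ARC usage and its limit):

```bash
# overall ARC usage vs. its configured limit
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats
# check whether the arc_prune / arc_evict kernel threads are the ones burning CPU
top -b -n 1 | grep -E 'arc_(prune|evict)'
```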

## 2024-10-10

sda on ovh3 is faulty (64 Current_Pending_Sector, 2 Reallocated_Event_Count).
See https://github.com/openfoodfacts/openfoodfacts-infrastructure/issues/424
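
The SMART counters can be re-checked at any time with smartctl (from the `smartmontools` package); a small sketch:

```bash
# show SMART attributes for the suspect disk, keeping only the relevant counters
smartctl -A /dev/sda | grep -E 'Current_Pending_Sector|Reallocated'
```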



## 2023-12-05 certificates for images expired

Images not displaying anymore on the website due to SSL problem (signaled by Edouard, with alert by blackbox exporter)
46 changes: 29 additions & 17 deletions docs/proxmox.md
@@ -6,8 +6,31 @@ On ovh1 and ovh2 we use proxmox to manage VMs.

## Proxmox Backups

Every VM / CT is backed up twice a week using the general proxmox backup, in a specific zfs dataset
(see Datacenter -> backup)
**IMPORTANT:** We don't use standard proxmox backup[^previous_backups] (see Datacenter -> backup).

Instead we use [syncoid / sanoid](./sanoid.md) to snapshot and synchronize data to other servers.

[^previous_backups]: Previously, every VM / CT was backed up twice a week using the general proxmox backup, in a specific zfs dataset.

## Storage synchronization

We don't use standard proxmox storage replication because it is incompatible with [syncoid / sanoid](./sanoid.md): it removes snapshots on the destination and does not allow choosing the destination location.

This means that restoring a container / VM won't be automatic and will need manual intervention.
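
For reference, a manual restore could look roughly like the sketch below. This is only an illustrative sketch: CT 130 and root ssh access to ovh3 are assumptions, and the CT configuration is not covered by the sync.

```bash
# pull the synced copy of the CT disk back from ovh3 onto the proxmox host (sketch)
syncoid root@ovh3:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
# the CT configuration (/etc/pve/lxc/130.conf) is not part of the synced dataset
# and must still exist (or be recreated) before starting the container
pct start 130
```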

### Replication (don't use it)

Previously, VM and container storage were regularly synchronized to ovh3 (and possibly to ovh1/2).

Replication can be seen in the web interface, clicking on "replication" section on a particular container / VM.

This is managed with command line `pvesr` (PVE Storage replication). See [official doc](https://pve.proxmox.com/wiki/Storage_Replication)

* To add replication on a container / VM:
  * In the Replication menu of the container / VM, "Add" one
  * Target: the server you want
  * Schedule: `*/5` if you want it every 5 minutes (takes less than 10 seconds, thanks to ZFS)
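
The same jobs can be handled from the command line with `pvesr`; a rough sketch, kept here for reference only since we no longer use replication (the job id `130-0` and target `ovh3` are just examples):

```bash
# list existing replication jobs and their current status
pvesr list
pvesr status
# create a replication job for CT/VM 130 towards ovh3, running every 5 minutes
pvesr create-local-job 130-0 ovh3 --schedule "*/5"
```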


## Host network configuration

@@ -117,22 +140,16 @@ At OVH we have special DNS entries:
* `proxy1.openfoodfacts.org` pointing to OVH reverse proxy
* `off-proxy.openfoodfacts.org` pointing to Free reverse proxy

## Storage synchronization

VM and container storage are regularly synchronized to ovh3 (and eventually to ovh1/2) to have a continuous backup.

Replication can be seen in the web interface, clicking on "replication" section on a particular container / VM.

This is managed with command line `pvesr` (PVE Storage replication). See [official doc](https://pve.proxmox.com/wiki/Storage_Replication)


## How to migrate a container / VM

You may want to move containers or VMs from one server to another.

Just go to the interface, right click on the VM / Container and ask to migrate !
**FIXME** this will not work with sanoid/syncoid.

~~Just go to the interface, right click on the VM / Container and ask to migrate !~~

If you have a large disk, you may want to first set up replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediately (schedule button) and then run the migration.
~~If you have a large disk, you may want to first set up replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediately (schedule button) and then run the migration.~~

## How to Unlock a Container

@@ -254,11 +271,6 @@ Using the web interface:
* Start at boot: Yes
* Protection: Yes (to avoid deleting it by mistake)

* Eventually Add replication to ovh3 or off1/2 (if we are not using sanoid/syncoid instead)
* In the Replication menu of the container, "Add" one
* Target: ovh3
* Schedule: */5 if you want every 5 minutes (takes less than 10 seconds, thanks to ZFS)

Also think about [configuring email](./mail.md#postfix-configuration) in the container

## Logging in to a container or VM
87 changes: 87 additions & 0 deletions docs/reports/2024-10-30-ovh3-backups.md
@@ -0,0 +1,87 @@
# 2024-10-30 OVH3 backups

We need an intervention to change a disk on ovh3.

We still have very few backups for OVH services.

Before the operation, I want to at least have replication of OVH backups on the new MOJI server.

We previously tried to do it while keeping replication, but it did not work well.

So here is what we are going to do:
* remove replication and let sanoid / syncoid deal with replication to ovh3
* we will have fewer snapshots on the ovh1/ovh2 side and we will use a replication snapshot
  to avoid relying on a common existing snapshot made by syncoid

Note that we don't replicate between ovh1 and ovh2 because we have very little space left on disks.

## Changing sanoid / syncoid config and removing replication

First, because we won't use replication anymore, we have to create the ovh3operator on ovh1 and ovh2,
and as we want to use a replication snapshot, we have to grant the corresponding ZFS rights.
See [Sanoid / creating operator on PROD_SERVER](../sanoid.md#creating-operator-on-prod_server)
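
As a rough illustration only (the authoritative permission list is in the sanoid doc linked above; the user and dataset names are examples), delegating ZFS rights to a non-root operator looks like:

```bash
# delegate the ZFS permissions a non-root sync operator typically needs (adjust to the sanoid doc)
zfs allow ovh3operator create,destroy,hold,mount,receive,rollback,send,snapshot rpool
# review what has been delegated on the pool
zfs allow rpool
```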

I also had to link the zfs command into /usr/bin: `ln -s /usr/sbin/zfs /usr/bin`

For each VM / CT separately I did:
* disable replication
* I didn't have to change the syncoid policy on ovh1/2
* on ovh3:
  * change the sanoid policy from replicated_volumes to a regular synced_data one
  * use syncoid to sync the volume and add the line to syncoid-args (using a specific snapshot)

I tried with CT 130 (contents) first, running the syncoid command manually:
```bash
syncoid --no-privilege-elevation [email protected]:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
```
I did some less important CTs and VMs (113, 140, 200) and decided to wait until the next day to check that everything was ok.

The day after, I did the same for CTs 101, 102, 103, 104, 105, 106, 108, 202 and 203 on ovh1,
and 107, 110 and 201 on ovh2.

I removed 109 and 120.

I also removed the sync of CT 107 to ovh1 (because we are nearly out of disk space) and removed the volume there.

## Syncing between ovh1 and ovh2

We have two VMs that are really important to replicate from ovh1 to ovh2 (using syncoid).

So I [created an ovh2operator on ovh1](../sanoid.md#creating-operator-on-prod_server).

I installed the syncoid systemd service and enabled it.
Same for `sanoid_check`.

I added the synced_data template to sanoid.conf to use it for synced volumes.
I removed volumes 101 and 102 and manually synced them from ovh1 to ovh2.
Then I added them to `syncoid-args.conf`.
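
For illustration only (the retention values and the exact syncoid-args.conf line format are assumptions, not the real configuration), the additions could look roughly like:

```bash
# sketch: a template for volumes that only receive synced snapshots (no local autosnap)
cat >> /etc/sanoid/sanoid.conf <<'EOF'
[template_synced_data]
        autosnap = no
        autoprune = yes
        hourly = 0
        daily = 30
        monthly = 3
EOF

# sketch: assuming syncoid-args.conf lives next to sanoid.conf and holds one set of syncoid arguments per line
echo "--no-privilege-elevation ovh2operator@ovh1:rpool/vm-101-disk-0 rpool/vm-101-disk-0" \
  >> /etc/sanoid/syncoid-args.conf
```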

## Removing dump backups on ovh1/2

We also decided to remove dump backups on ovh1/2/3.

Going into the proxmox interface (Datacenter -> Backup), I disabled the backups.

## Removing replication snapshots

On ovh1, ovh2 and ovh3, I removed the `__replicate_` snapshots.

```bash
zfs list -r -t snap -o name rpool|grep __replicate_
zfs list -r -t snap -o name rpool|grep __replicate_|xargs -n 1 zfs destroy
```
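
Before running the destructive second command, the selection can be double-checked with ZFS's dry-run flag; a small sketch:

```bash
# dry run: -n only shows what would be destroyed, -v prints the snapshot names
zfs list -r -t snap -o name rpool|grep __replicate_|xargs -n 1 zfs destroy -n -v
```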

Also on osm45:
```bash
zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_
zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_|xargs -n 1 zfs destroy
```

I did the same for the vz_dump snapshots, as the dump backups are no longer active.
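
The same pattern applies; assuming the dump snapshots contain `vzdump` in their name (adjust the grep if needed):

```bash
# sketch: list then remove the proxmox dump snapshots
zfs list -r -t snap -o name rpool|grep vzdump
zfs list -r -t snap -o name rpool|grep vzdump|xargs -n 1 zfs destroy
```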


## Checking syncs on osm45 (Moji)

We would not need to use a sanoid-specific snapshot on moji anymore, but I'll leave it like this for now!

Syncs seem ok.
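
A quick way to verify is to look at the most recent snapshots received on osm45, or to use sanoid's monitoring mode (the dataset path below is an assumption based on the names used earlier in this report):

```bash
# newest snapshots received under the synced backup dataset
zfs list -r -t snap -o name,creation -s creation hdd-zfs/off-backups/ovh3-rpool | tail -n 5
# sanoid's nagios-style freshness check (if sanoid monitoring is configured for these datasets)
sanoid --monitor-snapshots
```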
