Backups are broken since upgrade to 1.27 #4308

Open
franco-martin opened this issue Nov 18, 2023 · 19 comments

@franco-martin

franco-martin commented Nov 18, 2023

Summary

The backup is a 213-215 byte tar.gz file that contains nothing. This started happening after upgrading to 1.27.
Running /snap/microk8s/current/bin/migrator --endpoint unix://${SNAP_DATA}/var/kubernetes/backend/kine.sock:12379 --mode backup-dqlite --db-dir ./ directly hangs for more than a day and never completes.
I created another issue, #4259, but responses there stopped 3 weeks ago.

What Should Happen Instead?

The backup should work and should probably be around 90 MB.

Reproduction Steps

  • Create script.sh with the following content
#!/bin/bash
# Reproduces the empty-backup issue by restoring a known-broken datastore.
if [ -z "$1" ]; then
        echo "Provide the server's IP address as the first parameter"
        exit 1
fi
IP="$1"
cd
wget https://github.com/franco-martin/test-franco/releases/download/microk8s-4308/small-broken-backup.tar.gz
snap install microk8s --classic --channel=1.28/stable
microk8s stop
cp small-broken-backup.tar.gz /var/snap/microk8s/current/var/kubernetes/
cd /var/snap/microk8s/current/var/kubernetes/
# Move the fresh datastore aside and unpack the broken one in its place.
mv backend backend_2
tar -xzvf small-broken-backup.tar.gz
# Replace the original node's IP with this server's IP.
sed -i "s/192\.168\.1\.79/$IP/g" backend/localnode.yaml
sed -i "s/192\.168\.1\.79/$IP/g" backend/info.yaml
sed -i "s/192\.168\.1\.79/$IP/g" backend/cluster.yaml
# Reconfigure dqlite using the edited cluster.yaml.
/snap/microk8s/current/bin/dqlite \
  -s 127.0.0.1:19001 \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s ".reconfigure /var/snap/microk8s/current/var/kubernetes/backend/ /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml"
cd
microk8s start
  • Get the IP of your server and run bash ./script.sh YOUR_IP
  • Verify the backup was restored by running microk8s kubectl get nodes. You should have two nodes.
  • Run a backup
  • Verify its size

Introspection Report

inspection-report-20231019_220512.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

I'll help with whatever I can.

@ktsakalozos
Member

ktsakalozos commented Nov 20, 2023

Hi @franco-martin, would it be possible for you to use microk8s.dbctl backup -o backup? microk8s.dbctl is the expected way to take backups. Of course, you can also take a copy of the entire DB directory, as was suggested in #4259 (comment).
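
As a rough illustration of the "copy the entire DB directory" fallback mentioned above, the following Python sketch packs a directory into a tarball. The function name `archive_db_dir` and the constant `DEFAULT_DB_DIR` are hypothetical; the path is the one used in the reproduction script in this issue. Whether a copy taken while dqlite is running is consistent is not established in this thread, so treat this as a sketch rather than a supported backup method.

```python
import os
import tarfile

# Assumption: datastore location as used in the reproduction script above.
DEFAULT_DB_DIR = "/var/snap/microk8s/current/var/kubernetes/backend"


def archive_db_dir(db_dir, archive_path):
    """Pack the whole datastore directory into a gzip-compressed tarball.

    Entries that tarfile cannot represent (for example unix sockets such as
    kine.sock) are skipped by tarfile.add rather than raising an error.
    """
    top_level = os.path.basename(os.path.normpath(db_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(db_dir, arcname=top_level)
    return archive_path
```

A call such as `archive_db_dir(DEFAULT_DB_DIR, "backend-copy.tar.gz")` would need to run as root on the node.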

@franco-martin
Author

franco-martin commented Nov 20, 2023 via email

@franco-martin
Author

Folks, I understand you all have a lot of work to do, but please communicate if this is not a priority. Ghosting people who are reporting issues and are willing to troubleshoot them is not the way to get more contributors.
@ktsakalozos I'm waiting for a way to send you the backup so that you can reproduce the issue. I'm not going to post my whole Kubernetes cluster information here for anyone to download.

@ktsakalozos
Member

ktsakalozos commented Nov 27, 2023

Hi @franco-martin, sorry for the late reply. My main issue is that in my tests backups do not produce an empty file. Even if you sent me an empty tar.gz I would not be able to do anything with it. Can you help me reproduce this? Using VMs, can you walk me through the steps you follow that result in an empty tarball? Thank you.

@franco-martin
Author

franco-martin commented Nov 27, 2023 via email

@ktsakalozos
Member

> I have a backup that when imported into a cluster, breaks the backup functionality. How do I send it privately?

Let's talk on the #microk8s channel over at the Kubernetes Slack. I am kjackal there.

@franco-martin
Author

Updated the description with reproduction steps and a publicly available download link. Hopefully that'll speed things up.

@ktsakalozos
Member

Hi @franco-martin, going through the steps you provided I was not able to get to a functioning cluster.

@franco-martin
Author

franco-martin commented Dec 8, 2023

I just tried it on a new node and it worked. What's failing on your side? I'm testing 1.28/stable; I went through it step by step and it works every time.
Here are things you can try:

  • use Ubuntu 22.04 (I don't know if it makes any difference)
  • set the hostname to test-mk8s
  • set the IP to 192.168.1.79 (I tested it with other IPs as well, but it's the original IP of the testing node I used)

If none of these work, please tell me where it breaks. I'm able to run microk8s kubectl get nodes and it shows two nodes (mk8s09 and test-mk8s).

@franco-martin
Author

I just updated the description. I built a script that automates the process. I verified it in a brand new t3.medium instance on AWS and it reproduced.

@sbidoul

sbidoul commented Feb 10, 2024

I'm also seeing this issue of empty backup files. They contain a single empty directory. My cluster is on 1.29, but I don't know when the problem started.

@jmyoung

jmyoung commented Apr 24, 2024

Also seeing this issue, even with v1.29.2 (latest version of the 1.29/stable track). microk8s dbctl backup produces a tgz of size 216 bytes which contains nothing.

@JonasFlodin-S

Hello everyone.
I can confirm this behavior after updating a multi-node microk8s cluster (classic) from 1.26.14 stable to 1.27.11 stable a couple of weeks ago. Before, our daily backups had a size of around 6.2M. Now they're at 216 bytes, just as @franco-martin described.

It doesn't make a difference whether we use plain microk8s.dbctl backup or specify a filename with microk8s.dbctl backup -o backup.tar.gz; the resulting "backup" file is always useless.
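
A drop like the one described above (about 6.2M down to 216 bytes) is easy to miss in an unattended nightly job. As a minimal sketch, a size comparison against the previous backup could flag it. The function name `backup_shrank` and the `min_ratio` threshold are hypothetical and not part of microk8s.

```python
def backup_shrank(previous_size, current_size, min_ratio=0.5):
    """Return True when the new backup is suspiciously smaller than the last.

    Sizes are in bytes. min_ratio is an arbitrary illustrative threshold:
    the new backup is flagged if it is below that fraction of the old one.
    """
    if previous_size <= 0:
        # Nothing meaningful to compare against.
        return False
    return current_size < previous_size * min_ratio
```

The sizes could come from `os.path.getsize` on yesterday's and today's archives.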

@ole1986

ole1986 commented May 15, 2024

According to https://microk8s.io/docs/restore-quorum#stop-dqlite-on-all-nodes-1 it is required to stop all dqlite instances using microk8s stop while backing up the dqlite. Might this be the reason why microk8s dbctl backup fails?

@JonasFlodin-S

> According to https://microk8s.io/docs/restore-quorum#stop-dqlite-on-all-nodes-1 it is required to stop all dqlite instances using microk8s stop while backing up the dqlite. Might this be the reason why microk8s dbctl backup fail?

It looks like you stumbled across the microk8s documentation that explains recovering a multi-node cluster after a failure with a lost quorum. The issue in this thread, however, is about backups taken with microk8s dbctl not working at all, even during normal cluster operation without any known problems. A healthy cluster shouldn't have to be stopped via microk8s stop for a backup task to work properly. I cannot shut down a whole cluster with 200+ pods every night for a backup process to run and then start it again afterwards...

As the author of this thread and many others (me included) mentioned, the microk8s dbctl backup procedure worked flawlessly until upgrading from microk8s channel 1.26.x to a newer version, after which the backup process broke. Hence I don't think there's a link to stopping the service before creating backups via dbctl. Still, I followed your suggestion and stopped the service (in a clean test environment). Nothing happens when I try to create a backup afterwards. The console prints “Backing up datastore” and the backup process then seems to freeze because the services are not running.

I just reproduced the problem with the following steps:

  • create a Linux VM, e.g. Ubuntu 22.04.03 LTS / Server
  • be root, or add sudo when executing the following commands in a terminal
  • install microk8s via snap install microk8s --classic --channel=1.26
  • check if working as intended with e.g. microk8s status, microk8s inspect, microk8s kubectl get all -A
  • start a dbctl backup task microk8s dbctl backup, which should create a file like backup-2024-05-21-15-08-24.tar.gz in your pwd
  • show content of backup tar-file with tar -tf backup-2024-05-21-15-08-24.tar.gz, which should give you an output of several .data and .key files, named with numbers (e.g. backup-2024-05-21-15-08-24/167.data, backup-2024-05-21-15-08-24/143.data, backup-2024-05-21-15-08-24/95.key, etc.).
  • The backup-2024-05-21-15-08-24.tar.gz file has a size of 116K

So far so good. Seems to be working with 1.26.14.
Now update the snap to a newer minor version and try the backup again.

  • update via snap refresh microk8s --channel=1.27/stable
  • start backup microk8s dbctl backup, which again should create a file like backup-2024-05-21-15-24-00.tar.gz in the current folder
  • check its content tar -tf backup-2024-05-21-15-24-00.tar.gz
  • You'll see it's empty. The only output will look like this: backup-2024-05-21-15-24-00/ - that's it.
  • the backup-2024-05-21-15-24-00.tar.gz file has a size of 216 Bytes

This seems to be exactly the problem that the author described above.
I'd really like to know why this happens and how to solve it.
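
The manual tar -tf check in the steps above can also be scripted. The following is a minimal Python sketch that counts the regular files in a backup archive; the function names `backup_file_count` and `backup_is_empty` are hypothetical helpers, not part of microk8s.

```python
import tarfile


def backup_file_count(path):
    """Return the number of regular files inside a .tar.gz backup."""
    with tarfile.open(path, "r:gz") as archive:
        return sum(1 for member in archive.getmembers() if member.isfile())


def backup_is_empty(path):
    """True when the archive holds no regular files (directories only)."""
    return backup_file_count(path) == 0
```

An archive that lists only the top-level backup directory, as in the 1.27 case above, would be reported as empty, while one that lists .key and .data files would not.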

@koss822

koss822 commented Jun 5, 2024

Same issue here; it does not work even with 1.30. The problem is that I cannot revert to an older version, because the cluster stops working when I do.

$ microk8s dbctl backup -o test
Backing up the datastore
INFO[0000] Starting migrator                             dir=/tmp/tmpy_e7ckac/test endpoint="unix:///var/snap/microk8s/6876/var/kubernetes/backend/kine.sock:12379" mode=backup-dqlite
The backup is: test.tar.gz
$ ll test.tar.gz
-rw-rw-r-- 1 user group 192 Jun  5 06:31 test.tar.gz

@sbidoul

sbidoul commented Jun 28, 2024

Hi, is there anything we users can do to diagnose the issue? --debug does not show any additional information, nor is anything visible in the logs. But the resulting tar.gz is empty.

# microk8s dbctl --debug backup -o test
Backing up the datastore
INFO[0000] Starting migrator                             dir=/tmp/tmpe6zb425z/test endpoint="unix:///var/snap/microk8s/6876/var/kubernetes/backend/kine.sock:12379" mode=backup-dqlite
The backup is: test.tar.gz

@franco-martin
Author

@sbidoul I tried a bunch of things, but without much success. What you could do is debug "https://github.com/canonical/microk8s/blob/master/scripts/wrappers/dbctl.py" against an existing cluster. I haven't had the time to do that, but if there's business interest in this, you might be able to convince a stakeholder to back you up in the name of open source.

@sbidoul

sbidoul commented Jun 29, 2024

I traced it a little, as far as the kine client.List call, which indeed returns an empty result.

https://github.com/canonical/k8s-dqlite/blob/36366fd2213050d57fa2a36fb6f29f7e04a8d7b2/pkg/migrator/migrator.go#L94

No time to dig deeper for now...
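
To illustrate why an empty List result would produce the tiny archives reported in this thread, here is a Python sketch of a backup loop that writes each entry as N.key / N.data and then tars the directory. This is an illustration only: the layout is assumed from the file names reported earlier in the thread, and `write_backup` is a hypothetical function, not the k8s-dqlite migrator code.

```python
import os
import tarfile
import tempfile


def write_backup(entries, name, out_dir):
    """Write (key, value) byte pairs as N.key / N.data and tar the result.

    Assumption: mirrors the file naming seen in working backups in this
    thread; the real migrator may differ in detail.
    """
    staging = tempfile.mkdtemp()
    backup_dir = os.path.join(staging, name)
    os.mkdir(backup_dir)
    for index, (key, value) in enumerate(entries):
        with open(os.path.join(backup_dir, f"{index}.key"), "wb") as fh:
            fh.write(key)
        with open(os.path.join(backup_dir, f"{index}.data"), "wb") as fh:
            fh.write(value)
    archive_path = os.path.join(out_dir, name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # With no entries, only the directory itself ends up in the archive.
        archive.add(backup_dir, arcname=name)
    return archive_path
```

With an empty `entries` list the loop never runs, so the archive contains a single directory entry and compresses to a very small file, which matches the symptom but does not explain why the list comes back empty.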
