Multi-master support for K3s Ansible playbook? #32

Closed
Tracked by #165
geerlingguy opened this issue May 20, 2020 · 39 comments
Labels
enhancement New feature or request

Comments

@geerlingguy
Contributor

This is something I think we might be able to get configured in the Ansible playbook, but I didn't see (at a glance, at least) whether it was something supported by this playbook yet: namely, a multi-master configuration with an external database, as described in High Availability with an External DB.

In this playbook's case, maybe it would delegate the task of configuring an external database cluster to the user (e.g. use a separate Ansible playbook that builds an RDS cluster in Amazon, or a separate two- or three-node DB cluster on some other bare metal servers alongside the K3s cluster), but then how could we make this playbook support the multi-master configuration described in the docs page linked above?
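For context, a minimal sketch of what the external-DB mode looks like per master, assuming the user has already provisioned the database (the endpoint and credentials below are hypothetical):

# On each master, pointing k3s at an already-provisioned MySQL-compatible
# cluster (hypothetical endpoint and credentials):
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="mysql://user:pass@tcp(db.example.com:3306)/k3s"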

@sethcohn

This role does this:
https://github.com/PyratLabs/ansible-role-k3s

@MrDienns

MrDienns commented Aug 6, 2020

Would love to have this supported by Rancher out of the box. Any plans on implementing this feature in this repository?

@2fst4u

2fst4u commented Nov 15, 2020

Now that internal HA is implemented, even if only experimentally, does this playbook already install and set up a cluster to utilise this feature? If not, can it be updated to support it?

@St0rmingBr4in
Collaborator

I've implemented the base mechanism in PR #97. It sets up HA with the embedded etcd database mode.
It is still missing some parts and still needs some work to be mergeable. But if you want to play a bit, you could try checking out the branch and tell me how it goes :)

@2fst4u

2fst4u commented Nov 15, 2020

I'd personally love to, but my pitiful single-node cluster is clearly not going to be much help!

I'd love to know if someone else can confirm it though. This would be so awesome.

@St0rmingBr4in
Collaborator

Maybe it's time to create a proper Vagrantfile to test it, or finish the work in #52 to test it inside Docker.

@clarkezone

Would love this as well

@2fst4u

2fst4u commented Feb 13, 2021

I've implemented the base mechanism in PR #97. It sets up HA with the embedded etcd database mode.
It is still missing some parts and still needs some work to be mergeable. But if you want to play a bit, you could try checking out the branch and tell me how it goes :)

How is this going? I see the PR looks pretty ready to go. Did it work nicely in testing?

@St0rmingBr4in
Collaborator

I've implemented the base mechanism in PR #97. It sets up HA with the embedded etcd database mode.
It is still missing some parts and still needs some work to be mergeable. But if you want to play a bit, you could try checking out the branch and tell me how it goes :)

How is this going? I see the PR looks pretty ready to go. Did it work nicely in testing?

Actually it's working pretty well; I use it to deploy my cluster at home. The only thing missing is a variable that stores the loadbalanced endpoint that slaves need to use to talk to the apiserver. I will work on this tomorrow; this has been in a stalled state for far too long.
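Presumably that would end up as a single inventory variable, something like the sketch below (the variable name is illustrative, not necessarily what the PR uses):

# group_vars/all.yml (hypothetical variable name)
# The address agents use to reach the apiserver; in an HA setup this
# should be a loadbalanced or virtual IP, not the first master's address.
apiserver_endpoint: "192.168.1.100"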

@2fst4u

2fst4u commented Feb 13, 2021

Sounds brilliant.

@St0rmingBr4in
Collaborator

@2fst4u I updated #97. If you could give it a try, that would be great ^^ I'll be updating the README accordingly soon.

@2fst4u

2fst4u commented Feb 27, 2021

@St0rmingBr4in just getting around to giving this a go (as my new RPi 4s arrived in the mail!), and I noticed you put a comment in your commit stating:

# If you define multiple masters you should be providing a loadbalanced
# apiserver endpoint to all masters here. This default value is only suitable
# for a non-HA setup, if used in a HA setup, it will not protect you if the
# first node fails.

I'm not sure I fully understand that. In the official K3s docs, it appears that adding a highly available second and third master is just a matter of joining them to the first:

K3S_TOKEN=SECRET k3s server --server https://<ip or hostname of server1>:6443

with no mention of a highly available API endpoint. Is there something I'm missing?

@scruffynerf

with no mention of a highly available API endpoint. Is there something I'm missing?

For HA to work, you need an IP living on a load balancer (or some other front-facing IP answer, like round-robin DNS; not recommended, just an example), so that if node 1 fails, the "HA IP" remains stable and working.

If node 1 is also the primary IP, when it fails, there is no way to talk to the others at that IP.
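One common way to get such a stable "HA IP" without a dedicated load balancer box is a VRRP virtual IP floated between the masters, e.g. with keepalived. A minimal sketch, with a hypothetical interface name and address:

# /etc/keepalived/keepalived.conf on each master (values are examples)
vrrp_instance k3s_vip {
    state BACKUP              # let VRRP elect which node holds the VIP
    interface eth0            # NIC carrying cluster traffic
    virtual_router_id 51
    priority 100              # raise on the preferred holder if desired
    virtual_ipaddress {
        192.168.1.100         # the stable "HA IP" clients keep using
    }
}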

@2fst4u

2fst4u commented Feb 27, 2021

Is this something that might need to be added to those K3s docs to make it clearer? I can't seem to find any mention of this being a concern (and it seems like a pretty big one).

I'm nowhere near talented enough to go digging through their code, but I thought there might be some sort of quorum system where all the nodes designated as masters share their IPs and use heartbeats to figure out if one is down. Am I way off base thinking that?

My concern is that going to the point of using a load balancer might introduce yet another single point of failure, unless I then run the load balancer in HA, and so on.

@St0rmingBr4in
Collaborator

Hi,
Yes, you need some kind of HA load balancer if you don't want any single point of failure. Master nodes do not need it; you can live with an HA setup of k3s using only masters, without having to set up a load balancer. But once you add a slave, the slave needs a way of talking to the masters via some kind of load balancer.

There are actually multiple ways to configure masters and slaves in HA mode.
You should read https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md for a lot more information on HA setups. (You can mentally replace k8s with k3s and kubespray with k3s-ansible; they are analogous 🙂)

Here I am implementing the "External, no internal" endpoint type as described in kubespray, since it is the easiest to set up. This setup requires the user to have an external load balancer running so that slaves and kube CLI users have an HA way of talking to the apiserver.

In the future we might also want to give this playbook the ability to set up an HAProxy or an nginx running on each slave, ensuring we can talk to the apiserver through an HA endpoint by talking to localhost instead (the "Local LB" endpoint type as described in kubespray).
The local LB way of talking to the apiserver is a bit more complex, but a user of this playbook would not need to set up an external load balancer.
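For the "External, no internal" type described above, the external load balancer only needs to forward TCP to port 6443 on each master. A minimal HAProxy sketch, with hypothetical master addresses:

# haproxy.cfg fragment (master addresses are examples)
frontend k3s_apiserver
    bind *:6443
    mode tcp
    default_backend k3s_masters

backend k3s_masters
    mode tcp
    option tcp-check          # health-check masters so failover works
    balance roundrobin
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check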

@2fst4u

2fst4u commented Feb 27, 2021

Ah, I see, so it's the workers that need a load-balanced endpoint. If, hypothetically, you just had three masters and no workers, no load-balanced endpoint would be required, right?

@St0rmingBr4in
Collaborator

Right

@2fst4u

2fst4u commented Feb 27, 2021

Cool, thanks for that. Sorry for the noob questions; hopefully if someone else is as confused, they'll be able to find this clarification.

@2fst4u

2fst4u commented Mar 1, 2021

@St0rmingBr4in I gave your PR #97 a try, with three master nodes and no workers. I can't get past the following error:

TASK [k3s/master : Verify that all nodes actually joined]
FAILED - RETRYING: Verify that all nodes actually joined (20 retries left).

And it tries all 20 times and fails. As far as I can tell, the k3s.service doesn't get started, and therefore k3s isn't running on the first master node. I then tried running the master branch playbook first and then the PR version (to see if that might leave the service running and take over), but it unsurprisingly failed.

I tried and reset everything a few times to be sure and it seems to consistently fail unfortunately.

@St0rmingBr4in
Collaborator

There is a typo that slipped in: RestartSec=2i should be RestartSec=2. I'm updating the PR.
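For anyone following along, the corrected fragment of the unit file would read (systemd cannot parse "2i" as a time span):

[Service]
# "RestartSec=2i" is not a valid time value; the fix is simply:
RestartSec=2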

@2fst4u

2fst4u commented Mar 2, 2021

Unfortunately I'm still getting the same issue. When I run the main branch, the service starts successfully, but when I run PR #97 I get the following error when running systemctl status k3s.service:

● k3s.service
     Loaded: not-found (Reason: Unit k3s.service not found.)
     Active: failed (Result: exit-code) since Tue 2021-03-02 04:21:40 UTC; 6min ago
   Main PID: 94608 (code=exited, status=1/FAILURE)
      Tasks: 0 (limit: 9258)
     Memory: 6.2M
     CGroup: /system.slice/k3s.service

Mar 02 04:19:46 k3s-node1 k3s[94608]: E0302 04:19:46.676590   94608 remote_runtime.go:332] ContainerStatus "131939dfb55a07bbaaa0399b701c5feb08b0db88aa172053cbff9744a3b95106" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "131939dfb55a07bbaaa0399b701c5feb08b0db88aa172053cbff9744a3b95106": not found
Mar 02 04:19:46 k3s-node1 k3s[94608]: I0302 04:19:46.676695   94608 kuberuntime_gc.go:347] Error getting ContainerStatus for containerID "131939dfb55a07bbaaa0399b701c5feb08b0db88aa172053cbff9744a3b95106": rpc error: code = NotFound desc = an error occurred when try to find container "131939dfb55a07bbaaa0399b701c5feb08b0db88aa172053cbff9744a3b95106": not found
Mar 02 04:21:40 k3s-node1 systemd[1]: Stopping Lightweight Kubernetes...
Mar 02 04:21:40 k3s-node1 k3s[94608]: I0302 04:21:40.727567   94608 network_policy_controller.go:173] Shutting down network policies controller
Mar 02 04:21:40 k3s-node1 k3s[94608]: time="2021-03-02T04:21:40.730021824Z" level=info msg="Shutting down k3s.cattle.io/v1, Kind=Addon workers"
Mar 02 04:21:40 k3s-node1 k3s[94608]: I0302 04:21:40.730896   94608 controller.go:185] Shutting down kubernetes service endpoint reconciler
Mar 02 04:21:40 k3s-node1 k3s[94608]: time="2021-03-02T04:21:40.731163619Z" level=fatal msg="context canceled"
Mar 02 04:21:40 k3s-node1 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Mar 02 04:21:40 k3s-node1 systemd[1]: k3s.service: Failed with result 'exit-code'.
Mar 02 04:21:40 k3s-node1 systemd[1]: Stopped Lightweight Kubernetes.

Edit: and to clarify, I made sure it was the version with the typo fixed.

@St0rmingBr4in
Collaborator

It's normal for the first initialisation of the cluster: this change launches k3s in the k3s-init service. Could you send the logs of each of the k3s-init services? Also, if you already set up a non-HA cluster on the nodes, you need to run the reset playbook so that you start from a clean state.

@2fst4u

2fst4u commented Mar 2, 2021

Absolutely, I'm definitely running a reset each time.

Just to confirm, would that be systemctl status k3s-init.service? Just want to make sure I'm getting the command right before I go bumbling around.

@St0rmingBr4in
Collaborator

Yes. You can also use journalctl -ef -u k3s-init to see the logs in real time.

@2fst4u

2fst4u commented Mar 2, 2021

I've attached the last 200 lines of each node's output. 1, 2 and 3 are in the order they appear in the hosts file.

I initially thought the issue might have been the use of FQDN in the hosts.ini, so I reverted to IP and it does the same thing. Let me know if you need more logs.

log1.txt
log2.txt
log3.txt

@St0rmingBr4in
Collaborator

I'm also using FQDNs, so that is not the problem. I just retested on Debian Buster with v1.20.4+k3s1 and v1.19.5+k3s1, and I can confirm it works perfectly for me. Maybe it would help to have the full logs. Do you have time to give this another try? If you want, we could also take a look at this together.

@St0rmingBr4in
Collaborator

@2fst4u I updated the review based on @narkaTee's comment; you can run the reset playbook and retry. Tell me how it goes.

@mattthhdp

mattthhdp commented Mar 18, 2021

@St0rmingBr4in A clean install of Ubuntu gives me this error (the 20 retries and then failure) on each master node I am trying to join, with 0 worker nodes (if that can help):

Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325203    9900 reconciler.go:319] Volume detached for volume "helm-traefik-token-fccln" (UniqueName: "kubernetes.io/secret/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-helm-traefik-token-fccln") on node "rancher-01" DevicePath ""
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325209    9900 reconciler.go:319] Volume detached for volume "values" (UniqueName: "kubernetes.io/configmap/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-values") on node "rancher-01" DevicePath ""
Mar 17 23:15:13 Rancher-01 k3s[9900]: W0317 23:15:13.044943    9900 pod_container_deletor.go:79] Container "2cded66bbca92e6097862fe9aa41fbc37b2b9ca6714d7032c6cf809b48eea7b5" not found in pod's containers
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740228    9900 remote_runtime.go:332] ContainerStatus "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740251    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740607    9900 remote_runtime.go:332] ContainerStatus "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740624    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740862    9900 remote_runtime.go:332] ContainerStatus "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740875    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741108    9900 remote_runtime.go:332] ContainerStatus "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741121    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741329    9900 remote_runtime.go:332] ContainerStatus "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741350    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741590    9900 remote_runtime.go:332] ContainerStatus "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741610    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741889    9900 remote_runtime.go:332] ContainerStatus "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741904    9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:18:05 Rancher-01 systemd[1]: Stopping /usr/local/bin/k3s server --cluster-init...
Mar 17 23:18:05 Rancher-01 k3s[9900]: I0317 23:18:05.436838    9900 network_policy_controller.go:157] Shutting down network policies full sync goroutine
Mar 17 23:18:05 Rancher-01 k3s[9900]: {"level":"warn","ts":"2021-03-17T23:18:05.444Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\". Reconnecting..."}


@St0rmingBr4in
Collaborator

@mattthhdp Does k3s work using a one-node cluster? The error you are getting seems unrelated. The error in question is defined here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L360. Reading kubernetes/kubernetes#63336 gives a bit more information. Is containerd working on your machine?
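One quick way to check that, using the crictl symlink the playbook creates, might be:

# Ask k3s's embedded containerd for its runtime status; an error here
# points at a containerd problem rather than the playbook.
sudo k3s crictl info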

@mattthhdp

mattthhdp commented Mar 20, 2021

So with only 1 node (rancher-01):

k3s-init.service - /usr/local/bin/k3s server
   Loaded: loaded (/run/systemd/transient/k3s-init.service; transient)
Transient: yes
   Active: active (running) since Sat 2021-03-20 00:47:19 UTC; 7s ago
 Main PID: 13123 (k3s-server)
    Tasks: 20 (limit: 1073)
   CGroup: /system.slice/k3s-init.service
           ├─13123 /usr/local/bin/k3s server
           └─13199 containerd

Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.476509   13123 shared_informer.go:240] Waiting for caches to sync for expand
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.484332   13123 controllermanager.go:554] Started "endpointslicemirroring"
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.484397   13123 endpointslicemirroring_controller.go:211] Starting EndpointSliceMirroring controller
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.484402   13123 shared_informer.go:240] Waiting for caches to sync for endpoint_slice_mirroring
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.489346   13123 controllermanager.go:554] Started "replicationcontroller"
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.489558   13123 replica_set.go:182] Starting replicationcontroller controller
Mar 20 00:47:26 Rancher-01 k3s[13123]: I0320 00:47:26.489607   13123 shared_informer.go:240] Waiting for caches to sync for ReplicationController
Mar 20 00:47:26 Rancher-01 k3s[13123]: time="2021-03-20T00:47:26.519419976Z" level=info msg="Done waiting for CRD addons.k3s.cattle.io to become available"
Mar 20 00:47:26 Rancher-01 k3s[13123]: time="2021-03-20T00:47:26.519439666Z" level=info msg="Waiting for CRD helmcharts.helm.cattle.io to become available"
Mar 20 00:47:26 Rancher-01 k3s[13123]: E0320 00:47:26.908557   13123 node.go:161] Failed to retrieve node info: nodes "rancher-01" not found

In the Ansible VM I got:

Saturday 20 March 2021  00:47:39 +0000 (0:00:00.156)       0:00:23.876 ********
===============================================================================
k3s/master : Verify that all nodes actually joined --------------------- 12.01s
k3s/master : Enable and check K3s service ------------------------------- 4.40s
download : Download k3s binary x64 -------------------------------------- 1.71s
Gathering Facts --------------------------------------------------------- 0.66s
k3s/master : Copy K3s service file -------------------------------------- 0.46s
Gathering Facts --------------------------------------------------------- 0.46s
k3s/master : Clean previous runs of k3s-init ---------------------------- 0.42s
k3s/master : Configure kubectl cluster to https://rancher-01:6443 ------- 0.35s
k3s/master : Kill the temporary service used for initialization --------- 0.28s
k3s/master : Change file access node-token ------------------------------ 0.25s
k3s/master : Wait for node-token ---------------------------------------- 0.25s
k3s/master : Register node-token file access mode ----------------------- 0.24s
prereq : Enable IPv4 forwarding ----------------------------------------- 0.23s
k3s/master : Read node-token from master -------------------------------- 0.23s
raspberrypi : Test for raspberry pi /proc/cpuinfo ----------------------- 0.23s
k3s/master : Copy config file to user home directory -------------------- 0.17s
k3s/master : Init cluster inside the transient k3s-init service --------- 0.17s
k3s/master : Restore node-token file access ----------------------------- 0.16s
k3s/master : Create crictl symlink -------------------------------------- 0.16s
k3s/master : Create directory .kube ------------------------------------- 0.15s
jaune@ansible:~/ansible/ha$

And after a few seconds, on rancher-01 I got:

jaune@Rancher-01:~$ systemctl status k3s-init.service
● k3s-init.service - /usr/local/bin/k3s server
   Loaded: loaded (/run/systemd/transient/k3s-init.service; transient)
Transient: yes
   Active: failed (Result: exit-code) since Sat 2021-03-20 00:47:32 UTC; 1min 19s ago
 Main PID: 13123 (code=exited, status=1/FAILURE)

Mar 20 00:47:32 Rancher-01 k3s[13123]: time="2021-03-20T00:47:32.056720803Z" level=info msg="Shutting down batch/v1, Kind=Job workers"
Mar 20 00:47:32 Rancher-01 k3s[13123]: I0320 00:47:32.057065   13123 dynamic_cafile_content.go:182] Shutting down request-header::/var/lib/rancher/k3s/server/tls/request-header-ca.crt
Mar 20 00:47:32 Rancher-01 k3s[13123]: time="2021-03-20T00:47:32.056723973Z" level=info msg="Shutting down helm.cattle.io/v1, Kind=HelmChart workers"
Mar 20 00:47:32 Rancher-01 k3s[13123]: I0320 00:47:32.057067   13123 dynamic_cafile_content.go:182] Shutting down client-ca-bundle::/var/lib/rancher/k3s/server/tls/client-ca.crt
Mar 20 00:47:32 Rancher-01 k3s[13123]: time="2021-03-20T00:47:32.056725833Z" level=info msg="Shutting down helm.cattle.io/v1, Kind=HelmChartConfig workers"
Mar 20 00:47:32 Rancher-01 k3s[13123]: I0320 00:47:32.057072   13123 controller.go:89] Shutting down OpenAPI AggregationController
Mar 20 00:47:32 Rancher-01 k3s[13123]: time="2021-03-20T00:47:32.056727883Z" level=fatal msg="controllers exited"
Mar 20 00:47:32 Rancher-01 systemd[1]: k3s-init.service: Main process exited, code=exited, status=1/FAILURE
Mar 20 00:47:32 Rancher-01 systemd[1]: k3s-init.service: Failed with result 'exit-code'.
Mar 20 00:47:32 Rancher-01 systemd[1]: Stopped /usr/local/bin/k3s server.

And finally, with the command journalctl -ef -u k3s-init: a wall of text XD
log.txt

If that can help, in /etc/rancher/k3s/k3s.yaml:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS$$$$(truncated)
    server: https://127.0.0.1:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: LS0tLS1CRUdJTiBD$$$$(truncated)

Hope you don't need anything else.

Edit: also, the k3s script is working as it's supposed to with only 1 node:

jaune@Rancher-01:~$ curl -sfL https://get.k3s.io | sh -
[INFO]  Finding release for channel stable
[INFO]  Using v1.20.4+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.20.4+k3s1/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.20.4+k3s1/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, already exists
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
jaune@Rancher-01:~$ kubectl get nodes
WARN[2021-03-20T01:02:04.006895386Z] Unable to read /etc/rancher/k3s/k3s.yaml, please start server with --write-kubeconfig-mode to modify kube config permissions
error: error loading config file "/etc/rancher/k3s/k3s.yaml": open /etc/rancher/k3s/k3s.yaml: permission denied
jaune@Rancher-01:~$ sudo kubectl get nodes
NAME         STATUS   ROLES                  AGE   VERSION
rancher-01   Ready    control-plane,master   15s   v1.20.4+k3s1

@St0rmingBr4in
Collaborator

St0rmingBr4in commented Mar 23, 2021

@mattthhdp This issue you are having is not related to PR #97, since it also does not work using a one-node cluster. I think your error is related to this one: k3d-io/k3d#110

@mattthhdp

@St0rmingBr4in Any idea why the Rancher script
curl -sfL https://get.k3s.io | sh -
is working, but not the Ansible playbook? I will try the system prune -a and give you feedback. Also, thank you a lot for the hand!!

@St0rmingBr4in
Collaborator

I am not really sure. Did you run the reset playbook or the uninstall script between runs?

@mattthhdp

@St0rmingBr4in I did. I'm a little short on time today; I will try to recreate my 3 VMs (clean Ubuntu install without any packages) and test again tomorrow.

@mattthhdp

So, with 3 clean Ubuntu VMs: while the playbook is doing its 20 retries waiting for nodes to join the cluster, I cannot issue kubectl commands. Here is the log. Do I have to install Docker or run the Rancher k3s script before the ansible-playbook? I'm a little lost here....

Master1.txt

@mathew-fleisch

@mattthhdp I had similar error messages and came to this thread. There were a couple of things wrong in my case that might also apply to you. First, I ran a reset on the cluster, but one node didn't reset all of the way and kept trying to connect to the master node. I ran ansible-playbook reset.yml on the old/bad node that was flooding the logs with handshake errors.

Example of the handshake error:

http: TLS handshake error from 10.0.10.136:47936: remote error: tls: bad certificate

Once it was uninstalled, the logs cleaned up a bit, but I still had the permission-denied/TLS error you were getting as well (near the end of the log). I then manually chown'ed the file it was complaining about on the master node, and was able to access the cluster again. I am new to Ansible and not sure where this change would go within the playbook. I am also new to k3s, so I am not sure if this is a good or bad idea; however, it unblocked me.

sudo chown $USER:$USER /etc/rancher/k3s/k3s.yaml
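As the warning near the end of the log above suggests, an alternative to chown'ing the file is to have k3s write the kubeconfig with relaxed permissions in the first place (weigh the security trade-off of a world-readable kubeconfig):

# Equivalent unblocking step without changing ownership:
k3s server --write-kubeconfig-mode 644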

@2fst4u

2fst4u commented May 15, 2021

@2fst4u I updated the review based on @narkaTee's comment; you can run the reset playbook and retry. Tell me how it goes.

Super late response.

I never got a chance to retry this; I've started relying on the cluster in single-master form and running things "in production" (not really in production, just in my home).

But back when I was trying this, I stumbled upon some info suggesting my issue may have been related to running the OS on SD cards, which etcd finds too slow. Does that sound like it could be the issue?
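That is plausible: etcd is very sensitive to fsync latency, and SD cards often fail its timing expectations. A common way to sanity-check the storage is the widely cited etcd disk benchmark, sketched below (treat the exact thresholds as a rule of thumb):

# Write 22 MiB in 2300-byte chunks with an fdatasync after each write,
# mimicking etcd's WAL pattern; the 99th-percentile fdatasync latency
# should stay under roughly 10ms for etcd to be happy.
fio --name=etcd-bench --filename=/var/lib/etcd-test --size=22m \
    --bs=2300 --fdatasync=1 --ioengine=sync --rw=write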

@yehtetmaungmaung

yehtetmaungmaung commented Aug 7, 2023

Ah, I see, so it's the workers that need a load-balanced endpoint. If, hypothetically, you just had three masters and no workers, no load-balanced endpoint would be required, right?

Hello, new to k3s and Ansible here. I set up a cluster with 3 server and 2 agent nodes. If the primary server (the one where cluster-init is executed) goes down, the agents can still communicate with the two remaining server nodes. Here are the logs from the agents:

Updated load balancer k3s-agent-load-balancer server addresses -> [192.168.122.149:6443 192.168.122.47:6443]

I think agent nodes have some kind of load balancer with health checks for all server nodes, and they automatically fail over. My question is: is this expected behavior? If so, do we still need an external LB for the apiserver?

@dereknola
Member

@yehtetmaungmaung K3s comes with a service load balancer; see https://docs.k3s.io/networking#service-load-balancer for information. What this does is expose ingress traffic on every node. So say you have nginx set up on pods on node A, with port 80. What the K3s-provided balancer does is allow you to send nginx traffic to port 80 on node B, and it will automatically route the nginx traffic to the pods on node A. It basically fulfills a role that many cloud-provided LBs (GKE, EKS) offer.

What this isn't is an external load balancer. It does not load balance ingress traffic across nodes, and it doesn't replace the functionality of providing a single registration point for servers as described in https://docs.k3s.io/datastore/cluster-loadbalancer
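In other words, the stable registration address from that doc is still what agents should be pointed at, so a server failure doesn't orphan them; something like the following, with a hypothetical load-balanced hostname:

# Agents register through a stable, load-balanced address rather than
# any single server's IP.
k3s agent --server https://k3s.example.com:6443 --token <token>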
