
Usernetes Gen2: depends on Rootless Docker on hosts #287

Merged (1 commit) on Sep 5, 2023

Conversation

@AkihiroSuda (Member) commented Aug 26, 2023

For


Usernetes (Gen2) deploys a Kubernetes cluster on Rootless Docker hosts.

Note

Usernetes (Gen2) has significantly diverged from the original Usernetes (Gen1),
which did not rely on Rootless Docker hosts.

See the gen1 branch for
the original Usernetes (Gen1).

Usernetes (Gen2) is similar to Rootless kind and Rootless minikube,
but Usernetes (Gen2) supports creating a cluster with multiple hosts.

Components

  • Cluster configuration: kubeadm
  • CRI: containerd
  • OCI: runc
  • CNI: Flannel

Requirements

  • systemd: cgroup v2 controllers must be delegated to the user session:

sudo mkdir -p /etc/systemd/system/[email protected]

cat <<EOF | sudo tee /etc/systemd/system/[email protected]/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF

sudo systemctl daemon-reload
  • Kernel modules:
sudo modprobe vxlan

Using Ubuntu 22.04 hosts is recommended.
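After the daemon-reload (and a re-login), delegation can be verified by reading the session's controllers file. A diagnostic sketch, assuming systemd's standard user.slice layout:

```shell
# Print the controllers delegated to the current user session.
# Expect at least: cpu cpuset io memory pids
ctrl="/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
cat "$ctrl" 2>/dev/null || echo "no delegated user cgroup found at $ctrl"
```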

Usage

See make help.

# Bootstrap a cluster
make up
make kubeadm-init
make install-flannel

# Enable kubectl
make kubeconfig
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get pods -A

# Multi-host
make join-command
scp join-command another-host:~/usernetes
ssh another-host make -C ~/usernetes up kubeadm-join

# Debug
make logs
make shell
make down-v
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

@AkihiroSuda AkihiroSuda changed the title [WIP] Usernetes G2: depends on Rootless Docker on hosts [WIP] Usernetes Gen2: depends on Rootless Docker on hosts Aug 26, 2023
@AkihiroSuda AkihiroSuda changed the title [WIP] Usernetes Gen2: depends on Rootless Docker on hosts Usernetes Gen2: depends on Rootless Docker on hosts Aug 26, 2023
@vsoch commented Sep 5, 2023

At least ~/.local/share/docker has to be a local ext4 or XFS filesystem.

Okay so sounds like I should try to remove the shared filesystem and get rootless working? Is there a way to change that path and I can throw it somewhere else?

@AkihiroSuda (Member Author) commented

At least ~/.local/share/docker has to be a local ext4 or XFS filesystem.

Okay so sounds like I should try to remove the shared filesystem and get rootless working? Is there a way to change that path and I can throw it somewhere else?

I guess you can just make a symlink, or echo '{"data-root": "/somewhere"}' >~/.config/docker/daemon.json
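For the daemon.json route, a minimal sketch (the "/somewhere" target is a placeholder; pick any directory on a local ext4 or XFS filesystem):

```shell
# Relocate rootless Docker's data-root ("/somewhere" is a placeholder path)
mkdir -p ~/.config/docker
cat > ~/.config/docker/daemon.json <<'EOF'
{"data-root": "/somewhere"}
EOF
# Afterwards, restart the rootless daemon: systemctl --user restart docker
```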

@vsoch commented Sep 6, 2023

Okay, making progress! I made the nodes isolated for now - we can try the above later. I was able to get rootless Docker installed and the control plane and nodes up. I'm trying to run the hack test now, and there is an error with the shell. When I Ctrl-C:

$ kubectl  get pods
NAME            READY   STATUS    RESTARTS   AGE
dnstest-0       1/1     Running   0          117s
dnstest-1       1/1     Running   0          114s
dnstest-2       1/1     Running   0          111s
dnstest-shell   0/1     Error     0          110s

It looks like the entrypoint is doing wget to the others, so I can try that manually. Ah, there is a timeout:

$ kubectl exec -it dnstest-1 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Error from server: error dialing backend: dial tcp 10.10.0.5:10250: i/o timeout

Update: the same timeout happens with make shell. Is it just memory maybe? oom score?

Sep 06 00:54:01 u7s-usernetes-compute-001 kubelet[877]: I0906 00:54:01.910936     877 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/coredns-5dd5756b68-6dljd" podStartSLOduration=19.910898197 podCreationTimestamp="2023-09-06 00:53:42 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2023-09-06 00:54:01.910498208 +0000 UTC m=+32.344325656" watchObservedRunningTime="2023-09-06 00:54:01.910898197 +0000 UTC m=+32.344725648"
Sep 06 00:58:30 u7s-usernetes-compute-001 kubelet[877]: E0906 00:58:30.319024     877 container_manager_linux.go:509] "Failed to ensure process in container with oom score" err="failed to apply oom score -999 to PID 877: write /proc/877/oom_score_adj: permission denied"

Maybe related? https://gitlab.freedesktop.org/dbus/dbus/-/issues/374

@AkihiroSuda (Member Author) commented Sep 6, 2023

  • Does the test work with a single-node mode? (with kubectl taint nodes --all node-role.kubernetes.io/control-plane-)
  • Does kubectl get nodes -o wide or kubectl describe nodes print some error?
  • 10250/tcp might be blocked by firewall?
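For the firewall point, TCP reachability can be probed without nc using bash's /dev/tcp. A sketch (the 10.10.0.5 address is from this thread; substitute your control-plane host):

```shell
# probe_port HOST PORT -> prints "open" or "closed" (bash /dev/tcp, no nc needed)
probe_port() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

probe_port 10.10.0.5 6443    # kube-apiserver
probe_port 10.10.0.5 10250   # kubelet
```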

Is it just memory maybe? oom score?

Unlikely

Maybe related? https://gitlab.freedesktop.org/dbus/dbus/-/issues/374

Unlikely

@vsoch commented Sep 6, 2023

Working on this now - it looks like the firewall is OK:

[screenshot of firewall rules]

Testing the others now.

@AkihiroSuda (Member Author) commented

Also, please make sure "10.10.0.5" is the IP of the host (not the node container) that is reachable from other hosts.
If not, you may have to run make with HOST_IP=XXX.XXX.XXX.XX explicitly.
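To find the right value, one option is to ask the kernel which source address it would use to reach an external host (a sketch; assumes a default route is configured):

```shell
# The "src" field of `ip route get` is the address peers on that route will see
HOST_IP=$(ip route get 1.1.1.1 2>/dev/null \
  | awk '{for (i = 1; i < NF; i++) if ($i == "src") print $(i + 1)}' | head -n1)
echo "HOST_IP=${HOST_IP}"
# then, e.g.: make up kubeadm-join HOST_IP="${HOST_IP}"
```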

@vsoch commented Sep 6, 2023

This has happened twice now - it freezes on the worker node connecting:

[+] Running 1/0
 ✔ Container usernetes-node-1  Running                                                           0.0s 
docker compose exec -e U7S_HOST_IP=10.10.0.4 -e U7S_NODE_NAME=u7s-usernetes-compute-003 -e U7S_NODE_SUBNET=10.100.5.0/24 node kubeadm join 10.10.0.5:6443 --token w50k8z.cg55fshm4x9hmmrk --discovery-token-ca-cert-hash sha256:f3709024d7fb0f5ba150b05de6221bdfc6422fd524c593013154648c1d8418ad 
[preflight] Running pre-flight checks
	[WARNING SystemVerification]: missing optional cgroups: hugetlb

I'm just going to Ctrl-C and continue with one worker node for now.

@vsoch commented Sep 6, 2023

Actually, I take it back - it's not working on either worker node now. This step hangs:

$ make -C /opt/usernetes up kubeadm-join
make: Entering directory '/opt/usernetes'
./Makefile.d/check-preflight.sh
[WARNING] systemd lingering is not enabled. Run `sudo loginctl enable-linger $(whoami)` to enable it, otherwise Kubernetes will exit on logging out.
[WARNING] Kernel module "ip6_tables" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "ip6table_nat" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "iptable_nat" does not seem loaded? (negligible if built-in to the kernel)
docker compose up --build -d
[+] Building 0.5s (11/11) FINISHED                                                     docker:default
 => [node internal] load .dockerignore                                                           0.0s
 => => transferring context: 66B                                                                 0.0s
 => [node internal] load build definition from Dockerfile                                        0.0s
 => => transferring dockerfile: 994B                                                             0.0s
 => [node internal] load metadata for docker.io/kindest/node:v1.28.0                             0.3s
 => [node internal] load build context                                                           0.0s
 => => transferring context: 84B                                                                 0.0s
 => [node] https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-  0.2s
 => [node stage-3 1/4] FROM docker.io/kindest/node:v1.28.0@sha256:b7a4cad12c197af3ba43202d3efe0  0.0s
 => CACHED [node cni-plugins-amd64 1/1] ADD https://github.com/containernetworking/plugins/rele  0.0s
 => CACHED [node stage-3 2/4] RUN --mount=type=bind,from=cni-plugins,dst=/mnt/tmp   tar Cxzvf /  0.0s
 => CACHED [node stage-3 3/4] RUN apt-get update && apt-get install -y --no-install-recommends   0.0s
 => CACHED [node stage-3 4/4] ADD Dockerfile.d/u7s-entrypoint.sh /                               0.0s
 => [node] exporting to image                                                                    0.0s
 => => exporting layers                                                                          0.0s
 => => writing image sha256:e05649c01de33dd232081d438e377c437f5ce1b098ffa2ac648a1fd8f1a5824d     0.0s
 => => naming to docker.io/library/usernetes-node                                                0.0s
[+] Running 1/0
 ✔ Container usernetes-node-1  Running                                                           0.0s 
docker compose exec -e U7S_HOST_IP=10.10.0.3 -e U7S_NODE_NAME=u7s-usernetes-compute-002 -e U7S_NODE_SUBNET=10.100.153.0/24 node kubeadm join 10.10.0.5:6443 --token xxxxxxxxxxxxxxx --discovery-token-ca-cert-hash sha256:xxxxxxxxxxxxxx
[preflight] Running pre-flight checks
	[WARNING SystemVerification]: missing optional cgroups: hugetlb

Even when I run the linger command and daemon-reload, that message still comes up.

@AkihiroSuda (Member Author) commented

--token

The token value shouldn't be pasted publicly.
It's probably safe as long as you are using private IP addresses, though.

This step hangs:

Seems a networking issue.

Is 10.10.0.5:6443 reachable from 10.10.0.3 (and 10.10.0.5 itself)?

@vsoch commented Sep 6, 2023

Weird, I'm getting the error earlier (I haven't joined the worker nodes yet, this is from the control plane)

Sep 06 02:26:44 u7s-usernetes-compute-001 kubelet[870]: E0906 02:26:44.421716     870 container_manager_linux.go:509] "Failed to ensure process in container with oom score" err="failed to apply oom score -999 to PID 870: write /proc/870/oom_score_adj: permission denied"

@vsoch commented Sep 6, 2023

I don't think so - this is from the same host:

$ ping usernetes-compute-001
PING usernetes-compute-001.c.llnl-flux.internal (10.10.0.5) 56(84) bytes of data.
64 bytes from usernetes-compute-001.c.llnl-flux.internal (10.10.0.5): icmp_seq=1 ttl=64 time=0.025 ms
64 bytes from usernetes-compute-001.c.llnl-flux.internal (10.10.0.5): icmp_seq=2 ttl=64 time=0.041 ms
64 bytes from usernetes-compute-001.c.llnl-flux.internal (10.10.0.5): icmp_seq=3 ttl=64 time=0.030 ms
^C
--- usernetes-compute-001.c.llnl-flux.internal ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2024ms
rtt min/avg/max/mdev = 0.025/0.032/0.041/0.006 ms
sochat1_llnl_gov@usernetes-compute-001:/opt/usernetes$ ping 10.10.0.5
PING 10.10.0.5 (10.10.0.5) 56(84) bytes of data.
64 bytes from 10.10.0.5: icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from 10.10.0.5: icmp_seq=2 ttl=64 time=0.039 ms
^C
--- 10.10.0.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 0.035/0.037/0.039/0.002 ms
sochat1_llnl_gov@usernetes-compute-001:/opt/usernetes$ ping 10.10.0.5:6443
ping: 10.10.0.5:6443: Name or service not known

And here is from 002:

$ ping usernetes-compute-001
PING usernetes-compute-001.c.llnl-flux.internal (10.10.0.5) 56(84) bytes of data.
64 bytes from usernetes-compute-001.c.llnl-flux.internal (10.10.0.5): icmp_seq=1 ttl=64 time=0.738 ms
64 bytes from usernetes-compute-001.c.llnl-flux.internal (10.10.0.5): icmp_seq=2 ttl=64 time=0.133 ms
^C
--- usernetes-compute-001.c.llnl-flux.internal ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.133/0.435/0.738/0.302 ms
sochat1_llnl_gov@usernetes-compute-002:~$ ping 10.10.0.5
PING 10.10.0.5 (10.10.0.5) 56(84) bytes of data.
64 bytes from 10.10.0.5: icmp_seq=1 ttl=64 time=0.717 ms
64 bytes from 10.10.0.5: icmp_seq=2 ttl=64 time=0.142 ms
^C
--- 10.10.0.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 0.142/0.429/0.717/0.287 ms
sochat1_llnl_gov@usernetes-compute-002:~$ ping 10.10.0.5:6443
ping: 10.10.0.5:6443: Name or service not known

@AkihiroSuda (Member Author) commented

ping: 10.10.0.5:6443: Name or service not known

ping cannot test TCP ports; curl -k https://10.10.0.5:6443 may suffice.
(If the server is functional, it prints "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",)

@vsoch commented Sep 6, 2023

No route to host. That's so weird, this just worked on the previous cluster I brought up (and no differences)

$ curl -k https://10.10.0.5:6443
curl: (7) Failed to connect to 10.10.0.5 port 6443 after 0 ms: No route to host

@vsoch commented Sep 6, 2023

I'm going to tear it down and bring up again from scratch.

@vsoch commented Sep 6, 2023

Okay, this time I am trying a larger node (just as a sanity check), and the pre-flight check failed for the first node:

make: Entering directory '/opt/usernetes'
./Makefile.d/check-preflight.sh
[WARNING] systemd lingering is not enabled. Run `sudo loginctl enable-linger $(whoami)` to enable it, otherwise Kubernetes will exit on logging out.
[WARNING] Kernel module "ip6_tables" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "ip6table_nat" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "iptable_nat" does not seem loaded? (negligible if built-in to the kernel)
docker compose up --build -d
[+] Building 23.0s (11/11) FINISHED                                                   docker:rootless
 => [node internal] load build definition from Dockerfile                                        0.0s
 => => transferring dockerfile: 994B                                                             0.0s
 => [node internal] load .dockerignore                                                           0.0s
 => => transferring context: 66B                                                                 0.0s
 => [node internal] load metadata for docker.io/kindest/node:v1.28.0                             0.6s
 => [node] https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-  0.6s
 => [node stage-3 1/4] FROM docker.io/kindest/node:v1.28.0@sha256:b7a4cad12c197af3ba43202d3efe  15.7s
 => => resolve docker.io/kindest/node:v1.28.0@sha256:b7a4cad12c197af3ba43202d3efe03246b3f0793f1  0.0s
 => => sha256:b7a4cad12c197af3ba43202d3efe03246b3f0793f162afb40a33c923952d5b31 741B / 741B       0.0s
 => => sha256:9f3ff58f19dcf1a0611d11e8ac989fdb30a28f40f236f59f0bea31fb956ccf5c 743B / 743B       0.0s
 => => sha256:ad70201dab1369d251eeea8018a6e230a244e6ebd9cbd13599a1a9ac80d57bdb 1.94kB / 1.94kB   0.0s
 => => sha256:f86a56ded609290d97bd193f9c72e4f270c9e852bddae68e772b37828e76a 123.82MB / 123.82MB  1.8s
 => => sha256:32e9990d17952234896c1113bf84009f6e553dde4de92d5c1539b50ab0adb 310.29MB / 310.29MB  4.6s
 => => extracting sha256:f86a56ded609290d97bd193f9c72e4f270c9e852bddae68e772b37828e76a3e5        2.3s
 => => extracting sha256:32e9990d17952234896c1113bf84009f6e553dde4de92d5c1539b50ab0adb4ec        2.6s
 => [node internal] load build context                                                           0.0s
 => => transferring context: 818B                                                                0.0s
 => [node cni-plugins-amd64 1/1] ADD https://github.com/containernetworking/plugins/releases/do  0.1s
 => [node stage-3 2/4] RUN --mount=type=bind,from=cni-plugins,dst=/mnt/tmp   tar Cxzvf /opt/cni  1.2s
 => [node stage-3 3/4] RUN apt-get update && apt-get install -y --no-install-recommends   gette  4.7s 
 => [node stage-3 4/4] ADD Dockerfile.d/u7s-entrypoint.sh /                                      0.0s 
 => [node] exporting to image                                                                    0.6s 
 => => exporting layers                                                                          0.6s
 => => writing image sha256:39fbc7ab2ae1c40a8028d40ac5dfcb8ce0c6ae99a13f984b10ae46a4b4002a11     0.0s
 => => naming to docker.io/library/usernetes-node                                                0.0s
[+] Running 5/5
 ✔ Network usernetes_default    Created                                                          0.1s 
 ✔ Volume "usernetes_node-var"  Created                                                          0.0s 
 ✔ Volume "usernetes_node-opt"  Created                                                          0.0s 
 ✔ Volume "usernetes_node-etc"  Created                                                          0.0s 
 ✔ Container usernetes-node-1   Started                                                          4.9s 
docker compose exec -e U7S_HOST_IP=10.10.0.4 -e U7S_NODE_NAME=u7s-usernetes-compute-002 -e U7S_NODE_SUBNET=10.100.153.0/24 node kubeadm join 10.10.0.3:6443 --token xxxxxxxxxxx --discovery-token-ca-cert-hash sha256:xxxxxxxxxxxxxxxx
[preflight] Running pre-flight checks
	[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR CRI]: container runtime is not running: output: time="2023-09-06T04:06:04Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
make: *** [Makefile:97: kubeadm-join] Error 1

And for the second node it's still hanging.

@AkihiroSuda I think you are 15 hours ahead of me, so 4pm my time is 7am your time, 5pm my time is 8am (start of the work day?) We were planning on doing a small hackathon this Friday to work on this - and I wanted to invite you / see if you are available? 7am is quite early, but if you are up around 8am I think I could work a bit later on Friday. I could potentially do an hour later, just need some notice for that!

High level - I'd like to get this terraform setup working, and consistently, so I can contribute it here. I am going to try one more thing tonight - bringing it totally down and up, and running the scripts interactively. If there is some subtle difference with a service not persisting in this automated mode, that might do it. I will update the thread here, and let me know if you might have some time on Friday so we can bring up this setup and get your eyes on it (I am likely missing something obvious and this is very likely the best means to finishing it up!)

@vsoch commented Sep 6, 2023

Reproduced in manual running mode, so unlikely to be the automation bit.

 ✔ Network usernetes_default    Created                                                                                                                0.1s 
 ✔ Volume "usernetes_node-opt"  Created                                                                                                                0.0s 
 ✔ Volume "usernetes_node-etc"  Created                                                                                                                0.0s 
 ✔ Volume "usernetes_node-var"  Created                                                                                                                0.0s 
 ✔ Container usernetes-node-1   Started                                                                                                                4.9s 
docker compose exec -e U7S_HOST_IP=10.10.0.4 -e U7S_NODE_NAME=u7s-usernetes-compute-002 -e U7S_NODE_SUBNET=10.100.153.0/24 node kubeadm join 10.10.0.3:6443 --token boydm6.lgdgji6o10zhcrww --discovery-token-ca-cert-hash sha256:60006cde0edda31f26cae0f2a80ef7fac7803d1121ab98678fa81edc220c212a 
[preflight] Running pre-flight checks
	[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR CRI]: container runtime is not running: output: time="2023-09-06T04:33:28Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
make: *** [Makefile:97: kubeadm-join] Error 1
make: Leaving directory '/opt/usernetes'
make: Entering directory '/opt/usernetes'
./Makefile.d/check-preflight.sh

I can confirm this works on the main control plane:

$ curl -k https://10.10.0.3:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

so it's definitely just not being able to reach that port.

@vsoch commented Sep 6, 2023

I'm going to try adding egress for that port. It doesn't make sense that it worked the first time, but it's worth a shot!

@vsoch commented Sep 6, 2023

Nice! So the nodes (one worker node for now) are coming up again. So I think it was egress, but I can't say why it worked the first time! There is still some flakiness related to the actual instance and cgroups; I've seen this a couple of times (usually on just one node): it's like one of the nodes randomly starts without support for the updated cgroups (and reports missing systemd).

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
systemd is already the newest version (249.11-0ubuntu3.9).
systemd set to manually installed.
uidmap is already the newest version (1:4.8.1-2ubuntu2.1).
The following packages were automatically installed and are no longer required:
  libntfs-3g89 libnuma1
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 33 not upgraded.
[Service]
Delegate=cpu cpuset io memory pids
cat: /sys/fs/cgroup/user.slice/user-501043911.slice/[email protected]/cgroup.controllers: No such file or directory
Failed to connect to bus: No such file or directory
[INFO] systemd not detected, dockerd-rootless.sh needs to be started manually:

I would suspect this is Google Cloud or Terraform related, not usernetes, but I don't know. But now that I know egress was an issue and we had issues with the ports for the test app, I'm going to blow it up again and expose more for egress. Will send an update!

@vsoch commented Sep 6, 2023

Okay, I reproduced what I had earlier - it seems a bit flaky (not usernetes, the terraform), but it did work a second time. Where we are at: the nodes come up, but the test doesn't work.

$ cd /opt/usernetes/hack
./test-smoke.sh
[INFO] Waiting for nodes to be ready
node/u7s-usernetes-compute-001 condition met
node/u7s-usernetes-compute-002 condition met
node/u7s-usernetes-compute-003 condition met
[INFO] Creating StatefulSet "dnstest" and headless Service "dnstest"
service/dnstest created
statefulset.apps/dnstest created
[INFO] Waiting for 3 replicas to be ready
Waiting for 3 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
[INFO] Connecting to dnstest-{0,1,2}.dnstest.default.svc.cluster.local
If you don't see a command prompt, try pressing enter.

Let me know if you might be able to join Friday! If not we can keep going back and forth here. The next thing to figure out is why I can't shell / connect to a pod.

@vsoch commented Sep 6, 2023

Perhaps there is some range of IPs that needs to be open for the pods that I should try adding to egress. Adding the entire range seemed to bork the fix for 6443.

[INFO] Connecting to dnstest-{0,1,2}.dnstest.default.svc.cluster.local
If you don't see a command prompt, try pressing enter.
warning: couldn't attach to pod/dnstest-shell, falling back to streaming logs: error dialing backend: dial tcp 10.10.0.5:10250: i/o timeout
pod "dnstest-shell" deleted
Error from server: Get "https://10.10.0.5:10250/containerLogs/default/dnstest-shell/dnstest-shell": dial tcp 10.10.0.5:10250: i/o timeout

@vsoch commented Sep 6, 2023

I'm off to bed - thanks for the help today @AkihiroSuda !

@AkihiroSuda (Member Author) commented Sep 6, 2023

We were planning on doing a small hackathon this Friday to work on this - and I wanted to invite you / see if you are available? 7am is quite early, but if you are up around 8am I think I could work a bit later on Friday. I could potentially do an hour later, just need some notice for that!

👍

Google Cloud

VXLAN doesn't seem to work on Google Cloud by default, although it works on AWS and Azure.

Likely related to MTU.

@vsoch commented Sep 6, 2023

I'm going to ask if there are easy ways to get VXLAN working in GCP - ping @aojea. If not, I can prepare an equivalent setup on AWS. I have one for AWS with Flux, and I'd need to start that over to use a different ubuntu base, remove flux, etc. https://github.com/converged-computing/flux-terraform-ami.

@aojea commented Sep 6, 2023

VXLAN works. If there is an MTU problem, it is most probably solved by reducing the MTU at the origin or increasing it in the network (VM) so the encapsulation goes through: https://cloud.google.com/vpc/docs/mtu
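As a back-of-the-envelope check (the arithmetic is mine, not from the thread): VXLAN encapsulation adds 50 bytes (14 inner Ethernet + 8 VXLAN + 8 UDP + 20 outer IPv4), so the inner interfaces must use an MTU at least 50 below the VPC MTU; GCP VPCs default to 1460:

```shell
# Largest inner MTU that still fits after VXLAN encapsulation (50-byte overhead)
vxlan_inner_mtu() { echo $(( $1 - 50 )); }

vxlan_inner_mtu 1460   # GCP default VPC MTU -> 1410
vxlan_inner_mtu 1500   # classic Ethernet MTU -> 1450
```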

@AkihiroSuda (Member Author) commented

@vsoch Are you still planning something today? (8:22 AM Friday here)

@vsoch commented Sep 7, 2023

@AkihiroSuda my mistake in mixing up my reference days - it's still Thursday here! So our hackathon would be tomorrow at 3pm Mountain time in the US (it looks like that's about 21.5 hours from now). And we have two things we can look at - first is the usernetes setup here, and the second is an AWS equivalent I've started, although we are still in early steps (e.g., ensuring each node knows the hostname of the others).

@AkihiroSuda (Member Author) commented

@AkihiroSuda my mistake in mixing up my reference days - it's still Thursday here! So our hackathon would be tomorrow at 3pm Mountain time in the US (it looks like that's about 21.5 hours from now). And we have two things we can look at - first is the usernetes setup here, and the second is an AWS equivalent I've started, although we are still in early steps (e.g., ensuring each node knows the hostname of the others).

Sorry, I can't attend then, but I'm happy to help with your experiments on AWS.

@vsoch commented Sep 8, 2023

No worries! I can give you an update then. I can tell you that I can't consistently get the GCP setup working, maybe because of networking stuff. It worked once, but then not again, even when I upped the MTU. I'm hoping we just have more luck on AWS and can develop there - will give you an update!

@vsoch commented Sep 8, 2023

And @AkihiroSuda, next time we will make sure to plan one on our Thursday, which I'm realizing is your Friday morning. Apologies for the oversight!
