
pure1-unplugged - kubelet node status | ERROR | kubelet node is not ready #41

Open

Sephzer opened this issue Feb 24, 2021 · 16 comments

@Sephzer

Sephzer commented Feb 24, 2021

I'm experiencing an issue with a brand new .ISO install.

After completing the guide and browsing to the pure1-unplugged IP I get ERR_CONNECTION_REFUSED. Checking the CLI and running puctl infra status shows an error for the kubelet node status check; everything else has passed and is green.

I have never used Kubernetes before, so I am not sure where to look for help on this, especially as this looks to be a non-standard implementation. I have checked the logs referenced by the infra status command, but nothing stands out.

Can you let me know how best to troubleshoot this? I have also experienced the same error when installing from the .OVA.

Just to note, Zubair from Pure professional services is also engaged, and ideally we need to get Pure1 up and running ASAP. Cheers.

@Pure-AdamuKaapan
Collaborator

Hello @Sephzer , sorry to hear you're running into this issue.

A few questions for you:

  • Does this kubelet issue happen right after you run puctl infra init, or only after completing the installation?
  • Can you run journalctl -u kubelet and upload the output so we can see why kubelet is crashing? Feel free to upload that here (censor if you'd like) or send it in an email; a quick way to capture the output to a file is sketched below.

I haven't seen this issue before so I'm hoping it's an easy fix/environment factor.
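If uploading is an option, a rough sketch for getting those logs into a single file first (the file paths are just examples):

journalctl -u kubelet --no-pager > /tmp/kubelet.log                       # full kubelet log, no pager
journalctl -u kubelet -b --no-pager > /tmp/kubelet-current-boot.log       # current boot only, if the full log is huge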

@Sephzer
Author

Sephzer commented Feb 26, 2021

Hi @Pure-AdamuKaapan , thanks for getting back to me.

Unfortunately, I won't be able to upload any logs as this is in a dark site environment and is completely locked down.

I noticed that after the initial install none of the services were running; it was showing 1/15 successful checks. So naturally I ran the init command, and everything came up apart from that one service.

I know enough Linux to get around comfortably and read logs, etc. Can you let me know what to check and what to look out for? I appreciate that this is not going to be easy, but we don't really have any other way to troubleshoot.

@Pure-AdamuKaapan
Collaborator

1/15 before running anything sounds reasonable (I don't remember what the one passing check is, but it's probably some pre-condition).

Could you go ahead and check the kubelet journalctl logs and see if there are any obvious errors as to why it may not be starting? I'm placing my bets on networking setup for now, but I guess we'll find out. Thanks!
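Roughly what I'd look at, as a sketch (standard systemd tooling, nothing Unplugged-specific):

systemctl status kubelet --no-pager                                        # is the service active, failed, or restart-looping?
journalctl -u kubelet -b --no-pager | grep -iE "error|fail" | tail -n 40   # most recent error-looking lines from this boot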

@Sephzer
Author

Sephzer commented Mar 3, 2021 via email

@Pure-AdamuKaapan
Collaborator

@Sephzer I believe in journalctl you can scroll to the right using the arrow keys (at least, I'm able to on my instance of Unplugged). If not, you can save the output to a file and then use an editor (whether on the VM or on another server in the same network) to view the full thing.
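For reference, a couple of ways to get at the full, un-truncated lines (the file path is just an example):

journalctl -u kubelet --no-pager > /tmp/kubelet.log   # dump without the pager
less -S /tmp/kubelet.log                              # -S chops long lines; the arrow keys then scroll sideways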

@Sephzer
Author

Sephzer commented Mar 3, 2021 via email

@Sephzer
Author

Sephzer commented Mar 4, 2021

@Pure-AdamuKaapan I've managed to capture the logs properly, so they're now legible. Thanks for the assist there. Now on to what's in the logs...

There are lots of failures connecting to the API server on port 6443; the connection is always refused, which looks like a firewall issue, though that's a bit strange as it's connecting to itself. Then there are a lot of 'node not found' errors. There are also lots of 'no networks found in /etc/cni/net.d' messages, along with 'container runtime network not ready' and 'network plugin not ready'.

Not sure if that helps. Let me know if you need any specifics.
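A quick, generic sanity check on the 6443 refusals, in case it's useful (standard tools, not Unplugged-specific; -k just skips certificate verification):

ss -tlnp | grep 6443                       # is kube-apiserver actually bound to the port?
curl -k https://127.0.0.1:6443/healthz     # refused here too would point at the apiserver itself rather than a firewall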

@Pure-AdamuKaapan
Collaborator

Okay, so what I'm gathering from that is that it's a Kubernetes networking issue. Can you run journalctl | grep "calico" and see if there are any obvious errors in there? If nothing matches at all, that's also a telltale sign.

Also, what do the two calico checks in puctl infra status say? Do they say ready, or something else? It looks like you said everything was ready except for the kubelet node status; I just wanted to double-check.
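Roughly what I mean, as a sketch (filter however you like):

journalctl --no-pager | grep -i calico | tail -n 100                  # most recent calico-related lines
journalctl --no-pager | grep -i calico | grep -iE "error|fail|warn"   # just the ones that look like problems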

@Sephzer
Author

Sephzer commented Mar 8, 2021

Morning @Pure-AdamuKaapan, I've just run the grep command and there are no errors in there. I did find matching entries, and it all looks good: existing endpoints are found, MAC addresses and interfaces are added, etc.

Can also confirm that the two calico checks are green, no issues there. The only error is the kubelet node status.

@Pure-AdamuKaapan
Collaborator

@Sephzer that is bizarre... so far most of the errors still point towards network plugin issues / Calico not coming up right. Can you run ls /etc/cni/net.d and see if there are files like 10-calico.conflist and/or calico-kubeconfig in there?

Can you also try the following and send the output?

# export KUBECONFIG=/etc/kubernetes/admin.conf
# kubectl get pods -n kube-system
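If those both look clean, two generic follow-ups that might narrow down why the node reports not-ready (plain kubectl, and assuming the node name matches the hostname):

kubectl describe node $(hostname)          # the Ready condition's message usually repeats the CNI/network-plugin error
kubectl get pods -n kube-system -o wide    # restart counts and node placement for the calico/kube-system pods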

@Sephzer
Author

Sephzer commented Mar 12, 2021

@Pure-AdamuKaapan no output from the export command.

The second command gives the following (all entries are Running and Ready 1/1). This has been typed out manually and abbreviated.

calico-node, calicoctl, coredns (x2), etcd-hostname, kube-apiserver-hostname, kube-controller-manager-hostname, kube-proxy, kube-scheduler-hostname, tiller-deploy

Let me know if you need anything else.

@Pure-AdamuKaapan
Collaborator

@Sephzer very strange... I need some time to do some research; this is an issue I've never seen before, and everything seems to be fine as far as I can tell. I'll try to have some more ideas/information in the next couple of days.

@Pure-AdamuKaapan
Collaborator

@Sephzer just to check, since I realized this could be the case: for the various errors you say it's spitting out, is it still doing so? Or were they emitted at one point, and the services are now sitting there with no output/different output?

Also, you say that you've encountered this issue with both the OVA and the ISO, so it sounds like a reinstall alone won't fix things. I would be curious to know, though: what's the IP address (or at least the CIDR) of your VM, and what are the CIDRs set to in /etc/pure1-unplugged/config.yaml?
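In case it saves a round trip, these should show both in one go (the grep pattern is just a guess at the key names):

ip -4 addr show                                  # the VM's current address(es)
grep -i cidr /etc/pure1-unplugged/config.yaml    # the configured pod/service CIDRs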

@Sephzer
Author

Sephzer commented Apr 9, 2021

@Pure-AdamuKaapan Sorry it took so long to get back to you, things have been manic at the bank.

Let me see if I can dig up the details for you. Don't think I changed the internal subnets but I could be wrong.

@Sephzer
Author

Sephzer commented Apr 23, 2021

@Pure-AdamuKaapan apologies once again, finally got round to this.

I've just run the status command and now we are getting 15 errors... no idea why. I might reboot the VM and try again. It looks like the IP address has now disappeared as well... not really sure what is going on now. podCIDR = 192.168.0.0/16, serviceCIDR = 10.96.0.0/12.

Will attempt to get the box back online with an IP.
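A few generic commands for checking the interface state while bringing it back up (the nmcli one only applies if the appliance uses NetworkManager):

ip -4 addr show          # what addresses, if any, are assigned right now
ip link show             # is the interface even up?
nmcli device status      # NetworkManager's view, if it's in use on this box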

@Pure-AdamuKaapan
Collaborator

Oooh, that sounds fun; I wish you the best in resolving that. Once you get the VM back online, can you make sure that the pod CIDR and the service CIDR DO NOT contain the IP addresses of the VM or of the arrays/blades you're connecting to (and then please let me know what you updated them to)?

For example, if your VM and the arrays are on the 192.168.0.0/16 subnet, leave the serviceCIDR at 10.96.0.0/12 as it is now and set the podCIDR to something like 10.112.0.0/12, so the two ranges don't conflict with each other or with the arrays and the VM.
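Concretely, a sketch of what to check after editing (key names taken from the values quoted earlier in this thread; the exact layout of config.yaml may differ):

# after setting, for example:
#   podCIDR: 10.112.0.0/12
#   serviceCIDR: 10.96.0.0/12
grep -iE "podCIDR|serviceCIDR" /etc/pure1-unplugged/config.yaml   # confirm the values took
ip -4 addr show                                                   # confirm neither range overlaps the VM's own subnet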

Thanks!
