This repository has been archived by the owner on Jan 14, 2020. It is now read-only.

Error creating Kubernetes cluster #116

Open
tnbaeta opened this issue May 5, 2017 · 28 comments

tnbaeta commented May 5, 2017

I have followed the step-by-step guide on how to create a Kubernetes cluster on Photon Platform (https://github.com/vmware/photon-controller/wiki/Creating-a-Kubernetes-Cluster) and I got an error essentially saying "Unsupported operation GET_NETWORKS". I am using Mac OS for deployment. Has anyone seen this before?
Here is the output of the command:
./photon service create -n kube-socrates -k KUBERNETES --master-ip 10.1.0.200 --load-balancer-ip 10.1.0.201 --etcd1 10.1.0.202 --dns 10.1.0.137 --gateway 10.1.0.2 --netmask 255.255.255.0 -c 1 --vm_flavor cluster-small

Error: photon: Task 'be72ba9d-73f1-4200-9c44-4038fab7c48a' is in error state: {@step=={"sequence"=>"1","state"=>"ERROR","errors"=>[photon: { HTTP status: '0', code: 'InternalError', message: 'Failed to rollout KubernetesEtcd. Error: MultiException[java.lang.IllegalStateException: VmProvisionTaskService failed with error [Task "GET_NETWORKS": step "GET_NETWORKS" failed with error code "StateError", message "Unsupported operation GET_NETWORKS for vm/8988e61a-4685-4f94-8e44-7f5aebfed6a6 in state ERROR"]. /photon/servicesmanager/vm-provision-tasks/48ea197554eca473c1ee3]', data: 'map[]' }],"warnings"=>[],"operation"=>"CREATE_KUBERNETES_SERVICE_SETUP_ETCD","startedTime"=>"1494005568944","queuedTime"=>"1494005568912","endTime"=>"1494005578947","options"=>map[]}}
API Errors: [photon: { HTTP status: '0', code: 'InternalError', message: 'Failed to rollout KubernetesEtcd. Error: MultiException[java.lang.IllegalStateException: VmProvisionTaskService failed with error [Task "GET_NETWORKS": step "GET_NETWORKS" failed with error code "StateError", message "Unsupported operation GET_NETWORKS for vm/8988e61a-4685-4f94-8e44-7f5aebfed6a6 in state ERROR"]. /photon/servicesmanager/vm-provision-tasks/48ea197554eca473c1ee3]', data: 'map[]' }]

tactical-drone commented May 6, 2017

You are missing many settings.

Here is an example that works:

photon cluster create -t foo -p dev -n seven -k KUBERNETES -m service-master-vm -W cluster-worker -d service-vm-disk -w 2b5b0ad7-059d-4670-908f-909117f5ce62 -c 4 --dns 10.0.7.1 --gateway 10.0.7.1 --netmask 255.255.255.0 --master-ip 10.0.7.9 --master-ip2 10.0.7.8 --load-balancer-ip 10.0.7.5 --container-network '10.2.0.0/16' --number-of-etcds 3 --etcd1 10.0.7.20 --etcd2 10.0.7.21 --etcd3 10.0.7.22 --ssh-key ~/.ssh/id_rsa.pub --batchSize 1 --registry-ca-cert ~/foo.com.crt

The -w flag (the network ID) is where you specify which network to use. I think yours is failing to find the default network. Or maybe set your Photon console network to default, like so:

[screenshot of the Photon console network settings]
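
To find the network ID to pass with -w, list the networks Photon knows about (on 1.2 the subcommand is subnet list, as shown later in this thread; older CLI versions may name it differently):

photon subnet list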

tnbaeta commented May 6, 2017

Thanks for replying @pompomJuice, but even explicitly specifying the Network ID with the -w switch still gives the same error: "Unsupported operation GET_NETWORKS".

Any other ideas?
Thanks!

tactical-drone commented May 7, 2017

@tnbaeta No problem. Paste your entire log and command, please. This is a process of elimination.

LIV2 commented May 22, 2017

I'm getting the same error. After digging around in the photon-controller logs I see the following:

ERROR [2017-05-22 04:08:16,183] com.vmware.photon.controller.api.frontend.commands.steps.ResourceReserveStepCmd: reserve resource failed: NOT_ENOUGH_MEMORY_RESOURCE, Not enough memory resources available
ERROR [2017-05-22 04:08:16,183] com.vmware.photon.controller.api.frontend.commands.BaseCommand: Command execution failed with exception
INFO [2017-05-22 04:08:16,183] com.vmware.photon.controller.api.frontend.commands.steps.ResourceReserveStepCmd: Resource reservation failed, mark entity e7c30ab6-8932-4913-84bc-720f4e4b5c38 state as ERROR
ERROR [2017-05-22 04:08:16,188] com.vmware.photon.controller.api.frontend.commands.BaseCommand: Command execution failed with exception
! com.vmware.photon.controller.api.frontend.exceptions.external.TaskNotCompletedException: Step "StepEntity{id=null, kind=step, state=ERROR, operation=RESERVE_RESOURCE, startedTime=Mon May 22 04:08:16 UTC 2017, queuedTime=Mon May 22 04:08:16 UTC 2017, endTime=Mon May 22 04:08:16 UTC 2017, sequence=0, error=StepErrorEntity{id=null, kind=step-error, code=NotEnoughMemoryResource, message=null}}" did not complete.
ERROR [2017-05-22 04:08:16,189] com.vmware.photon.controller.api.frontend.commands.tasks.TaskCommand: Task 2759e27c-9a80-4249-a244-20ae191b997b failed
! com.vmware.photon.controller.api.frontend.exceptions.external.TaskNotCompletedException: Step "StepEntity{id=null, kind=step, state=ERROR, operation=RESERVE_RESOURCE, startedTime=Mon May 22 04:08:16 UTC 2017, queuedTime=Mon May 22 04:08:16 UTC 2017, endTime=Mon May 22 04:08:16 UTC 2017, sequence=0, error=StepErrorEntity{id=null, kind=step-error, code=NotEnoughMemoryResource, message=null}}" did not complete.
INFO [2017-05-22 04:08:16,189] com.vmware.photon.controller.api.frontend.backends.TaskXenonBackend: Task TaskEntity{id=2759e27c-9a80-4249-a244-20ae191b997b, kind=task, entityId=e7c30ab6-8932-4913-84bc-720f4e4b5c38, entityKind=vm, state=QUEUED, operation=CREATE_VM, startedTime=null, queuedTime=Mon May 22 04:08:16 UTC 2017, endTime=null} has been marked as ERROR

I'm using the following create command:
photon service create -n kube-socrates -k KUBERNETES --master-ip 192.168.50.231 --load-balancer-ip 192.168.50.232 --etcd1 192.168.50.232 --container-network 10.2.0.0/16 --dns 192.168.50.14 --gateway 192.168.50.254 --netmask 255.255.255.0 -c 1 -v cluster-small

My host has 16 GB, so there should be enough. I even tried setting the cluster-small flavor to use only 1 GB of RAM:

# photon flavor show 2e390ecd-8fe8-4334-970e-b5704a28c07d
Flavor ID: 2e390ecd-8fe8-4334-970e-b5704a28c07d
Name: cluster-small
Kind: vm
Cost: [vm 1 COUNT vm.cpu 1 COUNT vm.memory 1 GB]
State: READY

# photon tenant list
ID                                    Name
f683a922-3326-4334-962e-e0d3b12cc797  ops
    Limits:
      persistent-disk           100   COUNT
      persistent-disk.capacity  200   GB
      vm                        100   COUNT
      vm.count                  1000  COUNT
      vm.cpu                    100   COUNT
      vm.memory                 16    GB
      ephemeral-disk            100   COUNT
      ephemeral-disk.capacity   100   GB
    Usage:
      persistent-disk.capacity  200   GB
      vm                        100   COUNT
      vm.count                  1000  COUNT
      vm.cpu                    100   COUNT
      vm.memory                 16    GB
      ephemeral-disk            100   COUNT
      ephemeral-disk.capacity   100   GB
      persistent-disk           100   COUNT

# photon project list
ID                                    Name      Limit                            Usage
381a7e47-12d9-4b1c-8017-c02ac1692a62  ops-proj  ephemeral-disk.capacity 100 GB   ephemeral-disk.capacity 39 GB
                                                persistent-disk 100 COUNT        persistent-disk 0 COUNT
                                                persistent-disk.capacity 200 GB  persistent-disk.capacity 0 GB
                                                vm 100 COUNT                     vm 1 COUNT
                                                vm.count 1000 COUNT              vm.count 0 COUNT
                                                vm.cpu 100 COUNT                 vm.cpu 1 COUNT
                                                vm.memory 16 GB                  vm.memory 1 GB
                                                ephemeral-disk 100 COUNT         ephemeral-disk 1 COUNT

I am running the latest version of ESXi 6.5.

mwest44 commented May 22, 2017

Do you have any other VMs on the host? The scheduler does not use active memory when deciding VM placement; it looks at the configured memory of all VMs, whether powered on or not, to determine if resources are available. No overcommit is allowed.
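
To make that concrete: on a 16 GB host, if the VMs already placed there are configured for a combined 15 GB, a new 1 GB-flavor VM can still fail to reserve memory once overhead is counted. A rough way to audit this (assuming your CLI version has these listing subcommands):

photon vm list        # every VM counts, powered on or not
photon flavor list    # look up the configured vm.memory cost of each flavor

Add up the vm.memory cost of every VM on the host and compare the total against the host's physical RAM.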

LIV2 commented May 22, 2017

@mwest44 Ah okay, that makes sense. I've got quite a few others running. I'll spin up another host to test, thanks.

tactical-drone commented May 22, 2017

@LIV2 Yes. What I do is take stock of all my physical resources, then program that into Photon at, say, a 5:1 contention ratio, meaning you enter five times the resources you actually have. Otherwise the benefits of virtualization are lost.

LIV2 commented May 22, 2017

@pompomJuice How do you configure photon to overprovision? I haven't been able to find how to do so in the documentation.

schadr commented May 22, 2017

In 1.1.1 it is not possible. In 1.2.0 there is a config file on the ESXi host that you can update to enable over-provisioning. I'm currently on the road and not able to look at the code to double-check; I will update the thread once I find the value.

tactical-drone commented May 23, 2017

@LIV2 It's not the official way; as you can see, there is no officially documented way that I know of. I will probably replace this technique with @schadr's method.

But you can just set the quotas with 'photon tenant quota' and 'photon project quota'. If you set your tenant to have, say, 5 times more resources than it actually has, you will be able to deploy workers that are over-provisioned. ESXi will handle it automatically.

But performance issues will be more difficult to debug because of noisy neighbors and such, so just understand the consequences.
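
For example, on a host with 16 GB of RAM and 8 cores, a 5:1 ratio would mean giving the tenant a quota of 80 GB and 40 CPUs. The shape of the idea, with the exact flag syntax left out because it varies by CLI version (check photon tenant quota --help):

photon tenant quota ...     # raise the tenant's vm.memory / vm.cpu limits to 5x physical
photon project quota ...    # mirror the raised limits on the project

ESXi will then overcommit the real hardware as the quota is consumed.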

It4lik commented Jun 1, 2017

Hi,

I am on Photon Controller 1.2, with ESXi 6.5.0 (Build 4887370).
All the ESXi hosts have one datastore (default name) and two portgroups (default names too: Management and VM Network).

When I run:
ph service create -n kube-test -k KUBERNETES --master-ip 10.0.40.1 --etcd1 10.0.40.11 --load-balancer-ip 10.0.40.21 --container-network 10.0.50.0/24 --dns 10.0.20.250 --gateway 10.0.0.1 --netmask 255.255.0.0 -c 2 --vm_flavor cluster-small

DNS is Lightwave's IP address. Gateway is my gateway address (actually a pfSense box used for routing/firewall/DHCP in a test environment). The others are just random IPs in 10.0.0.0/16.

I get the same issue:

ERROR [2017-06-01 11:11:39,972] com.vmware.photon.controller.servicesmanager.tasks.WaitForNetworkTaskService: [/photon/wait-for-network-tasks/efb7a420550e419c9b7ae] java.lang.RuntimeException: [Task "GET_NETWORKS": step "GET_NETWORKS" failed with error code "StateError", message "Unsupported operation GET_NETWORKS for vm/8fcf7b42-3cd8-4c87-88b6-cbac2113a6c2 in state ERROR"]

Can't find anything else in the logs.

Here is my config:

root@photon-installer [ /var/log ]# ph tenant list
ID                                    Name
14bf7713-efdf-4022-99a8-58500c032268  test-tenant
    Limits:
      persistent-disk.capacity  40     GB
      sdn.floatingip.size       0      COUNT
      vm                        10     COUNT
      vm.memory                 20480  MB
      ephemeral-disk            10     COUNT
      ephemeral-disk.capacity   40     GB
      persistent-disk           10     COUNT
      vm.count                  20     COUNT
      vm.cpu                    10     COUNT
    Usage:
      ephemeral-disk            10     COUNT
      ephemeral-disk.capacity   40     GB
      persistent-disk           10     COUNT
      vm.count                  10     COUNT
      vm.cpu                    10     COUNT
      vm.memory                 20480  MB
      persistent-disk.capacity  40     GB
      sdn.floatingip.size       0      COUNT
      vm                        10     COUNT

Total: 1
root@photon-installer [ /var/log ]# ph project list
ID                                    Name       Limit                           Usage
1371bd7c-f94e-4ca3-bf2d-1a7a693e78e0  test-kube  ephemeral-disk.capacity 40 GB   ephemeral-disk.capacity 39 GB
                                                 persistent-disk 10 COUNT        persistent-disk 0 COUNT
                                                 persistent-disk.capacity 40 GB  persistent-disk.capacity 0 GB
                                                 vm 10 COUNT                     vm 1 COUNT
                                                 vm.count 10 COUNT               vm.count 0 COUNT
                                                 vm.cpu 10 COUNT                 vm.cpu 1 COUNT
                                                 vm.memory 20 GB                 vm.memory 2 GB
                                                 ephemeral-disk 10 COUNT         ephemeral-disk 1 COUNT

Total projects: 1
root@photon-installer [ /var/log ]# ph subnet list
ID                                    Name        Kind    Description  PrivateIpCidr  ReservedIps  State  IsDefault  PortGroups
924a400f-055e-4a9f-9846-274b1f972078  vm-network  subnet  k8s network                 map[]        READY  true       [VM Network]

Any ideas? I'll keep reading the logs...

tactical-drone commented Jun 1, 2017

Let me try to break it down; maybe we can find the issue.

--master-ip 10.0.40.1 --etcd1 10.0.40.11 --load-balancer-ip 10.0.40.21
Be aware that you need a DHCP server on this address space or your nodes won't work: they are set to get their IPs via DHCP. I suggest you make a VLAN for this (see the dnsmasq sketch at the end of this comment).

--container-network 10.0.50.0/24
This is fine.

--gateway 10.0.0.1
This is the first suspect setting. Why are your masters and etcd on 10.0.40.0/24 while your gateway is at 10.0.0.1? This won't work: your masters won't be able to reach your gateway.

--dns 10.0.20.250
Since your gateway does not work, it also cannot provide access to this IP, as this DNS IP is likewise not on the 10.0.40.0/24 network.

--netmask 255.255.0.0 -c 2 --vm_flavor cluster-small
There is a note in the docs that you must use --batchSize 1 or something; there is a race condition in the bootstrapping phase that they are clearly aware of.

My advice is to recheck your gateway and DNS settings so that they are on the same network as 10.0.40.0/24. I would guess those settings should be 10.0.40.1 and 10.0.40.10, but 10.0.40.1 is also your master, so the gateway must be some other IP.
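
If you need a quick DHCP server for that segment, a minimal dnsmasq config along these lines would do (the interface name and lease range are just examples for a lab like yours):

# /etc/dnsmasq.conf (illustrative)
interface=eth1
dhcp-range=10.0.40.100,10.0.40.200,255.255.255.0,12h
dhcp-option=option:router,10.0.40.1
dhcp-option=option:dns-server,10.0.40.10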

It4lik commented Jun 1, 2017

I'm able to provision single hosts with these settings. (EDIT: I mean a single VM, from the official Photon OVA, automatically getting an IP from my DHCP (pfSense).)

My network is a /16, not a /24 (mask 255.255.0.0, as you quoted), so 10.0.40.1 should actually be able to reach 10.0.0.1.
10.0.0.1 is a pfSense machine with multiple NICs, including one that provides NAT to the outside world.

I'll try with a /24 anyway. I didn't copy/paste all the different commands I tried before posting, but the batch-size option does not change anything. From now on, I'll add it systematically.

tactical-drone commented Jun 1, 2017

Aah I see. I missed the /16. Let me rethink this.

It4lik commented Jun 1, 2017

No problem, thank you for your answers :)

I'm staying tuned (and will keep searching and running tests...)

tactical-drone commented Jun 1, 2017

OK, your 10.0.0.0/16 won't work. Your DHCP server won't know how to provision the worker nodes; that's what I am sensing here. How would the DHCP server know that the Kubernetes worker nodes need to go on the 10.0.40.x range? Those workers get randomly generated MACs (which you might be able to detect with some pattern, in which case your DHCP might work).

Secondly, photon-controller does not have a setting for the Kubernetes cluster IP range; it is set to 10.0.0.0/24. So I am not sure whether that will clash with your 10.0.0.0/16 network.

tactical-drone commented Jun 1, 2017

But with regard to point 1, I don't think those worker nodes come into play yet in your situation, so I am not so sure about that.

What I do know is that you really need to have your network config set up right, and the documentation does not cover that part. It does not mention, for example, that your Kubernetes network needs a DHCP server that they don't provide.

It4lik commented Jun 1, 2017

But I just don't understand the need for DHCP if we specify static IP addresses? Is that for the containerized applications later on? Because we manually specify static addresses for etcd, the master, and the load balancer...

I don't know if I was clear before, but I'm able to provision a single Docker host from Photon Controller, with the Photon OS OVA, and it automatically gets an IP. My pfSense hands out IPs on the 10.0.0.0/16 network and delivers leases between 10.0.50.0 and 10.0.100.255, so I actually have a working DHCP server in my 10.0.0.0/16 private subnet.

So I'm not sure I understood. What exactly is your advice? Just forget about /16 and only use /24?

tactical-drone commented Jun 1, 2017

Apologies, I got that backwards.

Photon worker nodes require a DHCP server, not the Kubernetes network; I got confused there for a second. When you specify a Photon worker count of, say, 2, Kubernetes spawns its pods via the photon-controller interface on top of these workers. photon-controller's cloud config for workers sets them to get their addresses via DHCP (unlike its masters and etcds, which get a static configuration as dictated by the Photon setup YAML). And the DHCP lease they receive must put them on the same network as the rest of your Photon machines.

tactical-drone commented Jun 1, 2017

That Photon DHCP lease must also provide your Kubernetes cluster DNS IP, which you must set manually on both sides. If 10.0.0.0/24 clashes with your network, I have no idea how that would affect your install, but I had a clash and the install worked; the cluster's DNS was just completely messed up. Kubernetes DNS services were not working because pods inherit resolv.conf from their Docker host, and those settings in turn come from the DHCP config that you are missing.
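
A quick way to see whether a cluster is in that broken-DNS state (the pod name here is just an example; use any pod whose image has a shell and nslookup):

kubectl exec some-pod -- cat /etc/resolv.conf    # nameserver should be the cluster DNS IP, e.g. 10.0.0.10
kubectl exec some-pod -- nslookup kubernetes.default

If resolv.conf lists your site DNS instead of the cluster DNS, the workers' DHCP lease is handing out the wrong nameserver.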

It4lik commented Jun 1, 2017

Okay, acknowledged, it's for the workers.

But I just don't get "your kubernetes cluster DNS ip, which you must set manually on both sides"... What? How can I manually set an address on a non-existent machine? I can't predict its MAC, so I can't use a static DHCP lease.

It is still failing with the following: ph service create -n kube-test -k KUBERNETES --master-ip 10.0.40.1 --etcd1 10.0.40.11 --load-balancer-ip 10.0.40.21 --container-network 192.168.1.0/24 --dns 10.0.20.250 --gateway 10.0.0.1 --netmask 255.255.0.0 -c 2 --vm_flavor cluster-small --batchSize 1 -w 924a400f-055e-4a9f-9846-274b1f972078. Same error.

My DHCP now delivers addresses between 10.0.40.0 and 10.0.100.255.

When you said "10.0.0.0/24 clashes with your network", were you talking about my container network? I changed this. Still the same error.

mwest44 commented Jun 1, 2017

Your container network must be a /16. We use Flannel to handle the container networks; this /16 is carved up into individual /24 networks, one per worker node.
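
For reference, Flannel's network config (it lives in etcd) typically looks something like this; the key path and values below are the common defaults, so treat them as illustrative:

# etcd key: /coreos.com/network/config
{
  "Network": "10.2.0.0/16",
  "SubnetLen": 24
}

With SubnetLen 24, each worker node leases its own /24 out of the /16 for its containers.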

tactical-drone commented Jun 1, 2017

Kubernetes has this construct of a cluster IP. It uses this IP to route things around internally with iptables, so the IP does not exist on any physical device. Those routing rules do not work when one of your interfaces also thinks it can serve 10.0.0.0/24: iptables is a chain, and the packet never reaches the Kubernetes NAT chains if an earlier forwarding rule consumes it because one of the interfaces can also route it.

In my case, all calls from Kubernetes containers to the Kubernetes DNS (statically set to 10.0.0.10 in photon-controller's case) ended up at our company DNS, which was also 10.0.0.10 (the packet was forwarded over the NIC instead of reaching the NAT table).
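
Two quick checks for this kind of clash, run on a worker node (assuming kube-proxy is in its usual iptables mode):

ip route get 10.0.0.10         # shows whether the host would route this IP out a physical NIC
iptables -t nat -S | grep KUBE # lists the kube-proxy NAT chains that should intercept cluster IPs

If the route lookup picks a real interface for an address that is supposed to be a purely virtual cluster IP, you have the overlap described above.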

tactical-drone commented Jun 1, 2017

My DHCP now delivers addresses between 10.0.40.0 and 10.0.100.255.

That should work, as long as your routers let everyone on that network communicate. I am not so clued up on iptables routing-match behaviour with a /16 on the NIC (clearly), but from what I understand it should work. Though maybe iptables sends those packets to your gateway instead of communicating directly on the same network.

It4lik commented Jun 1, 2017

Still failing with ph service create -n k8s-test-cluster -k KUBERNETES --master-ip 10.0.90.1 --etcd1 10.0.90.2 --load-balancer-ip 10.0.90.3 --container-network 192.168.0.0/16 --dns 10.0.20.250 --gateway 10.0.0.1 --netmask 255.255.0.0 -c 3 --vm_flavor cluster-small -w 924a400f-055e-4a9f-9846-274b1f972078 --batchSize 1

DHCP does not change anything, nor does using a /16 instead of a /24 for the container network. Hmm, I don't really understand what's happening here.

My router:

  • allows me to create tunnels and access remote endpoints (UIs and the photon-controller API)
  • allows VMs to reach the outside world (my company LAN + internet)
  • allows VMs to communicate with each other

My lab is composed of a single physical ESXi host, for test purposes. This ESXi runs a pfSense host and 3 nested ESXis. The 3 nested ESXis are used for Photon Controller: the first one is for the management machines (Lightwave, load balancer and photon-controller); the two others are used to provision containerization hosts/clusters.

What I meant is that EVERYTHING is local here: pfSense acts as firewall, router, and DHCP for 10.0.0.0/16. The nested ESXis are also in this 10.0.0.0/16 (and only in this subnet).

tactical-drone commented Jun 1, 2017

I thought I read somewhere that nested ESXi won't work.

Or rather, I have definitely read that nested ESXi causes problems for some kind of cluster solution; I can't remember if it was Tectonic or Photon.

It4lik commented Jun 1, 2017

Mmmmh, okay... But only for clusters? Because every other action seems to succeed.

Well, thanks for your answers. I'm staying tuned in case someone else has ideas.

tactical-drone commented Jun 1, 2017

I can't remember. All I remember is that nested ESXi jammed some network construct.
