This GEP tackles the problem that today a dedicated LoadBalancer is needed for every single Shoot cluster's control plane. When the control plane of a Shoot cluster is provisioned, a dedicated LoadBalancer is created for it. This keeps the entire flow quite simple - the apiserver Pods are running and they are accessible via that LoadBalancer. Its hostnames / IP addresses are used for DNS records like `api.<external-domain>` and `api.<shoot>.<project>.<internal-domain>`. While this solution is simple, it comes with several issues.
There are several problems with the current setup:

- IaaS provider costs. For example, a `ClassicLoadBalancer` on AWS costs at minimum 17 USD / month.
- Quotas can limit the number of LoadBalancers you can get per account / project, limiting the number of clusters you can host under a single account.
- Lack of support for better loadbalancing algorithms than round-robin.
- Slow cluster provisioning time - depending on the provider, provisioning a LoadBalancer can take quite a while.
The goals of this proposal are the following:

- Lower downtime when workload is shuffled in the clusters, as the LoadBalancer is Kubernetes-aware.
- Only one LoadBalancer is used for all Shoot cluster API servers running in a Seed cluster.
- Out-of-cluster (end-user / robot) communication to the API server is still possible.
- In-cluster communication via the kubernetes master service (IPv4/v6 ClusterIP and `kubernetes.default.svc.cluster.local`) is possible.
- Client TLS authentication works without intermediate TLS termination (TLS is terminated by the `kube-apiserver`).
- The solution should be cloud-agnostic.
To solve the problem of having multiple `kube-apiserver`s behind a single LoadBalancer, an intermediate proxy must be placed between the cloud provider's LoadBalancer and the `kube-apiserver`s. This proxy chooses the correct Shoot API server with the help of Server Name Indication (SNI). From Wikipedia:

> Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. This allows a server to present multiple certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites (or any other service over TLS) to be served by the same IP address without requiring all those sites to use the same certificate. It is the conceptual equivalent to HTTP/1.1 name-based virtual hosting, but for HTTPS.
A rough diagram of the flow of data:
```text
+----------------+
|   Network LB   |  (accessible from clients)
+-------+--------+
        |
        v
+-------+--------+  proxy + lb  +--------------------+
|                +------------->|  Shoot API Server  |
|                |              |     Cluster A      |
|                |              +--------------------+
|     Proxy      |  proxy + lb  +--------------------+
|                +------------->|  Shoot API Server  |
|                |              |     Cluster B      |
|                |              +--------------------+
|                |  proxy + lb  +--------------------+
|                +------------->|  Shoot API Server  |
+----------------+              |     Cluster C      |
                                +--------------------+
```
Sequentially:

- the client requests `Shoot Cluster A` and sets the `Server Name` in the TLS handshake to `api.shoot-a.foo.bar`.
- this packet goes through the Network LB and is forwarded to the proxy server (this LoadBalancer should be a simple Layer-4 TCP proxy).
- the proxy server reads the packet and sees that the client requests `api.shoot-a.foo.bar`.
- based on its configuration, it maps `api.shoot-a.foo.bar` to `Shoot API Server Cluster A`.
- it acts as a TCP proxy and simply sends the data to `Shoot API Server Cluster A`.
There are multiple OSS proxies for this case:
- nginx
- HAProxy
- Envoy
- traefik
- linkerd2-proxy
To ease integration, the proxy should:

- be configurable via Kubernetes resources
- not require restarting when its configuration changes
- be fast and add little overhead
All things considered, Envoy proxy is the most fitting solution, as it provides all the features Gardener needs (no process reload on configuration changes being the most important one) and is battle-tested in production by various companies.
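For illustration, a single SNI passthrough route could be expressed in a hand-written static Envoy configuration roughly like the following sketch. The listener, cluster and namespace names are made up for this example; in the actual setup this configuration is generated by a control plane rather than written by hand:

```yaml
static_resources:
  listeners:
  - name: kube_apiserver_sni
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 443
    listener_filters:
    # Inspects the TLS ClientHello to extract the SNI without terminating TLS.
    - name: envoy.filters.listener.tls_inspector
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
    filter_chains:
    - filter_chain_match:
        server_names:
        - api.shoot-a.foo.bar
      filters:
      # Plain TCP proxying - the TLS stream is forwarded as-is to the apiserver.
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: shoot_a
          cluster: shoot-a-kube-apiserver
  clusters:
  - name: shoot-a-kube-apiserver
    connect_timeout: 5s
    type: STRICT_DNS
    load_assignment:
      cluster_name: shoot-a-kube-apiserver
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                # Hypothetical Shoot namespace in the Seed cluster.
                address: kube-apiserver.shoot--foo--bar.svc.cluster.local
                port_value: 443
```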
While building a custom control plane for Envoy is quite simple, an already established solution might be the better path forward. Istio's Pilot is one of the most feature-complete Envoy control plane solutions, as it offers a way to configure edge ingress traffic for Envoy via `Gateway` and `VirtualService` resources.

The resources which need to be created per Shoot cluster are the following:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kube-apiserver-gateway
  namespace: <shoot-namespace>
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: tls
      protocol: TLS
    tls:
      mode: PASSTHROUGH
    hosts:
    - api.<external-domain>
    - api.<shoot>.<project>.<internal-domain>
```
and a correct `VirtualService` pointing to the correct API server:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: kube-apiserver
  namespace: <shoot-namespace>
spec:
  hosts:
  - api.<external-domain>
  - api.<shoot>.<project>.<internal-domain>
  gateways:
  - kube-apiserver-gateway
  tls:
  - match:
    - port: 443
      sniHosts:
      - api.<external-domain>
      - api.<shoot>.<project>.<internal-domain>
    route:
    - destination:
        host: kube-apiserver.<shoot-namespace>.svc.cluster.local
        port:
          number: 443
```
The resources above configure Envoy to forward the raw TLS data (without termination) to the Shoot `kube-apiserver`.
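From the point of view of out-of-cluster clients nothing changes: they keep connecting via the API server's hostname, so their TLS client sets the SNI automatically, and client certificates continue to work because TLS is still terminated by the `kube-apiserver`. A minimal kubeconfig sketch (all values are placeholders) illustrating such a client:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: shoot-a
  cluster:
    # The hostname resolves to the Network LoadBalancer; the client uses it as SNI.
    server: https://api.<external-domain>
    certificate-authority-data: <base64-encoded-cluster-ca>
users:
- name: shoot-a-admin
  user:
    # Client certificates keep working, since TLS is terminated by the kube-apiserver.
    client-certificate-data: <base64-encoded-client-cert>
    client-key-data: <base64-encoded-client-key>
contexts:
- name: shoot-a
  context:
    cluster: shoot-a
    user: shoot-a-admin
current-context: shoot-a
```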
Updated diagram:
```text
+----------------+
|   Network LB   |  (accessible from clients)
+-------+--------+
        |
        v
+-------+--------+  proxy + lb  +--------------------+
|                +------------->|  Shoot API Server  |
|                |              |     Cluster A      |
|  Envoy Proxy   |              +--------------------+
|   (Ingress     |  proxy + lb  +--------------------+
|    Gateway)    +------------->|  Shoot API Server  |
|                |              |     Cluster B      |
|                |              +--------------------+
|                |  proxy + lb  +--------------------+
|                +------------->|  Shoot API Server  |
+-------+--------+              |     Cluster C      |
        |                       +--------------------+
        | get configuration
        v
+-------+--------+              +--------------------+
|     Pilot      +------------->|  Seed API Server   |<---- configured via Istio
+----------------+              +--------------------+      Custom Resources
```
In this case, the `internal` and `external` `DNSEntries` should be changed to point to the Network LoadBalancer's IP address.
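For illustration, the external entry could look roughly like the following, assuming Gardener's `dns.gardener.cloud/v1alpha1` `DNSEntry` API is used (the target IP is a placeholder for the Network LoadBalancer's address):

```yaml
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: external
  namespace: <shoot-namespace>
spec:
  dnsName: api.<external-domain>
  ttl: 120
  targets:
  # IP address of the Network LoadBalancer in front of the Envoy ingress gateway (placeholder).
  - 203.0.113.10
```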
In Kubernetes, the API server is discoverable via the master service (`kubernetes` in the `default` namespace). Today, this service can only be of type `ClusterIP` - making in-cluster communication to the API server impossible, because:

- the client doesn't set the `Server Name` in the TLS handshake if it attempts to talk to an IP address. In this case, the TLS handshake reaches the Envoy IngressGateway, but it is rejected by it.
- Kubernetes services can be of type `ExternalName`, but the master service is not supported by the kubelet.
  - even if this is fixed in future Kubernetes versions, the problem still exists for older versions where this functionality is not available.

Another issue occurs when a client tries to talk to the apiserver via the in-cluster DNS: `kubernetes.default.svc.cluster.local` is the same for all Shoot clusters, so when a client connects using that server name, the Envoy IngressGateway cannot distinguish between in-cluster clients of different Shoot clusters.
To mitigate this problem, an additional proxy must be deployed on every single Node. It does not terminate TLS and sends the traffic to the correct Shoot API Server. This is achieved as follows (a deployment sketch follows after this list):

- the apiserver master service reconciler is started and points to the `kube-apiserver`'s ClusterIP in the Seed cluster (e.g. `--advertise-address=10.1.2.3`).
- the proxy runs in the host network of the `Node`.
- the proxy has a sidecar container which:
  - creates a dummy network interface and assigns `10.1.2.3` to it.
  - removes connection tracking (conntrack) if iptables/nftables is enabled, as the IP address is local to the `Node`.
- the proxy listens on `10.1.2.3` and, using the PROXY protocol, sends the data stream to the Envoy ingress gateway (EIGW).
- the EIGW listens for the PROXY protocol on a dedicated port `8443`. It reads the destination IP + port from the PROXY protocol header and forwards the traffic to the correct upstream apiserver.
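A deployment of this node-local proxy could be sketched as a `DaemonSet` roughly as follows. All names, images, flags and the referenced ConfigMap are hypothetical and only illustrate the host-network / sidecar split described above:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: apiserver-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: apiserver-proxy
  template:
    metadata:
      labels:
        app: apiserver-proxy
    spec:
      # The proxy listens on 10.1.2.3 inside the Node's network namespace.
      hostNetwork: true
      containers:
      # Sidecar: creates the dummy interface with the advertised IP and
      # removes connection tracking for it (hypothetical image and flags).
      - name: sidecar
        image: apiserver-proxy-sidecar:latest
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
        args:
        - --ip-address=10.1.2.3
        - --interface=dummy0
      # Plain L4 proxy without TLS termination; forwards to the EIGW with the
      # PROXY protocol enabled (e.g. the nginx configuration shown below).
      - name: proxy
        image: nginx:1.25
        volumeMounts:
        - name: proxy-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
      volumes:
      - name: proxy-config
        configMap:
          name: apiserver-proxy-config
```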
The sidecar is a standalone component. It's possible to transparently change the proxy implementation without any modifications to the sidecar. The simplified flow looks like:
```text
+--------------------+   TCP   +--------------------+
|  Shoot API Server  |<--------+     Envoy IGW      |
|     Cluster A      |         |   PROXY listener   |
+--------------------+         |       :8443        |
                               +---------+----------+
                                         ^
                                         | PROXY protocol
                                         |
+----------------------------------------+------------+
|  Single Node in the Shoot cluster      |            |
|                                        |            |
|  +-------------------+     +-----------+---------+  |
|  |  Pod talking to   |     |        Proxy        |  |
|  |  the kubernetes   +---->+  no TLS termination |  |
|  |  service          |     |                     |  |
|  +-------------------+     +---------------------+  |
+------------------------------------------------------+
```
Multiple OSS solutions can be used:
- haproxy
- nginx
To add a PROXY protocol listener with Istio, several resources must be created - a dedicated `Gateway`, a dummy `VirtualService`, and an `EnvoyFilter` which adds a listener filter (`envoy.filters.listener.proxy_protocol`) on port `8443`:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: blackhole
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 8443
      name: tcp
      protocol: TCP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: blackhole
  namespace: istio-system
spec:
  hosts:
  - blackhole.local
  gateways:
  - blackhole
  tcp:
  - match:
    - port: 8443
    route:
    - destination:
        host: localhost
        port:
          number: 9999 # any dummy port will work
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: proxy-protocol
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: LISTENER
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: MERGE
      value:
        listener_filters:
        - name: envoy.filters.listener.proxy_protocol
```
For each individual `Shoot` cluster, a dedicated `FilterChainMatch` is added. It ensures that only Shoot API servers can receive traffic from this listener:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: <shoot-namespace>
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: FILTER_CHAIN
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: ADD
      value:
        filters:
        - name: envoy.filters.network.tcp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
            stat_prefix: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
            cluster: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
        filter_chain_match:
          destination_port: 443
          prefix_ranges:
          - address_prefix: 10.1.2.3 # kube-apiserver's cluster-ip
            prefix_len: 32
```
Note: this additional `EnvoyFilter` can be removed when Istio supports full L4 matching.
An nginx proxy client running on every node in the Shoot cluster could have the following configuration:
```conf
error_log /dev/stdout;

stream {
    server {
        listen 10.1.2.3:443;
        proxy_pass api.<external-domain>:8443;
        proxy_protocol on;

        proxy_protocol_timeout 5s;
        resolver_timeout 5s;
        proxy_connect_timeout 5s;
    }
}

events { }
```
Even if a future version of Kubernetes supports a master service of type `ExternalName`, we still have the problem that in-cluster workload talks to the API server via DNS. For this to work, we still need the above-mentioned proxy (this time listening on another IP address, `10.0.0.2`). An additional change to CoreDNS would be needed:
```text
default.svc.cluster.local.:8053 {
    file kubernetes.default.svc.cluster.local
}
.:8053 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```
The content of the `kubernetes.default.svc.cluster.local` zone file is going to be:
```text
$ORIGIN default.svc.cluster.local.
@   30 IN SOA local. local. (
        2017042745 ; serial
        1209600    ; refresh (2 weeks)
        1209600    ; retry (2 weeks)
        1209600    ; expire (2 weeks)
        30         ; minimum (30 seconds)
        )

    30 IN NS local.

kubernetes  IN A  10.0.0.2
```
So when a client requests `kubernetes.default.svc.cluster.local`, it will be sent to the proxy listening on that IP address.
While out of scope of this GEP, several things can be improved:
- Make the sidecar work with eBPF and environments where iptables/nftables are not enabled.