- Troubleshooting AKS cluster creation
- Troubleshooting GKE cluster creation
- Troubleshooting BKPR installation
- Troubleshooting BKPR Ingress
- Troubleshooting DNS
- Troubleshooting Elasticsearch
If you notice the following error message from `az aks create`, it could indicate that the Azure authentication token has expired.
Operation failed with status: 'Bad Request'. Details: Service principal clientID: <REDACTED>
not found in Active Directory tenant <REDACTED>, Please see https://aka.ms/acs-sp-help for more details.
Troubleshooting:
Please clear your Azure profile directory with `rm -rf ~/.azure` and retry after logging in again.
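For example, the following sequence clears the cached credentials and obtains a fresh token before you retry the `az aks create` command (a minimal sketch, assuming the Azure CLI is installed and pointed at the correct subscription):
rm -rf ~/.azure
az login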
If you encounter the following error message from `gcloud container clusters create`, it indicates that the Kubernetes Engine API has not been enabled for your Google Cloud Platform (GCP) project.
ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Kubernetes Engine API is not enabled for this project. Please ensure it is enabled in Google Cloud Console and try again
Troubleshooting:
To provision Kubernetes clusters, the Kubernetes Engine API must be enabled for the GCP project. Visit https://console.developers.google.com/apis/api/container.googleapis.com/overview and enable the Kubernetes Engine API for your project before retrying.
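Alternatively, if you prefer the gcloud CLI over the Cloud Console, the API can be enabled with a command along these lines (the project ID placeholder is yours to fill in):
gcloud services enable container.googleapis.com --project <PROJECT_ID>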
While installing BKPR on an AKS cluster, if you notice the following error message from `kubeprod install`, it indicates that an Azure service principal with the same identifier URI already exists.
ERROR Error: graphrbac.ApplicationsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="Unknown" Message="Unknown service error" Details=[{"odata.error":{"code":"Request_BadRequest","date":"2018-11-29T00:31:52","message":{"lang":"en","value":"Another object with the same value for property identifierUris already exists."},"requestId":"3c6f59e9-ad05-42fb-8ab2-3a9745eb9f68","values":[{"item":"PropertyName","value":"identifierUris"},{"item":"PropertyErrorCode","value":"ObjectConflict"}]}}]
Troubleshooting:
This is typically encountered when you attempt to install BKPR with a DNS zone (`--dns-zone`) that was used in an earlier installation of BKPR. Log in to the Azure Portal, navigate to Azure Active Directory > App registrations and filter the results with the keyword `kubeprod`. From the filtered results, remove the entries that have the BKPR DNS zone in their names and retry the BKPR installation.
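If you prefer the Azure CLI over the Portal, a sketch of the equivalent cleanup follows; the `kubeprod` display-name filter mirrors the Portal search above, and the application ID placeholder must be taken from the list output (double-check the entries before deleting them):
az ad app list --display-name kubeprod --query '[].{name:displayName, appId:appId}' --output table
az ad app delete --id <appId>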
While installing BKPR on a GKE cluster, if you encounter the following error message from `kubeprod install`, it indicates that the Google Cloud DNS API has not been enabled for your Google Cloud Platform (GCP) project.
ERROR Error: failed to create Google managed-zone "demo.allaboutif.com": googleapi: Error 403: Google Cloud DNS API has not been used in project xxxxxxxxxxx before or it is disabled.
Troubleshooting:
The BKPR installer creates a Google Cloud DNS zone for the domain specified in the `--dns-zone` flag of the `kubeprod install` command. For the installer to be able to create the Google Cloud DNS zone, the Google Cloud DNS API should be enabled for the GCP project. Visit https://console.developers.google.com/apis/api/dns.googleapis.com/overview and enable the Google Cloud DNS API for your project before retrying.
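As with the Kubernetes Engine API, you can also enable it from the gcloud CLI (the project ID placeholder is an assumption):
gcloud services enable dns.googleapis.com --project <PROJECT_ID>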
You have installed BKPR to your Kubernetes cluster, but are unable to access any of the Ingress endpoints due to DNS resolution errors.
$ ping prometheus.my-domain.com
ping: prometheus.my-domain.com: Name or service not known
Troubleshooting:
DNS resolution failures are usually the result of configuration issues. For a working DNS setup, you need to ensure all of the following conditions are met:
- ExternalDNS Pods are running
- ExternalDNS is updating DNS zone records
- DNS glue records are configured
- DNS propagation has completed
Use the following command to check the status of the `external-dns` Deployment:
$ kubectl -n kubeprod get deployments external-dns
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
external-dns 1 1 1 0 5m
The `AVAILABLE` column indicates the number of Pods that have started successfully.
Troubleshooting:
A Pod goes through various lifecycle phases before it enters the `Running` phase. You can use the following command to watch the rollout status of the `external-dns` Deployment:
$ kubectl -n kubeprod rollout status deployments external-dns
Waiting for deployment "external-dns" rollout to finish: 0 of 1 updated replicas are available...
Waiting for deployment "external-dns" rollout to finish: 1 of 1 updated replicas are available...
The command will return with the message `deployment "external-dns" successfully rolled out` after all the Pods in the `external-dns` Deployment have started successfully. Please note it could take a few minutes for the `external-dns` rollout to complete.
If it takes an abnormally long time (>5m) for the `external-dns` Pods to enter the `Running` phase, they may have encountered an error condition. Check the status of the `external-dns` Pods with the command:
$ kubectl -n kubeprod get $(kubectl -n kubeprod get pods -l name=external-dns -o name)
NAME READY STATUS RESTARTS AGE
external-dns-7bfbf596dd-55ngz 0/1 CrashLoopBackOff 3 10m
The `STATUS` column indicates the state the Pod is currently in and the `RESTARTS` column indicates the number of times the Pod has been restarted.
In the sample output you can see that the `external-dns` Pod is in the `CrashLoopBackOff` state and has been restarted repeatedly, indicating that it keeps encountering an error condition. In such situations, inspect the details of the Pod with the command:
kubectl -n kubeprod describe $(kubectl -n kubeprod get pods -l name=external-dns -o name)
Additionally, the container logs may contain useful information about the error condition. Inspect the container logs using the following command:
kubectl -n kubeprod logs $(kubectl -n kubeprod get pods -l name=external-dns -o name)
If you are unable to determine the cause of the error, create a support request describing your environment and remember to attach the output of the last two commands to the support request.
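To capture that output, you could redirect both commands to files and attach the files to the request (the file names are just examples):
kubectl -n kubeprod describe $(kubectl -n kubeprod get pods -l name=external-dns -o name) > external-dns-describe.txt
kubectl -n kubeprod logs $(kubectl -n kubeprod get pods -l name=external-dns -o name) > external-dns-logs.txt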
The DNS host records (A) can be listed using the following command:
On Google Cloud Platform:
gcloud dns record-sets list --zone $(gcloud dns managed-zones list --filter dnsName:${BKPR_DNS_ZONE} --format='value(name)') --filter type=A
On Microsoft Azure:
az network dns record-set list --resource-group ${AZURE_RESOURCE_GROUP} --zone-name ${BKPR_DNS_ZONE} --query "[?arecords!=null]" --output table
BKPR, by default, creates host records for Prometheus, Grafana and Kibana dashboards. These records should be listed in the output of the above command.
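You can also verify an individual record from outside the cloud provider's tooling with dig, for example (the prometheus hostname is just one of the BKPR endpoints; substitute any hostname under your zone):
dig prometheus.${BKPR_DNS_ZONE} A +noall +answer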
Troubleshooting:
ExternalDNS automatically manages host records for Ingress resources in the cluster. When an Ingress resource is created, it could take a few minutes for it to be seen by ExternalDNS.
If the records are not updated a few minutes after the Ingress resource is created, use the following command to inspect the container logs of the `external-dns` Pod and look for any error conditions it may have encountered:
kubectl -n kubeprod logs $(kubectl -n kubeprod get pods -l name=external-dns -o name)
The DNS glue record setup is the most basic requirement for DNS address resolution to work correctly. The glue records are typically configured at the DNS domain registrar.
Use the following command to query the glue record configuration for your domain:
dig ${BKPR_DNS_ZONE} NS +noall +answer
The command should list one or more Nameserver (NS) records configured for your domain.
Troubleshooting:
If the above command does not return any NS records, the DNS glue records have not been configured at your domain registrar. It is also possible that the glue records are configured with the wrong values.
First, use the following command to query the values of the NS records that should be set up as glue records.
On Google Cloud Platform:
gcloud dns managed-zones describe \
$(gcloud dns managed-zones list --filter dnsName:${BKPR_DNS_ZONE} --format='value(name)')
On Microsoft Azure:
az network dns zone show \
--name ${BKPR_DNS_ZONE} \
--resource-group ${AZURE_RESOURCE_GROUP} \
--query nameServers \
--output table
Next, use your domain registrar's portal to add NS records for your domain with the values displayed in the output of the previous command.
Nameserver changes typically take 0 to 24 hours to take effect, but they may also take as long as 48 hours.
Troubleshooting:
If ExternalDNS is updating DNS zone records and the DNS glue records are configured correctly, you need to wait for the DNS propagation to complete.
whatsmydns.net is a DNS propagation checker that can give you a rough idea of the DNS propagation status of your DNS records.
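You can also compare the answers from a couple of public resolvers yourself to get a feel for how far the records have propagated, for example:
dig ${BKPR_DNS_ZONE} NS @8.8.8.8 +noall +answer
dig ${BKPR_DNS_ZONE} NS @1.1.1.1 +noall +answer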
The inability to access a Kubernetes Ingress resource over HTTP/S is likely caused by one of the following scenarios:
- Ingress resource lacks the necessary annotations
- Domain self-check fails
- Let's Encrypt rate-limiting
- Invalid email address (MX records for the email domain do not exist)
In the next sections we will describe the troubleshooting steps required to identify the underlying problem and how to fix it.
Accessing `cert-manager` logs:
The first step for troubleshooting self-signed certificates is to check the `cert-manager` logs. These can be queried directly from Kibana or, if you are unable to access Kibana, the most recent logs can be retrieved directly from the running Pod by performing the following steps:
$ kubectl --namespace=kubeprod get pods --selector=name=cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-75668b9d76-t5659 1/1 Running 0 10m
The following command retrieves the most recent set of logs from the `cert-manager` Pod named as shown above:
$ kubectl --namespace=kubeprod logs cert-manager-75668b9d76-t5659
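Alternatively, the label selector used above lets you skip copying the generated Pod name, for example:
kubectl --namespace=kubeprod logs --selector=name=cert-manager --tail=100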
Troubleshooting:
A Kubernetes Ingress resource is designated as TLS-terminated at the NGINX controller when the following annotations are present:
annotations:
  kubernetes.io/ingress.class: nginx
  kubernetes.io/tls-acme: "true"
Make sure the Ingress resource you are trying to reach over HTTP/S has these annotations. For example:
$ kubectl --namespace=kubeprod describe ingress grafana
Name: grafana
Namespace: kubeprod
Address: 35.241.253.114
Default backend: default-http-backend:80 (10.48.1.8:8080)
TLS:
grafana-tls terminates grafana.example.com
Rules:
Host Path Backends
---- ---- --------
grafana.example.com
/oauth2 oauth2-proxy:4180 (<none>)
grafana.example.com
/ grafana:3000 (<none>)
Annotations:
kubecfg.ksonnet.io/garbage-collect-tag: kube_prod_runtime
kubernetes.io/ingress.class: nginx
kubernetes.io/tls-acme: true
nginx.ingress.kubernetes.io/auth-response-headers: X-Auth-Request-User, X-Auth-Request-Email
nginx.ingress.kubernetes.io/auth-signin: https://grafana.example.com/oauth2/start
nginx.ingress.kubernetes.io/auth-url: https://grafana.example.com/oauth2/auth
If this is not the case, ensure that at least the `kubernetes.io/ingress.class: nginx` and `kubernetes.io/tls-acme: "true"` annotations are present.
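If they are missing, you could add them manually while investigating, for example with `kubectl annotate` (note this is only a stopgap: BKPR manages these resources with kubecfg, so manual changes may be reverted the next time `kubeprod install` is run):
kubectl --namespace=kubeprod annotate ingress grafana kubernetes.io/ingress.class=nginx --overwrite
kubectl --namespace=kubeprod annotate ingress grafana kubernetes.io/tls-acme="true" --overwrite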
A failed domain self-check is usually signaled by `cert-manager` with log messages like the following:
E1217 16:12:27.728123 1 controller.go:180] certificates controller: Re-queuing item "kubeprod/grafana-tls" due to error processing: http-01 self-check failed for domain "grafana.example.com"
This usually means that the DNS domain name `grafana.example.com` does not resolve or the ACME protocol is not working as expected. Let's Encrypt is unable to verify that you (your Kubernetes cluster) are actually in control of the DNS domain name and refuses to issue a signed certificate.
To troubleshoot this issue, first determine the IPv4 address that the `grafana` Ingress resource is using:
$ kubectl --namespace=kubeprod get ing
NAME HOSTS ADDRESS PORTS AGE
cm-acme-http-solver-px56z grafana.example.com 35.241.253.114 80 1m
cm-acme-http-solver-rdxkg kibana.example.com 35.241.253.114 80 1m
cm-acme-http-solver-sdgdc prometheus.example.com 35.241.253.114 80 1m
grafana grafana.example.com,grafana.example.com 35.241.253.114 80, 443 1m
kibana-logging kibana.example.com,kibana.example.com 35.241.253.114 80, 443 1m
prometheus prometheus.example.com,prometheus.example.com 35.241.253.114 80, 443 1m
In this example, `grafana.example.com` should resolve to `35.241.253.114`. However, this is not the case:
$ nslookup grafana.example.com 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
** server can't find grafana.example.com: NXDOMAIN
This typically means that the DNS glue records for `example.com` are not properly configured. Please check the section named "Unable to resolve DNS addresses" above.
If the DNS glue records are properly configured but the DNS name for your Ingress resource still does not resolve, it could mean that the NGINX Ingress controller has not been assigned a public IPv4 address:
$ kubectl --namespace=kubeprod get ing
NAME HOSTS ADDRESS PORTS AGE
grafana grafana.example.com,grafana.example.com 80, 443 49s
kibana-logging kibana.example.com,kibana.example.com 80, 443 47s
prometheus prometheus.example.com,prometheus.example.com 80, 443 44s
Please wait a few minutes and check back again. If this is still the case, there must be some error condition that prevents the NGINX Ingress controller from getting a public IPv4 address. The conditions that can trigger this situation depend greatly on the underlying computing platform for Kubernetes. For example, Kubernetes on AKS or GKE depends on the public cloud infrastructure to provide a routable IPv4 address, which is usually tied to some form of load-balancing resource.
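To investigate further, you could inspect the Service that exposes the NGINX Ingress controller; a LoadBalancer Service whose EXTERNAL-IP stays pending points at a problem on the cloud side. The Service name placeholder below is an assumption, so take the actual name from the get svc output:
kubectl --namespace=kubeprod get svc
kubectl --namespace=kubeprod describe svc <nginx-ingress-service-name>
The Events section of the describe output usually contains the error reported by the cloud load balancer.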
However, if the NGINX Ingress controller is able to get a public IPv4 address, `grafana.example.com` must resolve to that same IPv4 address. For example:
$ kubectl --namespace=kubeprod get ing grafana
NAME HOSTS ADDRESS PORTS AGE
grafana grafana.example.com,grafana.example.com 35.241.253.114 80, 443 6m
The `grafana.example.com` DNS name should resolve to `35.241.253.114`. However, in this example this is not the case, as seen below:
$ nslookup grafana.example.com 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: grafana.example.com
Address: 35.241.251.76
As you can see in this example, `grafana.example.com` does not resolve to the IPv4 address stated in the corresponding Kubernetes Ingress resource. This could be caused by DNS caching and propagation issues (e.g. the TTL for the DNS record has not expired yet and the Google DNS servers are not re-querying the authoritative name server). Again, wait a few minutes and check back whether `grafana.example.com` resolves to the same IPv4 address as seen in the Ingress resource (in this example, `35.241.253.114`).
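To rule out stale caches, you could query one of the zone's authoritative name servers directly and compare its answer with what the public resolver returns; replace the placeholder below with one of the NS values returned by `dig ${BKPR_DNS_ZONE} NS +noall +answer`:
dig @<authoritative-ns> grafana.example.com A +noall +answer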
As long as `grafana.example.com` does not resolve to the IPv4 address stated in the Ingress resource, Let's Encrypt will refuse to issue the certificate. Let's Encrypt uses the ACME HTTP-01 challenge, and while the challenge is in progress you will notice some transient Ingress resources named like `cm-acme-http-solver-*`:
$ kubectl --namespace=kubeprod get ing
NAME HOSTS ADDRESS PORTS AGE
cm-acme-http-solver-px56z grafana.example.com 35.241.253.114 80 1m
cm-acme-http-solver-rdxkg kibana.example.com 35.241.253.114 80 1m
cm-acme-http-solver-sdgdc prometheus.example.com 35.241.253.114 80 1m
grafana grafana.example.com,grafana.example.com 35.241.253.114 80, 443 1m
kibana-logging kibana.example.com,kibana.example.com 35.241.253.114 80, 443 1m
prometheus prometheus.example.com,prometheus.example.com 35.241.253.114 80, 443 1m
Once `grafana.example.com` resolves properly to the IPv4 address of the Ingress resource, the certificate will eventually be issued and installed:
$ kubectl --namespace=kubeprod describe cert grafana-tls
Name: grafana-tls
Namespace: kubeprod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning IssuerNotReady 32m cert-manager Issuer letsencrypt-prod not ready
Normal CreateOrder 30m (x2 over 32m) cert-manager Created new ACME order, attempting validation...
Normal DomainVerified 28m cert-manager Domain "grafana.example.com" verified with "http-01" validation
Normal IssueCert 28m cert-manager Issuing certificate...
Normal CertObtained 28m cert-manager Obtained certificate from ACME server
Normal CertIssued 28m cert-manager Certificate issued successfully
At this point check that you can access the Kubernetes Ingress resource correctly over HTTP/S.
Let's Encrypt rate-limiting is usually signaled by `cert-manager` with log messages like the following:
E1217 22:24:31.237112 1 controller.go:180] certificates controller: Re-queuing item "kubeprod/grafana-tls" due to error processing: error getting certificate from acme server: acme: urn:ietf:params:acme:error:rateLimited: Error finalizing order :: too many certificates already issued for exact set of domains: grafana.example.com: see https://letsencrypt.org/docs/rate-limits/
This means that `cert-manager` has requested too many certificates and has exceeded the Let's Encrypt rate limits. This situation can happen when you install and uninstall BKPR too many times or when a particular DNS domain is shared among several BKPR clusters.
BKPR defaults to the production environment offered by Let's Encrypt. However, for non-production use of BKPR you can switch to the staging environment that Let's Encrypt provides. You can find more information on how to switch to the staging environment in the BKPR documentation.
An invalid email address is usually signaled by `cert-manager` with log messages like the following:
I1218 17:06:03.204699 1 controller.go:171] certificates controller: syncing item 'kubeprod/prometheus-tls'
I1218 17:06:03.204812 1 sync.go:120] Issuer letsencrypt-prod not ready
E1218 17:06:03.204903 1 controller.go:180] certificates controller: Re-queuing item "kubeprod/prometheus-tls" due to error processing: Issuer letsencrypt-prod not ready
If the e-mail address used when installing BKPR uses an invalid domain or an e-mail domain that cannot be resolved properly (missing MX DNS record), Let's Encrypt will refuse to accept requests from `cert-manager`. The actual error can be inspected by looking at the ClusterIssuer object in Kubernetes:
$ kubectl --namespace=kubeprod describe clusterissuer letsencrypt-prod
Name: letsencrypt-prod
Namespace:
Labels: kubecfg.ksonnet.io/garbage-collect-tag=kube_prod_runtime
name=letsencrypt-prod
Annotations: kubecfg.ksonnet.io/garbage-collect-tag: kube_prod_runtime
API Version: certmanager.k8s.io/v1alpha1
Kind: ClusterIssuer
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ErrVerifyACMEAccount 13m cert-manager Failed to verify ACME account: acme: urn:ietf:params:acme:error:invalidEmail: Error creating new account :: invalid contact domain. Contact emails @example.com are forbidden
To fix the issue, make sure the domain part of the e-mail address used when installing BKPR has a corresponding MX DNS record. For example, to check the MX records for bitnami.com:
$ nslookup -type=MX bitnami.com
The command should list one or more mail exchanger (MX) records for the domain.
When Elasticsearch enters a never-ending crash loop on EKS, the logs can reveal the cause:
$ kubectl --namespace=kubeprod logs elasticsearch-logging-0 elasticsearch-logging
Troubleshooting:
If you see log messages like the following:
[1]: max file descriptors [8192] for elasticsearch process is too low, increase to at least [65536]
Then you are affected by a bug that causes Elasticsearch 6.0 to enter a crash loop because of incorrect `ulimit` values. It affects EKS clusters that use the AMI image named `amazon-eks-node-1.10-v20190220`.
In order to fix this, you will need to SSH into every node in the EKS node group and remove the `ulimit` configuration of the Docker daemon:
sudo sed -i '/"nofile": {/,/}/d' /etc/docker/daemon.json
sudo systemctl restart docker
The above `sed` command turns:
{
  "bridge": "none",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  },
  "live-restore": true,
  "max-concurrent-downloads": 10,
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": 2048,
      "Hard": 8192
    }
  }
}
to
{
  "bridge": "none",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  },
  "live-restore": true,
  "max-concurrent-downloads": 10,
  "default-ulimits": {
  }
}
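Once Docker has been restarted on every node the Elasticsearch Pods will be recreated, and you can verify the new file descriptor limit from inside the container; a minimal check, assuming the Pod and container names shown earlier:
kubectl --namespace=kubeprod exec elasticsearch-logging-0 -c elasticsearch-logging -- sh -c 'ulimit -n'
The reported value should be at least 65536.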