Skip to content

Commit

Permalink
Prevent flapping of external DNS configuration (k8gb-io#1767)
Browse files Browse the repository at this point in the history
A flapping DNSEndpoint affects external DNS configuration, which introduces unwanted behavior during e2e tests.

# How the controller uses externalDNS to configure zone delegation

K8GB uses a DNSEndpoint to configure zone delegation on the upstream DNS servers. This DNSEndpoint is picked up by ExternalDNS and looks as follows:

```
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  annotations:
    k8gb.absa.oss/dnstype: extdns
  creationTimestamp: "2024-10-27T11:03:38Z"
  generation: 1
  name: k8gb-ns-extdns
  namespace: k8gb
  resourceVersion: "1608"
  uid: 2ff4476f-0efd-4de7-96e7-605ca2a8fc78
spec:
  endpoints:
  - dnsName: cloud.example.com recordTTL: 5 recordType: NS targets:
    - gslb-ns-eu-cloud.example.com
    - gslb-ns-us-cloud.example.com
  - dnsName: gslb-ns-eu-cloud.example.com recordTTL: 5 recordType: A targets:
    - 172.19.0.6
    - 172.19.0.7
```

This resource is independent of GSLB resources, but it is still updated on every reconciliation loop, since that is the only chance the controller has to update resources. In the end to end tests we do not know the IP address on which coreDNS is exposed, therefore we abuse the GSLB resource to fetch the IP addresses of the nodes in the cluster (see the update in controllers/providers/dns/external.go). However, this is quite some negative consequences if values are different for different GSLB resources, which happens quite often.

# Flapping affects e2e tests

While trying out chainsaw I tried to increase the parallelism of tests, since I would like to have all e2e tests running simultaneously. This would prevent testing time to grow linearly with the number of strategies or ingress integration.

## Frequent DNSEndpoint updates

Unfortunately, having all GSLB resources trying to modify the DNSEndpoint with different values resulted in flaky tests. This DNSEndpoint is important for the tests since it contains the records necessary for cross-cluster communication, if it is not available then K8GB instances on different clusters cannot discover their peers which leads to the following error:

```
2024-10-27T10:27:36Z WRN github.com/k8gb-io/k8gb/controllers/providers/assistant/gslb.go:255 > can't resolve FQDN using nameservers error="exchange error: all dns servers were tried and none of them were able to resolve, err: dial udp: lookup gslb-ns-us-cloud.example.com on 10.43.0.10:53: no such host" fqdn=localtargets-roundrobin-istio.cloud.example.com. nameservers=[{"Host":"gslb-ns-us-cloud.example.com","Port":1053},{"Host":"gslb-ns-us-cloud.example.com","Port":1053},{"Host":"gslb-ns-us-cloud.example.com","Port":53}]
```

### Example
* A GSLB using Kubernetes Ingress is created. The IP address is still not assigned by the cluster -> A DNSEndpoint is created with empty targets, so discovery of other clusters is not yet possible
* The same GSLB has now an IP address assigned -> The DNSEndpoint is updated with target, so discovery is now possible
* A new GSLB using Kubernetes Ingress is created. The IP address is still not assigned by the cluster -> The DNSEndpoint is updated, since there are no targets the discovery of other clusters is no longer possible

If the timings are unfortunate enough it would be possible that cluster discovery is always unavailable when we reconcile a particular GSLB resource, thus resulting in the advertisement of incorrect targets.

### Solution

This PR proposed to fix this issue by not updating the DNSEndpoint if the list of targets that comes from the gslb resource and is empty. This should be only relevant for testing since all production usecases should use a coreDNS exposed via a load balancing service.

## DNSEndpoint deletion

Additionally, the deletion of a GSLB resource was also leading to the deletion of the ExternalDNS resource, even if there were additional GSLB resources still in use. This also led disruption of the cross-cluster communication until the next GSLB resource is reconciled. This problem is fixed on the finalizer, by deleting the ExternalDNS resource only when the last GSLB resource is deleted.

## TTL flapping

Lastly, even though this didn't affect the e2e tests I noticed that the TTL also flaps, since different GSLB resources may have different TTLs. To stabilize it we can create a new configuration option to set the TTL for the NS and glue record.

Signed-off-by: Andre Aguas <[email protected]>

---------

Signed-off-by: Andre Aguas <[email protected]>
  • Loading branch information
abaguas authored Nov 6, 2024
1 parent 2fe8089 commit 31c0d09
Show file tree
Hide file tree
Showing 19 changed files with 85 additions and 96 deletions.
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ HELM_ARGS ?=
K8GB_COREDNS_IP ?= kubectl get svc k8gb-coredns -n k8gb -o custom-columns='IP:spec.clusterIP' --no-headers
LOG_FORMAT ?= simple
LOG_LEVEL ?= debug
CONTROLLER_GEN_VERSION ?= v0.15.0
CONTROLLER_GEN_VERSION ?= v0.16.5
GOLIC_VERSION ?= v0.7.2
GOLANGCI_VERSION ?= v1.60.3
POD_NAMESPACE ?= k8gb
Expand Down
1 change: 1 addition & 0 deletions chart/k8gb/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,7 @@ For Kubernetes `< 1.19` use this chart and k8gb in version `0.8.8` or lower.
| k8gb.log.format | string | `"simple"` | log format (simple,json) |
| k8gb.log.level | string | `"info"` | log level (panic,fatal,error,warn,info,debug,trace) |
| k8gb.metricsAddress | string | `"0.0.0.0:8080"` | Metrics server address |
| k8gb.nsRecordTTL | int | `30` | TTL of the NS and respective glue record used by external DNS |
| k8gb.podAnnotations | object | `{}` | pod annotations |
| k8gb.podLabels | object | `{}` | pod labels |
| k8gb.reconcileRequeueSeconds | int | `30` | Reconcile time in seconds |
Expand Down
8 changes: 5 additions & 3 deletions chart/k8gb/crd/k8gb.absa.oss_gslbs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.15.0
controller-gen.kubebuilder.io/version: v0.16.5
name: gslbs.k8gb.absa.oss
spec:
group: k8gb.absa.oss
Expand Down Expand Up @@ -111,6 +111,7 @@ spec:
format: int32
type: integer
type: object
x-kubernetes-map-type: atomic
required:
- name
type: object
Expand Down Expand Up @@ -150,8 +151,8 @@ spec:
these may change in the future.\nIncoming requests are
matched against the host before the\nIngressRuleValue.
If the host is unspecified, the Ingress routes all\ntraffic
based on the specified IngressRuleValue.\n\n\nHost can
be \"precise\" which is a domain name without the terminating
based on the specified IngressRuleValue.\n\nHost can be
\"precise\" which is a domain name without the terminating
dot of\na network host (e.g. \"foo.bar.com\") or \"wildcard\",
which is a domain name\nprefixed with a single wildcard
label (e.g. \"*.foo.com\").\nThe wildcard character '*'
Expand Down Expand Up @@ -240,6 +241,7 @@ spec:
format: int32
type: integer
type: object
x-kubernetes-map-type: atomic
required:
- name
type: object
Expand Down
4 changes: 3 additions & 1 deletion chart/k8gb/templates/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,9 @@ spec:
- name: DNS_ZONE
value: {{ .Values.k8gb.dnsZone }}
- name: RECONCILE_REQUEUE_SECONDS
value: {{ quote .Values.k8gb.reconcileRequeueSeconds}}
value: {{ quote .Values.k8gb.reconcileRequeueSeconds }}
- name: NS_RECORD_TTL
value: {{ quote .Values.k8gb.nsRecordTTL }}
{{- if .Values.infoblox.enabled }}
- name: INFOBLOX_GRID_HOST
valueFrom:
Expand Down
4 changes: 4 additions & 0 deletions chart/k8gb/values.schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -297,6 +297,10 @@
"type": "integer",
"minimum": 0
},
"nsRecordTTL": {
"type": "integer",
"minimum": 0
},
"log": {
"$ref": "#/definitions/k8gbLog"
},
Expand Down
2 changes: 2 additions & 0 deletions chart/k8gb/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,8 @@ k8gb:
extGslbClustersGeoTags: "us"
# -- Reconcile time in seconds
reconcileRequeueSeconds: 30
# -- TTL of the NS and respective glue record used by external DNS
nsRecordTTL: 30
coredns:
# -- Extra CoreDNS server blocks
extraServerBlocks: ""
Expand Down
2 changes: 2 additions & 0 deletions controllers/depresolver/depresolver.go
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,8 @@ type Infoblox struct {
type Config struct {
// Reschedule of Reconcile loop to pickup external Gslb targets
ReconcileRequeueSeconds int `env:"RECONCILE_REQUEUE_SECONDS, default=30"`
// TTL of the NS and respective glue record used by external DNS
NSRecordTTL int `env:"NS_RECORD_TTL, default=30"`
// ClusterGeoTag to determine specific location
ClusterGeoTag string `env:"CLUSTER_GEO_TAG"`
// ExtClustersGeoTags to identify clusters in other locations in format separated by comma. i.e.: "eu,uk,us"
Expand Down
5 changes: 5 additions & 0 deletions controllers/depresolver/depresolver_config.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ import (
// Environment variables keys
const (
ReconcileRequeueSecondsKey = "RECONCILE_REQUEUE_SECONDS"
NSRecordTTLKey = "NS_RECORD_TTL"
ClusterGeoTagKey = "CLUSTER_GEO_TAG"
ExtClustersGeoTagsKey = "EXT_GSLB_CLUSTERS_GEO_TAGS"
ExtDNSEnabledKey = "EXTDNS_ENABLED"
Expand Down Expand Up @@ -120,6 +121,10 @@ func (dr *DependencyResolver) validateConfig(config *Config, recognizedDNSTypes
if err != nil {
return err
}
err = field(NSRecordTTLKey, config.NSRecordTTL).isHigherThanZero().err
if err != nil {
return err
}
err = field(ClusterGeoTagKey, config.ClusterGeoTag).isNotEmpty().matchRegexp(geoTagRegex).err
if err != nil {
return err
Expand Down
24 changes: 22 additions & 2 deletions controllers/depresolver/depresolver_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ const (

var predefinedConfig = Config{
ReconcileRequeueSeconds: 30,
NSRecordTTL: 30,
ClusterGeoTag: "us",
ExtClustersGeoTags: []string{"za", "eu"},
EdgeDNSType: DNSTypeInfoblox,
Expand Down Expand Up @@ -264,6 +265,24 @@ func TestResolveConfigWithZeroReconcileRequeueSecondsKey(t *testing.T) {
arrangeVariablesAndAssert(t, expected, assert.Error)
}

func TestResolveConfigWithNegativeNSRecordTTL(t *testing.T) {
// arrange
defer cleanup()
expected := predefinedConfig
expected.NSRecordTTL = -1
// act,assert
arrangeVariablesAndAssert(t, expected, assert.Error)
}

func TestResolveConfigWithZeroReconcileNSRecordTTL(t *testing.T) {
// arrange
defer cleanup()
expected := predefinedConfig
expected.NSRecordTTL = 0
// act,assert
arrangeVariablesAndAssert(t, expected, assert.Error)
}

// remove this test once the deprecated key is no longer supported
func TestResolveConfigWithDeprecatedEdgeDNSServerKey(t *testing.T) {
// arrange
Expand Down Expand Up @@ -1504,8 +1523,8 @@ func arrangeVariablesAndAssert(t *testing.T, expected Config,
}

func cleanup() {
for _, s := range []string{ReconcileRequeueSecondsKey, ClusterGeoTagKey, ExtClustersGeoTagsKey, EdgeDNSZoneKey, DNSZoneKey, EdgeDNSServersKey,
ExtDNSEnabledKey, InfobloxGridHostKey, InfobloxVersionKey, InfobloxPortKey, InfobloxUsernameKey,
for _, s := range []string{ReconcileRequeueSecondsKey, NSRecordTTLKey, ClusterGeoTagKey, ExtClustersGeoTagsKey, EdgeDNSZoneKey, DNSZoneKey,
EdgeDNSServersKey, ExtDNSEnabledKey, InfobloxGridHostKey, InfobloxVersionKey, InfobloxPortKey, InfobloxUsernameKey,
InfobloxPasswordKey, K8gbNamespaceKey, CoreDNSExposedKey, InfobloxHTTPRequestTimeoutKey,
InfobloxHTTPPoolConnectionsKey, LogLevelKey, LogFormatKey, LogNoColorKey, MetricsAddressKey, SplitBrainCheckKey, TracingEnabled,
TracingSamplingRatio, OtelExporterOtlpEndpoint} {
Expand All @@ -1519,6 +1538,7 @@ func configureEnvVar(config Config) {
_ = os.Unsetenv(EdgeDNSServerKey)
_ = os.Unsetenv(EdgeDNSServerPortKey)
_ = os.Setenv(ReconcileRequeueSecondsKey, strconv.Itoa(config.ReconcileRequeueSeconds))
_ = os.Setenv(NSRecordTTLKey, strconv.Itoa(config.NSRecordTTL))
_ = os.Setenv(ClusterGeoTagKey, config.ClusterGeoTag)
_ = os.Setenv(ExtClustersGeoTagsKey, strings.Join(config.ExtClustersGeoTags, ","))
_ = os.Setenv(EdgeDNSServersKey, config.EdgeDNSServers.String())
Expand Down
2 changes: 1 addition & 1 deletion controllers/finalize.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ func (r *GslbReconciler) finalizeGslb(gslb *k8gbv1beta1.Gslb) (err error) {
// needs to do before the CR can be deleted. Examples
// of finalizers include performing backups and deleting
// resources that are not owned by this CR, like a PVC.
err = r.DNSProvider.Finalize(gslb)
err = r.DNSProvider.Finalize(gslb, r.Client)
if err != nil {
log.Err(err).
Str("gslb", gslb.Name).
Expand Down
1 change: 1 addition & 0 deletions controllers/gslb_controller_reconciliation_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@ var crSampleYaml = "../deploy/crds/k8gb.absa.oss_v1beta1_gslb_cr_roundrobin_ingr

var predefinedConfig = depresolver.Config{
ReconcileRequeueSeconds: 30,
NSRecordTTL: 30,
ClusterGeoTag: "us-west-1",
ExtClustersGeoTags: []string{"us-east-1"},
EdgeDNSServers: []utils.DNSServer{
Expand Down
9 changes: 5 additions & 4 deletions controllers/mocks/provider_mock.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion controllers/providers/dns/dns.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ Generated by GoLic, for more details see: https://github.com/AbsaOSS/golic
import (
k8gbv1beta1 "github.com/k8gb-io/k8gb/api/v1beta1"
"github.com/k8gb-io/k8gb/controllers/providers/assistant"
"sigs.k8s.io/controller-runtime/pkg/client"
externaldns "sigs.k8s.io/external-dns/endpoint"
)

Expand All @@ -32,7 +33,7 @@ type Provider interface {
// SaveDNSEndpoint update DNS endpoint in gslb or create new one if doesn't exist
SaveDNSEndpoint(*k8gbv1beta1.Gslb, *externaldns.DNSEndpoint) error
// Finalize finalize gslb in k8gbNamespace
Finalize(*k8gbv1beta1.Gslb) error
Finalize(*k8gbv1beta1.Gslb, client.Client) error
// String see: Stringer interface
String() string
}
3 changes: 2 additions & 1 deletion controllers/providers/dns/empty.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
k8gbv1beta1 "github.com/k8gb-io/k8gb/api/v1beta1"
"github.com/k8gb-io/k8gb/controllers/depresolver"
"github.com/k8gb-io/k8gb/controllers/providers/assistant"
"sigs.k8s.io/controller-runtime/pkg/client"
externaldns "sigs.k8s.io/external-dns/endpoint"
)

Expand Down Expand Up @@ -50,7 +51,7 @@ func (p *EmptyDNSProvider) SaveDNSEndpoint(gslb *k8gbv1beta1.Gslb, i *externaldn
return p.assistant.SaveDNSEndpoint(gslb.Namespace, i)
}

func (p *EmptyDNSProvider) Finalize(gslb *k8gbv1beta1.Gslb) (err error) {
func (p *EmptyDNSProvider) Finalize(gslb *k8gbv1beta1.Gslb, _ client.Client) (err error) {
return p.assistant.RemoveEndpoint(gslb.Name)
}

Expand Down
23 changes: 21 additions & 2 deletions controllers/providers/dns/external.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ Generated by GoLic, for more details see: https://github.com/AbsaOSS/golic
*/

import (
"context"
"fmt"
"sort"
"strings"
Expand All @@ -30,6 +31,7 @@ import (
k8gbv1beta1 "github.com/k8gb-io/k8gb/api/v1beta1"
"github.com/k8gb-io/k8gb/controllers/depresolver"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
externaldns "sigs.k8s.io/external-dns/endpoint"
)

Expand All @@ -52,7 +54,7 @@ func NewExternalDNS(config depresolver.Config, assistant assistant2.Assistant) *
}

func (p *ExternalDNSProvider) CreateZoneDelegationForExternalDNS(gslb *k8gbv1beta1.Gslb) error {
ttl := externaldns.TTL(gslb.Spec.Strategy.DNSTtlSeconds)
ttl := externaldns.TTL(p.config.NSRecordTTL)
log.Info().
Interface("provider", p).
Msg("Creating/Updating DNSEndpoint CRDs")
Expand All @@ -66,6 +68,11 @@ func (p *ExternalDNSProvider) CreateZoneDelegationForExternalDNS(gslb *k8gbv1bet
if p.config.CoreDNSExposed {
NSServerIPs, err = p.assistant.CoreDNSExposedIPs()
} else {
if len(gslb.Status.LoadBalancer.ExposedIPs) == 0 {
// do not update DNS Endpoint for External DNS if no IPs are exposed
// new GSLB resources may have this field empty
return nil
}
NSServerIPs = gslb.Status.LoadBalancer.ExposedIPs
}
if err != nil {
Expand Down Expand Up @@ -101,7 +108,19 @@ func (p *ExternalDNSProvider) CreateZoneDelegationForExternalDNS(gslb *k8gbv1bet
return nil
}

func (p *ExternalDNSProvider) Finalize(*k8gbv1beta1.Gslb) error {
func (p *ExternalDNSProvider) Finalize(_ *k8gbv1beta1.Gslb, k8sClient client.Client) error {
gslbList := &k8gbv1beta1.GslbList{}

err := k8sClient.List(context.TODO(), gslbList)
if err != nil {
return err
}

// only remove the DNSEndpoint if there are no more GSLB resources
if len(gslbList.Items) > 1 {
return nil
}

return p.assistant.RemoveEndpoint(p.endpointName)
}

Expand Down
5 changes: 3 additions & 2 deletions controllers/providers/dns/external_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ var a = struct {
}{
Config: depresolver.Config{
ReconcileRequeueSeconds: 30,
NSRecordTTL: 30,
ClusterGeoTag: "us",
ExtClustersGeoTags: []string{"za", "eu"},
EdgeDNSServers: []utils2.DNSServer{
Expand Down Expand Up @@ -96,13 +97,13 @@ var expectedDNSEndpoint = &externaldns.DNSEndpoint{
Endpoints: []*externaldns.Endpoint{
{
DNSName: a.Config.DNSZone,
RecordTTL: 30,
RecordTTL: externaldns.TTL(a.Config.NSRecordTTL),
RecordType: "NS",
Targets: a.TargetNSNamesSorted,
},
{
DNSName: "gslb-ns-us-cloud.example.com",
RecordTTL: 30,
RecordTTL: externaldns.TTL(a.Config.NSRecordTTL),
RecordType: "A",
Targets: a.TargetIPs,
},
Expand Down
3 changes: 2 additions & 1 deletion controllers/providers/dns/infoblox.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ import (
"reflect"
"time"

"sigs.k8s.io/controller-runtime/pkg/client"
externaldns "sigs.k8s.io/external-dns/endpoint"

ibcl "github.com/infobloxopen/infoblox-go-client"
Expand Down Expand Up @@ -156,7 +157,7 @@ func (p *InfobloxProvider) CreateZoneDelegationForExternalDNS(gslb *k8gbv1beta1.
return nil
}

func (p *InfobloxProvider) Finalize(gslb *k8gbv1beta1.Gslb) error {
func (p *InfobloxProvider) Finalize(gslb *k8gbv1beta1.Gslb, _ client.Client) error {
objMgr, err := p.client.GetObjectManager()
if err != nil {
return err
Expand Down
2 changes: 1 addition & 1 deletion controllers/providers/dns/infoblox_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -253,7 +253,7 @@ func TestInfobloxFinalize(t *testing.T) {
provider := NewInfobloxDNS(config, a, cl)

// act
err := provider.Finalize(defaultGslb)
err := provider.Finalize(defaultGslb, nil)

// assert
assert.NoError(t, err)
Expand Down
Loading

0 comments on commit 31c0d09

Please sign in to comment.