The Qubership Disaster Recovery Daemon
(DRD) is a service that establishes communication between the Site Manager
and the current cluster operator or disaster recovery controller.
DRD provides the following features:
- Disaster Recovery Server implements the Site Manager contract and manages the current mode in the DR resource.
- Disaster Recovery Controller provides the ability to implement a DR controller for services that do not have their own operator.
An example of a DRD chart template is presented here.
DRD provides all the REST endpoints required by the Site Manager contract and takes its data from the Kubernetes Custom Resource and the Kubernetes API. By default, the cluster operator manages this Custom Resource and contains the service switchover logic. DRD only triggers the cluster operator by changing the Custom Resource.
DRD is delivered as a Docker image and is configured through a list of environment variables.
DRD can be deployed in Kubernetes as a separate pod or as a sidecar container in the operator pod.
Name | Format | Description | Example | Required |
---|---|---|---|---|
NAMESPACE | A string. | The name of the service namespace. | rabbitmq-service | true |
RESOURCE_FOR_DR | Four words in a single string separated by single spaces. | This parameter specifies the four values used to find the Kubernetes Custom Resource: group, version, resource, and name of the Custom Resource. The first word, group, can be empty (""). | netcracker.com v2 rabbitmqservices rabbitmq-service | true |
USE_DEFAULT_PATHS | A single boolean word. | If this parameter is true, the default values are used instead of the DISASTER_RECOVERY_* environment variable values. | true | false |
DISASTER_RECOVERY_MODE_PATH | Several words separated by dots. | This parameter specifies the path to the disaster recovery mode field in the Custom Resource. | spec.disasterRecovery.mode | true if USE_DEFAULT_PATHS variable is not set to true |
DISASTER_RECOVERY_NOWAIT_PATH | Several words separated by dots. | This parameter specifies the path to the disaster recovery no-wait field in the Custom Resource. | spec.disasterRecovery.noWait | true if USE_DEFAULT_PATHS variable is not set to true |
DISASTER_RECOVERY_STATUS_MODE_PATH | Several words separated by dots. | This parameter specifies the path to the disaster recovery status mode field in the Custom Resource. | status.disasterRecoveryStatus.mode | true if USE_DEFAULT_PATHS variable is not set to true |
DISASTER_RECOVERY_STATUS_STATUS_PATH | Several words separated by dots. | This parameter specifies the path to the disaster recovery status field in the Custom Resource. | status.disasterRecoveryStatus.status | true if USE_DEFAULT_PATHS variable is not set to true |
DISASTER_RECOVERY_STATUS_COMMENT_PATH | Several words separated by dots. | This parameter specifies the path to the disaster recovery status comment field in the Custom Resource. | status.disasterRecoveryStatus.comment | false |
DISASTER_RECOVERY_NOWAIT_AS_STRING | A single boolean word. | If this parameter is true, the disaster recovery daemon uses string values for the no-wait parameter; otherwise a boolean value is used. | false | false |
HEALTH_MAIN_SERVICES_ACTIVE | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the main services for the health check on the active side. | deployment kafka-1,deployment kafka-2 | true |
HEALTH_ADDITIONAL_SERVICES_ACTIVE | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the additional services for the health check on the active side. | deployment rabbitmq-backup-daemon | false |
HEALTH_MAIN_SERVICES_STANDBY | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the main services for the health check on the standby side. If the parameter is empty or absent, the health status is always UP on the standby side. | deployment kafka-1,deployment kafka-2 | false |
HEALTH_ADDITIONAL_SERVICES_STANDBY | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the additional services for the health check on the standby side. | deployment rabbitmq-backup-daemon | false |
HEALTH_MAIN_SERVICES_DISABLED | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the main services for the health check on the disabled side. If the parameter is empty or absent, the health status is always UP on the disabled side. | deployment kafka-1,deployment kafka-2 | false |
HEALTH_ADDITIONAL_SERVICES_DISABLED | Several word pairs separated by commas. Each pair contains two words separated by a single space. The first word is a Kubernetes workload type and the second one is the workload name. | This parameter specifies the additional services for the health check on the disabled side. | deployment rabbitmq-backup-daemon | false |
SITE_MANAGER_SERVICE_ACCOUNT_NAME | A single word. | This parameter specifies the Site Manager service account name. | site-manager | false |
SITE_MANAGER_NAMESPACE | A single word. | This parameter specifies the Site Manager namespace. | site-manager | false |
SITE_MANAGER_CUSTOM_AUDIENCE | A single word. | This parameter specifies the Site Manager custom audience applied to the token during authentication. | sm-services | false |
SERVER_PORT | A number. | This parameter specifies the DRD server port. The default value is 8068. | 8069 | false |
ADDITIONAL_HEALTH_ENDPOINT | A string. | This parameter specifies an additional health endpoint. The endpoint response contains information about the full cluster health state (if EXTERNAL_FULL_HEALTH_ENABLED is true) or the additional cluster health state (if EXTERNAL_FULL_HEALTH_ENABLED is false). In the second case, the result is calculated as for the HEALTH_ADDITIONAL_SERVICES variable. | http://(POD_IP):8069/healthz | false |
EXTERNAL_FULL_HEALTH_ENABLED | A boolean string. | If this parameter is true, the ADDITIONAL_HEALTH_ENDPOINT variable is used as the external full health endpoint. In this case the HEALTH_* environment variables are not necessary. | true | false |
TLS_ENABLED | A boolean string. | If this parameter is true, TLS is enabled for the DRD container. | false | false |
CERTS_PATH | Path string. | This parameter specifies the path to the folder with TLS certificates in the DRD container. | /tls/ | false |
CIPHER_SUITES | Comma-separated list of strings. Each word is a suite name supported by Go, e.g. TLS_RSA_WITH_3DES_EDE_CBC_SHA | This parameter specifies the list of cipher suites that are used to negotiate the security settings for a network connection using the TLS or SSL network protocol. | "" | false |
TREAT_STATUS_AS_FIELD | A boolean. | This parameter specifies whether the resource status should be treated as a field. It is necessary when the resource addressed by DISASTER_RECOVERY_STATUS_STATUS_PATH does not have a Status sub-resource. In that case the status is set as a field of the chosen resource. For example, it may be applicable to some custom resources or ConfigMaps. | false | false |
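The two composite formats in the table, RESOURCE_FOR_DR and the HEALTH_* service lists, can be illustrated with a short Go sketch. This is not DRD's own parsing code: the names resourceRef, workload, parseResourceForDR and parseWorkloads are hypothetical, and the sketch assumes that an empty group is written as a literal pair of double quotes, as in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceRef is a hypothetical holder for the four RESOURCE_FOR_DR words.
type resourceRef struct {
	Group, Version, Resource, Name string
}

// workload is a hypothetical holder for one "<type> <name>" pair.
type workload struct {
	Kind, Name string
}

// parseResourceForDR splits "group version resource name" on single spaces.
// Assumption: an empty group is written as a literal "" (two quote characters).
func parseResourceForDR(value string) (resourceRef, error) {
	parts := strings.Split(value, " ")
	if len(parts) != 4 {
		return resourceRef{}, fmt.Errorf("expected 4 words, got %d", len(parts))
	}
	group := parts[0]
	if group == `""` {
		group = ""
	}
	return resourceRef{Group: group, Version: parts[1], Resource: parts[2], Name: parts[3]}, nil
}

// parseWorkloads splits "type name,type name" into pairs.
// An empty value yields no workloads and no error.
func parseWorkloads(value string) ([]workload, error) {
	if strings.TrimSpace(value) == "" {
		return nil, nil
	}
	var result []workload
	for _, item := range strings.Split(value, ",") {
		fields := strings.Fields(item)
		if len(fields) != 2 {
			return nil, fmt.Errorf("expected \"<type> <name>\", got %q", item)
		}
		result = append(result, workload{Kind: fields[0], Name: fields[1]})
	}
	return result, nil
}

func main() {
	ref, err := parseResourceForDR(`"" v1 configmaps example-service-dr-config`)
	fmt.Println(ref, err)

	workloads, err := parseWorkloads("deployment kafka-1,deployment kafka-2")
	fmt.Println(workloads, err)
}
```

Rejecting malformed values early, as the sketch does, makes a misconfigured environment variable visible at startup rather than at the first switchover.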
The DRD REST server provides three methods of interaction:
- GET healthz method allows finding out the state of the current cluster side.

  curl -XGET localhost:8068/healthz

  Where 8068 is the default server port.

  The response to such a request is as follows:

  {"status":"up"}

  Where:
  - status is the current state of the cluster side. The four possible status values are as follows:
    - up - All service's workloads are ready.
    - degraded - Some of the service's workloads (the main health service or additional health service) are not ready.
    - down - The main health service is down.
    - disabled - The service is switched off.
- GET sitemanager method allows finding out the mode of the current cluster side and the actual state of the switchover procedure.

  curl -XGET localhost:8068/sitemanager

  Where 8068 is the default server port.

  The response to such a request is as follows:

  {"mode":"standby","status":"done"}

  Where:
  - mode is the mode in which the cluster side is deployed. The possible mode values are as follows:
    - active - The service accepts external requests from clients.
    - standby - The service does not accept external requests from clients.
    - disabled - The service does not accept external requests from clients.
  - status is the current state of switchover for the service cluster side. The three possible status values are as follows:
    - running - The switchover is in progress.
    - done - The switchover is successful.
    - failed - Something went wrong during the switchover.
  - comment is the message which contains a detailed description of the problem.
- POST sitemanager method allows switching mode for the current side of the service cluster.

  curl -XPOST -H "Content-Type: application/json" localhost:8068/sitemanager -d '{"mode":"<MODE>"}'

  Where:
  - 8068 is the default server port.
  - <MODE> is the mode to be applied to the cluster side. The possible mode values are as follows:
    - active - The service accepts external requests from clients.
    - standby - The service does not accept external requests from clients.
    - disabled - The service does not accept external requests from clients.

  The response to such a request is as follows:

  {"mode":"standby"}

  Where:
  - mode is the mode that is applied to the cluster side. The possible values are active, standby, and disabled.
  - status is the state of the request on the REST server. The only possible value is failed, when something goes wrong while processing the request.
  - comment is the message which contains a detailed description of the problem and is only filled out if the status value is failed.
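For a client written in Go, the response bodies described above can be decoded into small structs. The following is a minimal sketch: the type and function names (healthResponse, switchoverResponse, decodeHealth, decodeSwitchover, switchoverSucceeded) are hypothetical and are not part of the DRD module, and only the JSON field names are taken from the responses shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthResponse mirrors the documented GET healthz body.
type healthResponse struct {
	Status string `json:"status"`
}

// switchoverResponse mirrors the documented sitemanager bodies.
// Status and Comment stay empty when the server does not send them.
type switchoverResponse struct {
	Mode    string `json:"mode"`
	Status  string `json:"status"`
	Comment string `json:"comment"`
}

// decodeHealth parses a healthz response body.
func decodeHealth(body []byte) (healthResponse, error) {
	var out healthResponse
	err := json.Unmarshal(body, &out)
	return out, err
}

// decodeSwitchover parses a sitemanager response body.
func decodeSwitchover(body []byte) (switchoverResponse, error) {
	var out switchoverResponse
	err := json.Unmarshal(body, &out)
	return out, err
}

// switchoverSucceeded reports whether the side is in the wanted mode
// and the switchover status is "done".
func switchoverSucceeded(r switchoverResponse, wantMode string) bool {
	return r.Mode == wantMode && r.Status == "done"
}

func main() {
	health, err := decodeHealth([]byte(`{"status":"up"}`))
	fmt.Println(health.Status, err)

	state, err := decodeSwitchover([]byte(`{"mode":"standby","status":"done"}`))
	fmt.Println(state.Mode, state.Status, switchoverSucceeded(state, "standby"), err)
}
```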
All the DRD SM endpoints can be secured via Kubernetes JWT Service Account Tokens. A Site Manager Kubernetes token should be specified in the request header. Examples for DRD REST endpoints:
curl -XGET -H "Authorization: Bearer <TOKEN>" localhost:8068/healthz
curl -XGET -H "Authorization: Bearer <TOKEN>" localhost:8068/sitemanager
curl -XPOST -H "Content-Type: application/json" -H "Authorization: Bearer <TOKEN>" localhost:8068/sitemanager -d '{"mode":"<MODE>"}'
Where TOKEN is a Site Manager Kubernetes token.
Authentication is enabled only if both the SITE_MANAGER_SERVICE_ACCOUNT_NAME and SITE_MANAGER_NAMESPACE environment variables are specified. If these environment variables are not specified, authentication is disabled.
If authentication is enabled and the SITE_MANAGER_CUSTOM_AUDIENCE environment variable is specified, the custom audience is applied to the TokenReview request.
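The same authenticated calls can be made from Go with the standard net/http package. The sketch below is only an illustration: newRequest is a hypothetical helper, and the httptest stub stands in for DRD so that the example runs without a cluster. The stub's behaviour, including the 401 status on a wrong token, is an assumption and not a statement about the real server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newRequest builds a request for a DRD endpoint.
// token may be empty when authentication is disabled;
// payload may be nil for GET requests.
func newRequest(method, url, token string, payload map[string]string) (*http.Request, error) {
	var body io.Reader
	if payload != nil {
		data, err := json.Marshal(payload)
		if err != nil {
			return nil, err
		}
		body = bytes.NewReader(data)
	}
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return nil, err
	}
	// Each header is set separately, like the two -H options in curl.
	if payload != nil {
		req.Header.Set("Content-Type", "application/json")
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// Stub standing in for DRD: it checks the token and echoes the mode.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer example-token" {
			w.WriteHeader(http.StatusUnauthorized) // assumed status code
			return
		}
		var in map[string]string
		_ = json.NewDecoder(r.Body).Decode(&in)
		_ = json.NewEncoder(w).Encode(map[string]string{"mode": in["mode"]})
	}))
	defer stub.Close()

	req, err := newRequest(http.MethodPost, stub.URL+"/sitemanager", "example-token",
		map[string]string{"mode": "standby"})
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(out))
}
```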
Custom Resource with default paths:
apiVersion: qubership.org/v1
kind: MyService
metadata:
  name: example-service
  namespace: my-namespace
spec:
  disasterRecovery:
    mode: 'standby'
    noWait: false
status:
  disasterRecoveryStatus:
    comment: 'replication has finished successfully'
    mode: 'standby'
    status: 'done'
Environment Variables:
- name: NAMESPACE
  value: 'my-namespace'
- name: RESOURCE_FOR_DR
  value: 'qubership.org v1 myservices example-service'
- name: USE_DEFAULT_PATHS
  value: 'true'
- name: HEALTH_MAIN_SERVICES_ACTIVE
  value: 'StatefulSet example-service'
Config Map:
kind: ConfigMap
apiVersion: v1
metadata:
  name: example-service-dr-config
  namespace: my-namespace
data:
  mode: 'standby'
  noWait: 'false'
  status_comment: 'replication has finished successfully'
  status_mode: 'standby'
  status_status: 'done'
Environment Variables:
- name: NAMESPACE
  value: 'my-namespace'
- name: RESOURCE_FOR_DR
  value: '"" v1 configmaps example-service-dr-config'
- name: USE_DEFAULT_PATHS
  value: 'false'
- name: DISASTER_RECOVERY_MODE_PATH
  value: 'data.mode'
- name: DISASTER_RECOVERY_NOWAIT_PATH
  value: 'data.noWait'
- name: DISASTER_RECOVERY_STATUS_MODE_PATH
  value: 'data.status_mode'
- name: DISASTER_RECOVERY_STATUS_STATUS_PATH
  value: 'data.status_status'
- name: DISASTER_RECOVERY_STATUS_COMMENT_PATH
  value: 'data.status_comment'
- name: DISASTER_RECOVERY_NOWAIT_AS_STRING
  value: 'true'
- name: HEALTH_MAIN_SERVICES_ACTIVE
  value: 'StatefulSet example-service'
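The dot-separated paths used in both examples can be read as a walk through the nested fields of the resource. The following Go sketch shows one way to resolve such a path and to read the no-wait value as either a boolean or a string. It is an illustration only: lookupPath and noWaitValue are hypothetical helpers, not functions of the DRD module, and the maps stand in for the two resources shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lookupPath walks a nested map along a dot-separated path
// such as "spec.disasterRecovery.mode".
func lookupPath(obj map[string]interface{}, path string) (interface{}, bool) {
	var current interface{} = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := current.(map[string]interface{})
		if !ok {
			return nil, false
		}
		current, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return current, true
}

// noWaitValue accepts the no-wait flag either as a boolean (Custom Resource)
// or as its string form (for example, a ConfigMap data entry).
func noWaitValue(v interface{}) (bool, error) {
	switch t := v.(type) {
	case bool:
		return t, nil
	case string:
		return strconv.ParseBool(t)
	default:
		return false, fmt.Errorf("unsupported noWait type %T", v)
	}
}

func main() {
	customResource := map[string]interface{}{
		"spec": map[string]interface{}{
			"disasterRecovery": map[string]interface{}{"mode": "standby", "noWait": false},
		},
	}
	configMap := map[string]interface{}{
		"data": map[string]interface{}{"mode": "standby", "noWait": "false"},
	}

	mode, _ := lookupPath(customResource, "spec.disasterRecovery.mode")
	raw, _ := lookupPath(configMap, "data.noWait")
	noWait, err := noWaitValue(raw)
	fmt.Println(mode, noWait, err)
}
```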
Disaster Recovery Daemon makes it possible to implement a controller that watches changes of the Disaster Recovery resource for cases when the service does not have its own operator.
A DRD extension is a Go application that starts the server and the controller with a function that implements custom DR logic and/or custom health check logic.
- Make a repository or folder for your Go application.
- In go.mod, add the import github.com/Netcracker/qubership-disaster-recovery-daemon with the actual version.
- Implement Main.go, which starts the server and the controller with functions that implement custom DR logic and custom health check logic.
- Build a Docker image with your Go application.
- Add a new deployment or container for the DRD application to your Helm chart, with the corresponding environment variables.
To start a custom server or controller, you need to provide a configuration object with the necessary parameters.
The configuration can be loaded from any source by implementing the configuration loader interface config.ConfigLoader. By default, DRD provides only the environment variable configuration loader config.DefaultEnvConfigLoader, which uses the corresponding environment variables.
cfgLoader := config.GetDefaultEnvConfigLoader()
cfg, err := config.NewConfig(cfgLoader)
To create and start the DR server, you need the configuration created above:
server.NewServer(cfg).Run()
You can also specify a custom health check function (by default DRD uses pod readiness probes to calculate health):
server.NewServer(cfg).WithHealthFunc(healthFunc, false).Run()
The contract for the health check function is:
WithHealthFunc(healthFunc func(request entity.HealthRequest) (entity.HealthResponse, error), fullHealth bool)
entity.HealthRequest contains fields:
- mode is the current DR mode for the cluster side. Type: string. Values: active, standby or disabled. This is a required field.

entity.HealthResponse contains fields:
- status is the result of the health check operation. Type: string. Values: up, down or degraded. This is a required field.
- comment is a comment on the health check operation. Type: string.

The fullHealth function argument specifies whether the health function overrides the pod readiness health check (if fullHealth: true) or is used as an additional health status (if fullHealth: false).
If fullHealth: false, the following rules are applied:
- All pods ready: UP, additional health: UP -> UP
- Some pods are not ready: DEGRADED, additional health: UP -> DEGRADED
- All pods are down: DOWN, additional health: UP -> DOWN
- All pods ready: UP, additional health: DOWN or DEGRADED -> DEGRADED
NOTE: The health check function is an optional feature. If no function is specified, the default approach with HEALTH_MAIN_SERVICES_ACTIVE... is used.
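The rules for fullHealth: false can be written down as a small pure function. This is a sketch of the documented rules, not DRD's internal code: combineHealth and the constants are hypothetical names. The list above does not cover two combinations (pods DOWN or DEGRADED together with an additional health that is not UP). The sketch assumes DOWN and DEGRADED respectively for them, which is an assumption and should be checked against the real behaviour.

```go
package main

import "fmt"

const (
	up       = "up"
	degraded = "degraded"
	down     = "down"
)

// combineHealth merges the pod readiness status with the additional
// health status returned by a custom health function (fullHealth: false).
func combineHealth(pods, additional string) string {
	switch {
	case pods == down:
		// Documented for additional == up; assumed for other values.
		return down
	case pods == up && additional == up:
		return up
	default:
		// Pods degraded, or pods up with additional health down or degraded.
		return degraded
	}
}

func main() {
	for _, c := range [][2]string{
		{up, up}, {degraded, up}, {down, up}, {up, down}, {up, degraded},
	} {
		fmt.Printf("pods=%s additional=%s -> %s\n", c[0], c[1], combineHealth(c[0], c[1]))
	}
}
```

Keeping the merge logic in a function without side effects makes it easy to cover every documented rule with a table-driven test.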
To create and start the controller, you need the configuration created above and a controller function (drFunction in the snippet below):
controller.NewController(cfg).
	WithFunc(drFunction).
	WithRetry(3, time.Second * 5).
	Run()
WithFunc takes the DR controller function. The DR function must be set for the DR controller.
The contract for the DR controller function is:
func(controllerRequest entity.ControllerRequest) (entity.ControllerResponse, error)
entity.ControllerRequest contains fields:
- mode is the disaster recovery mode from the resource. Type: string. Values: active, standby or disabled.
- noWait is a flag meaning that this is a failover operation. Type: bool.
- eventType is the type of the resource event. Type: string. Values: ADDED, MODIFIED or DELETED.
- object is the original DR resource object.

entity.ControllerResponse contains fields:
- mode is the disaster recovery mode after performing the DR operation. Type: string. Values: active, standby or disabled. This is a required field.
- status is the result of performing the DR operation. Type: string. Values: done, running or failed. This is a required field.
- comment is a comment on the DR operation. Type: string.

The result of the operation is saved to the DR resource.
WithRetry takes the number of attempts and the delay for the retry policy.
The controller retries only if an error happens during function execution. If the function returns the failed status, no retry is performed.
If no retry parameters are specified, the controller calls the function only once.
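The retry semantics described above (retry on error, no retry on a failed status) can be sketched independently of the DRD module. Everything in this example is hypothetical: response, drFunc, runWithRetry and errTemporary are illustration names, and the sketch assumes that the first WithRetry argument is the total number of calls rather than the number of extra retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// response is a stand-in for the controller response fields.
type response struct {
	Mode, Status, Comment string
}

// drFunc is a stand-in for the DR controller function contract.
type drFunc func(mode string) (response, error)

// errTemporary is an example error used by the demo below.
var errTemporary = errors.New("temporary failure")

// runWithRetry calls fn up to attempts times, waiting delay between calls.
// It retries only when fn returns an error; a response with the "failed"
// status is a normal result and is returned immediately.
func runWithRetry(fn drFunc, mode string, attempts int, delay time.Duration) (response, error) {
	if attempts < 1 {
		attempts = 1 // without retry parameters the function is called once
	}
	var resp response
	var err error
	for i := 0; i < attempts; i++ {
		resp, err = fn(mode)
		if err == nil {
			return resp, nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return resp, err
}

func main() {
	calls := 0
	flaky := func(mode string) (response, error) {
		calls++
		if calls < 3 {
			return response{}, errTemporary
		}
		return response{Mode: mode, Status: "done"}, nil
	}
	resp, err := runWithRetry(flaky, "standby", 3, 10*time.Millisecond)
	fmt.Println(resp, err, calls)
}
```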
Below is an example of Main.go for the ConfigMap resource presented above:
package main

import (
	"log"

	"github.com/Netcracker/qubership-disaster-recovery-daemon/api/entity"
	"github.com/Netcracker/qubership-disaster-recovery-daemon/client"
	"github.com/Netcracker/qubership-disaster-recovery-daemon/config"
	"github.com/Netcracker/qubership-disaster-recovery-daemon/controller"
	"github.com/Netcracker/qubership-disaster-recovery-daemon/server"
	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

func main() {
	// Make a config loader
	cfgLoader := config.GetDefaultEnvConfigLoader()

	// Build a config
	cfg, err := config.NewConfig(cfgLoader)
	if err != nil {
		log.Fatalln(err.Error())
	}

	// Easy way to create a kubernetes client if necessary
	kubeClient := client.MakeKubeClientSet()

	// Start DRD server with custom health function inside. This func calculates only additional health status (fullHealth: false)
	go server.NewServer(cfg).
		WithHealthFunc(func(request entity.HealthRequest) (entity.HealthResponse, error) {
			// Do some health check logic, e.g. using kubernetes client kubeClient.CoreV1()...
			_ = kubeClient // placeholder so that the example compiles; replace with real client calls
			return entity.HealthResponse{Status: entity.UP}, nil
		}, false).
		Run()

	// Start DRD controller with external DR function
	controller.NewController(cfg).
		WithFunc(drFunction).
		Run()
}

// DR function implementation
func drFunction(controllerRequest entity.ControllerRequest) (entity.ControllerResponse, error) {
	var configMap v1.ConfigMap

	// Convert unstructured object to expected type (ConfigMap in our case)
	err := runtime.DefaultUnstructuredConverter.FromUnstructured(controllerRequest.Object, &configMap)
	if err != nil {
		return entity.ControllerResponse{}, err
	}

	// Do some DR logic
	return entity.ControllerResponse{
		SwitchoverState: entity.SwitchoverState{
			Mode:    controllerRequest.Mode,
			Status:  entity.DONE,
			Comment: "switchover successfully done",
		},
	}, nil
}