Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-admin controller design #18

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
211 changes: 211 additions & 0 deletions docs/design/Non_Admin_Controller_design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,211 @@
# Non-Admin Backup/Restore Design

## Background
[OADP (Openshift API for Data Protection)](https://github.com/openshift/oadp-operator) Operator currently requires cluster admin access for performing Backup and Restore operations of applications deployed on the OpenShift platform.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here "has" would be better than "requires"?

suppose someone tries to use OADP in non admin way (without NAC). Installs it in a non admin namespace. the non admin user can only create things in that namespace, but it can tell OADP to backup/restore anything (in any namespace) in the cluster, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either is fine, currently we advertise OADP functionalities as cluster admin tasks so that way requires makes sense.

This design intends to enable non-admin users the ability to perform Backup and Restore operations of their application namespace.

## Goals
- Enable non-admin backup operation
- Enable non-admin restore operation

## Non-Goals
- Performance improvements of Backup and Restore Operations
- Parallel Backup and Restore Operations

## Use-Cases

### Backup Operation
- As a non-admin user/namespace owner with administrative privileges for a particular namespace, the user should be able to:
- Create a Backup of the namespace
- View the status of the Backup created for the particular namespace
- Delete the Backup of the namespace

### Restore Operation
- As a non-admin user/namespace owner with administrative privileges for a particular namespace, the user should be able to:
- Create a Restore of the namespace
- View the status of the Restore created for the particular namespace
- Delete the Restore of the namespace

### BackupStorageLocation (BSL) Configuration Operation (optional use case)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention the use case of that BSL in the other operations such as Backup? How this will be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the use-case

- As a non-admin user/namespace owner with administrative privileges for a particular namespace, the user should be able to:
- Create their own BSLs
- View the status of the BSLs created by them
- Delete the BSLs created by them
- Use their own BSLs for Backup/Restore Operations

## Installation
- The Non-Admin Controller (NAC) will be installed via OADP Operator.
- The Data Protection Application (DPA) CR will consist of a root level struct called `nonAdmin` under the features. This struct will have a boolean to `enable` or `disable` the non-admin feature.
- If the `features.nonAdmin.enable` flag is set to `true`, the OADP Operator will install the NAC in OADP Operator's install namespace, by default this feature will be disabled.
- **Multiple OADP:NAC installations:** Due to performance limitations of single threaded nature of Velero controller, it might be desirable in some environments to have more than one instance of OADP:NAC installed.
- We need to ensure that one OADP operator instance is mapped to only one NAC instance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't clear to me. One instance of NAC for one OADP operator, and above we have the multiple OADP:NAC installations, so does it mean we can only scale the NAC by scaling OADP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OADP Operator is NS scoped operator, so you can have multiple OADP Operator installations in the cluster. But each of the OADP Operator will have one NAC controller instance if NAC is enabled. So we need to ensure that if there are multiple OADP/NAC installations in one particular cluster then no 2 NAC controllers should serve once single non-admin application/user namespace.

- Also, the namespaces catered by each NAC instance should be separate and not merge with each other
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not clear - NAC will create namespace per instance? Or it's OADP namespace to which the NAC controller(s) get installed, or if I think what it is:

The non-admin namespaces on which NAC controllers will operate....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes your understanding that The non-admin namespaces on which NAC controllers will operate.... is correct

- We can facilitate this by accepting a list of NS that will be watched/catered by the particular NAC instance. We can do this during the installation time via DPA. There may be 2 ways which we will support of NS filtering:
- Direct List of NS Names `(Spec.features.nonAdmin.NSList[])`
- List of NS to filtered based on a LabelSelector `(Spec.features.nonAdmin.labelSelector)`
- By default, of none of the above are specified then NAC listens to all the Namespaces in the cluster
- We need to error out if NAC NS scopes overlap (i.e. NS watch list overlap)


## Pre-requisites
- **OADP installed**: OADP Operator must be installed and configured to use non-admin controller
- **Non Admin Controller configured**: Data Protection Application (DPA) instance must configure Non Admin Controller to watch user namespace(s), by default it watches all Namespaces
- **RBAC privileges for the user**: User must have the appropriate RBAC privileges to create Non Admin objects within the Namespace where Backup will be taken. An example of such ClusterRole, which may be added to the user with `RoleBinding`:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user has admin rights to a particular namespace, the RBAC is not required right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah makes sense, added a note.

```yaml
# permissions for end users to edit non-admin custom resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/name: clusterrole
app.kubernetes.io/instance: nonadminbackup-editor-role
app.kubernetes.io/component: rbac
app.kubernetes.io/created-by: oadp-nac
app.kubernetes.io/part-of: oadp-nac
app.kubernetes.io/managed-by: kustomize
name: nonadminbackup-editor-role
rules:
- apiGroups:
- nac.oadp.openshift.io
resources:
- nonadminbackups
- nonadminrestores
- nonadminbackupstorageloations
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
- apiGroups:
- nac.oadp.openshift.io
resources:
- nonadminbackups/status
- nonadminrestores/status
- nonadminbackupstoragelocations/status
verbs:
- get

```
**Note:** If users have admin level access/rights to particular namespace then we do not need to specify the privileges listed above.
- **Existence of Namespace:** The existence of namespace prior to performing non-admin backup/restore operations is a must.(for backup it will obviously exist but for restore it is an explicit hard precondition)
## High-Level design

### Components
- **OADP Operator:** OADP is the OpenShift API for Data Protection operator. This open source operator sets up and installs Velero on the OpenShift platform, allowing users to backup and restore applications.
- **Controllers:** The Non-Admin controller will pack the following controllers as part of it:
- **Non-Admin Backup (NAB) Controller:** The responsibilities of the NAB controller are:
shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
- Listen to requests pertaining to Non-Admin Backup CRD
shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
- Process requests pertaining to Non-Admin Backup CRD
- Update Non-Admin Backup CR status with the status/events from Velero Backup CR
- Cascade Any actions performed on Non-Admin Backup CR to corresponding Velero backup CR
- **Non-Admin Restore (NAR) Controller:** The responsibilities of the NAR controller are:
shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
- Listen to requests pertaining to Non-Admin Restore CRD
- Process requests pertaining to Non-Admin Restore CRD
- Update Non-Admin Backup CR status with the status/events from Velero Restore CR
- Cascade Any actions performed on Non-Admin Restore CR to corresponding Velero restore CR
- **Non-Admin BackupStorageLocation (NABSL) controller:** The responsibilities of the NABSL controller are:
- Listen to requests pertaining to Non-Admin BSL CRD
- Process requests pertaining to Non-Admin BSL CRD
- The NABSL controller would cascade the BSL request to Velero controller
- Validate that the NABSL CR is appropriate and relevant secrets from non-admin user are also obtained (note that authenticating the access keys will not be the NABSL controller's responsibility)
- The NABSL request can be create/update/view/delete BSL type requests
- NABSL controller would be responsible for gathering the status from Velero BSL and updating the status of NABSL CR, keeping them in sync
- Every BSL needs user access keys and storage credentials, NABSL controller will be responsible for provisioning them in Velero NS, once they are provided by the non-admin user.
- **Non-Admin Backup Sync controller:** The responsibilities of this controller are:
- Check if the non-admin users NS exists (from the configured ones to watch)
- Now if this NS exists check if there are any Velero backups that are relevant to this NS
- Now check if the relevant Velero Backup's Non-Admin backup instances exist in this NS and sync via recreating them if they do not exist.
- This sync should be done regularly and this sync frequency should be configurable, default can be 30m.
- CRDs: The following CRDs will be provided to Non-Admin users:
- **Non-Admin Backup (NAB) CRD:** This CRD will encapsulate the whole Velero Backup CRD and some additional spec fields that will be needed for non-admin feature.
- **Non-Admin Restore (NAR) CRD:** This CRD will encapsulate the whole Velero Restore CRD and some additional spec fields that will be needed for non-admin feature.
- **Non-Admin BackupStorageLocation (NABSL) CRD:** This CRD will encapsulate the whole Velero BSL CRD and some additional spec fields that will be needed for non-admin feature.


### Implementation details
#### Backup Workflow
- **Non-Admin user creates a Non-Admin backup CR:** The user creates a NonAdminBackup custom resource object in the Namespace on which the backup will run within the Kubernetes cluster. The `NonAdminBackup` schema has the `backupSpec`, which is the same as `Backup` CR from the `velero.io/v1` apiVersion.

```yaml
apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
name: example
namespace: user-namespace
spec:
backupSpec: {}
```
- **NAB controller reconciles on this NAB CR:** The NonAdminBackup controller continuously reconciles the NonAdminBackup object's desired state with the actual state in the cluster.
- **NAB controller validates the NAB CR and then creates a corresponding Velero Backup CR:** When the NonAdminBackup controller detects a new or modified NonAdminBackup object, it creates or updates a corresponding Velero Backup object within the OADP Namespace using the information provided in the `backupSpec` field of the NonAdminBackup object. The resulting Backup object is named as `nab-<namespace>-<hash>`, where the `<namespace>` is the NonAdminBackup namespace and the `<hash>` is computed from the original NonAdminBackup name. The resulting Backup object is labeled and annotated with the following additional metadata:

```yaml
metadata:
annotations:
openshift.io/oadp-nab-origin-name: <NonAdminBackup name>
openshift.io/oadp-nab-origin-namespace: <NonAdminBackup Namespace>
openshift.io/oadp-nab-origin-uuid: <NonAdminBackup UUID>
labels:
app.kubernetes.io/managed-by: <OADP NonAdminController id>
openshift.io/oadp: 'True'
```
- **Velero runs Backup**: Velero executes the backup operation based on the configuration specified in the Velero Backup object. Velero updates the status of the Velero Backup object to reflect the outcome of the backup process.
- **Reconcile loop updates NonAdminBackup object Status**: Upon detecting changes in the status of the Velero Backup object, the NonAdminBackup controller's reconciliation loop updates the Status field of the corresponding NonAdminBackup object with the updated status from the Velero Backup object.

![NAB-Backup Workflow Diagram](../images/nab-backup-workflow.jpg)
![NAB-Controller Backup Details Diagram](../images/Backup-Workflow-Details.jpg)

#### Restore Workflow
- **Namespace exists:** Hard precondition that the NS exists and non-admin user has the appropriate access
- **Backup Sync controller syncs the Non-admin Backup CRs:** The Backup-Sync controller ensures that the NS relevant backups are synced and Non-admin backup CRs exists for non-admin users to refer them for restore operations.
- **Non-Admin user creates the Non-Admin restore CR:** The user creates NonAdminRestore custom resource object in the Namespace on which the restore will run within the Kubernetes cluster. The `NonAdminRestore` schema has the `restoreSpec`, which is the same as `Restore` CR from the `velero.io/v1` apiVersion. Note that the non-admin user will use non-admin backup name (and not Velero backup name) in non-admin restore spec.
- **NAR controller reconciles on this NAR CR:** The NonAdminRestore controller continuously reconciles the NonAdminRestore object's desired state with the actual state in the cluster.
- **NAR controller validates the NAR CR and then creates a corresponding Velero Restore CR:** When the NonAdminRestore controller detects a new or modified NonAdminRestore object, it creates the corresponding Velero Restore object within the OADP Namespace using the information provided in the `restoreSpec` field of the NonAdminRestore object. The resulting Restore object is named as `nar-<namespace>-<hash>`, where the `<namespace>` is the NonAdminRestore namespace and the `<hash>` is computed from the original NonAdminRestore name. The resulting Restore object is labeled and annotated with the following additional metadata:

```yaml
metadata:
annotations:
openshift.io/oadp-nar-origin-name: <NonAdminRestore name>
openshift.io/oadp-nar-origin-namespace: <NonAdminRestore Namespace>
openshift.io/oadp-nar-origin-uuid: <NonAdminRestore UUID>
labels:
app.kubernetes.io/managed-by: <OADP NonAdminController id>
openshift.io/oadp: 'True'
```
- **Velero runs Restore**: Velero executes the restore operation based on the configuration specified in the Velero Restore object. Velero updates the status of the Velero Restore object to reflect the outcome of the restore process.
- **Reconcile loop updates NonAdminRestore object Status**: Upon detecting changes in the status of the Velero Restore object, the NonAdminRestore controller's reconciliation loop updates the Status field of the corresponding NonAdminRestore object with the updated status from the Velero Restore object.

- // TODO: Diagram remaining

#### Queuing mechanism on NAB/NAR CR status
- We will introduce a Queue status on NAB/NAR CR status in order to give some transparency on what is the current backup/restore request status or a general idea when it will get processed so that the non-admin users are not left to wonder about what's happening with their backup/restore.
- The queuing status would be added to the NAB/NAR CR status, and it would give a general idea about how many backup/restore request are still remaining to be processed before their own backup/restore operation gets triggered.
- For example, in a singular OADP:NAC environment, lets say there are 5 non-admin users sharing this single OADP/NAC/Velero instance and all of them triggered backups.
- Current Velero Backup statuses in this example are: Backup-1, Backup-2: Complete; Backup-3: InProgress; Backup-4, Backup-5: New
- Then Non-Admin Backup Status for Backup-1 and Backup-2 will have `QueuedStatus: Completed`
- Non-Admin Backup Status for Backup-3 will have `QueuedStatus: Processing`
- Non-Admin Backup Status for Backup-4 will have `QueuedStatus: 1 Processing`
- Non-Admin Backup Status for Backup-5 will have `QueuedStatus: 1 Processing, 1 New` Or `QueuedStatus: 2 Backups remaining`
// TODO: Diagram remaining

#### Validating Webhooks to Ensure more transparency and debug-ability
Currently, the velero CLI is a very useful tool, particularly the describe backup option. This option comes in pretty handy to gain more detailed insights on what is getting backed up (ResourceList) or which operations are in progress (async datamover operations) or details like how many items are still remaining to be backed up and much more.
We will be ensuring that non-admin users have view/read access to the Velero CRs that are only relevant to them or created by them.

// TODO: Diagram remaining

#### Control over the Velero spec exposed via Non-Admin CRs
shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
The Velero Backup/BSL/Restore CRs consists of a plethora of Backup Spec options. It might be desirable to have a control over the spec that is exposed or may be allowed listed for non-admin users.
For instance, Backup Spec IncludedNamespaces should by default be the non-admin user's namespace as this is a scoped backup operation
shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
Another instance is mentioned in the relevant discussion of cluster admins wanting to set a default backup TTL for all the non-admin users
Or maybe the cluster-admins want to enforce that only fs-backups should be used for volumes, so defaultVolumesToFSBackup should be true for all non-admin backups.
In order to implement this we could accept a configmap from cluster-admin users in OADP install NS and this configmap would be used to perform run-time validations against the Non-Admin CR instances.

// TODO: Approach Discussion

## Open Questions and Know Limitations
- Velero command and pod logs
- Multiple instance of OADP Operator
- Should BSL configuration be optional like cluster-admin could opt for shard BSLs or BYOB (Bring your own Bucket/BSL)
Binary file added docs/images/Backup-Workflow-Details.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/images/nab-backup-workflow.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading