Use Velero to take a backup of CKF on top of EKS #1197
Thank you for reporting your feedback to us! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6758.
Update
We managed to do the backup/restore of the CKF cluster on EKS:
BACKUP_NAME=backup-$(date '+%Y-%m-%d--%H-%M')
velero backup create \
$BACKUP_NAME \
--include-namespaces kubeflow-user-example-com \
--include-cluster-scoped-resources profiles.kubeflow.org
Adding the --include-cluster-scoped-resources profiles.kubeflow.org flag includes the cluster-scoped Profile resource in the backup. The profile and the namespace can then be deleted and restored as follows:
kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com
velero restore create --from-backup $BACKUP_NAME
The notebooks, PVs, and PVCs will be restored and the notebooks will be fully functional. However, the restore will complete with the following warnings:
Velero: <none>
Cluster: could not restore, CustomResourceDefinition "authorizationpolicies.security.istio.io" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, CustomResourceDefinition "notebooks.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, CustomResourceDefinition "poddefaults.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, CustomResourceDefinition "profiles.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, CustomResourceDefinition "virtualservices.networking.istio.io" already exists. Warning: the in-cluster version is different than the backed-up version
Namespaces:
test: could not restore, Secret "mlpipeline-minio-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ConfigMap "istio-ca-root-cert" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ConfigMap "metadata-grpc-configmap" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ReplicaSet "ml-pipeline-ui-artifact-5fd49cc64c" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ReplicaSet "ml-pipeline-visualizationserver-7f8545bdb9" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Endpoints "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Endpoints "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Service "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Service "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Deployment "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Deployment "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "default-editor" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "default-viewer" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "namespaceAdmin" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, StatefulSet "test" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, VirtualService "notebook-test-test" already exists. Warning: the in-cluster version is different than the backed-up version
This might be caused by the fact that some custom resources of CKF are responsible for creating others, so during the restoration Velero will attempt to create resources that are already managed by previously restored objects.
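As a sketch of how these warnings can be inspected after the fact, the restore status and details can be pulled from Velero (the restore name is a placeholder; Velero derives it from the backup name and a timestamp):
velero restore get
velero restore describe <restore-name> --details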
Update
We ran another round of tests for backup/restore, limiting the scope of the resources being backed up. That helps avoid the conflicts caused by objects that own/create other resources, like the Notebook Controller. The backup will now be separated into two objects:
velero backup create profile-backup --include-resources profiles.kubeflow.org
velero backup create notebooks-backup --include-resources notebooks.kubeflow.org,persistentvolumeclaims,persistentvolumes
Additionally, the profile and the namespace can now be deleted and restored as follows:
kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com
velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup notebooks-backup
Limiting the scope of the resources being backed up to user namespaces will avoid the conflicts mentioned in the previous update (since now it tries to restore PVCs in the
The warnings for the notebooks restore indicate that some PVCs couldn't be restored because they already exist. That's caused by the fact that we were snapshotting all the PVCs in all namespaces, including
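A quick sanity check after the second restore is to confirm the Notebooks and PVCs are back in the user namespace (a sketch, using the profile namespace from these tests):
kubectl get notebooks.kubeflow.org -n kubeflow-user-example-com
kubectl get pvc -n kubeflow-user-example-com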
|
Update
For the next round of tests, we deployed MLFlow alongside Kubeflow and ran the following Notebook, which is a good E2E example for the backup/restore capabilities of Velero. The backup will still consist of two parts:
velero backup create profile-backup --include-resources profiles.kubeflow.org
velero backup create crds-backup --exclude-namespaces kubeflow --include-resources persistentvolumeclaims,persistentvolumes,authcodes.dex.coreos.com,authorizationpolicies.security.istio.io,certificaterequests.cert-manager.io,certificates.cert-manager.io,certificates.networking.internal.knative.dev,challenges.acme.cert-manager.io,clusterdomainclaims.networking.internal.knative.dev,clusterissuers.cert-manager.io,clusterservingruntimes.serving.kserve.io,clusterstoragecontainers.serving.kserve.io,clusterworkflowtemplates.argoproj.io,compositecontrollers.metacontroller.k8s.io,configurations.serving.knative.dev,controllerrevisions.metacontroller.k8s.io,cronworkflows.argoproj.io,decoratorcontrollers.metacontroller.k8s.io,destinationrules.networking.istio.io,domainmappings.serving.knative.dev,envoyfilters.networking.istio.io,experiments.kubeflow.org,gateways.networking.istio.io,images.caching.internal.knative.dev,inferencegraphs.serving.kserve.io,inferenceservices.serving.kserve.io,ingresses.networking.internal.knative.dev,issuers.cert-manager.io,metrics.autoscaling.internal.knative.dev,mpijobs.kubeflow.org,mxjobs.kubeflow.org,notebooks.kubeflow.org,orders.acme.cert-manager.io,paddlejobs.kubeflow.org,peerauthentications.security.istio.io,podautoscalers.autoscaling.internal.knative.dev,poddefaults.kubeflow.org,profiles.kubeflow.org,proxyconfigs.networking.istio.io,pvcviewers.kubeflow.org,pytorchjobs.kubeflow.org,requestauthentications.security.istio.io,revisions.serving.knative.dev,routes.serving.knative.dev,scheduledworkflows.kubeflow.org,serverlessservices.networking.internal.knative.dev,serviceentries.networking.istio.io,services.serving.knative.dev,servingruntimes.serving.kserve.io,sidecars.networking.istio.io,suggestions.kubeflow.org,telemetries.telemetry.istio.io,tensorboards.tensorboard.kubeflow.org,tfjobs.kubeflow.org,trainedmodels.serving.kserve.io,trials.kubeflow.org,viewers.kubeflow.org,virtualservices.networking.istio.io,wasmplugins.extensions.istio.io,workflowartifactgctasks.argoproj.io,workfloweventbindings.argoproj.io,workflows.argoproj.io,workflowtaskresults.argoproj.io,workflowtasksets.argoproj.io,workflowtemplates.argoproj.io,workloadentries.networking.istio.io,workloadgroups.networking.istio.io,xgboostjobs.kubeflow.org
Now, the profile and the namespace can be deleted and restored as follows:
kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com
velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup crds-backup
The first restore succeeds, and the second one partially fails:
velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
crds-backup-20250129194215 crds-backup PartiallyFailed 2025-01-29 19:42:16 +0300 MSK 2025-01-29 19:42:18 +0300 MSK 1 13 2025-01-29 19:42:16 +0300 MSK <none>
profile-backup-20250129194138 profile-backup Completed 2025-01-29 19:41:40 +0300 MSK 2025-01-29 19:41:40 +0300 MSK 0 0 2025-01-29 19:41:40 +0300 MSK <none>
The second restore will have the following warnings and errors:
velero restore describe crds-backup-20250129194215 --details
...
Warnings:
Velero: <none>
Cluster: <none>
Namespaces:
controller-eks-controller: could not restore, PersistentVolumeClaim "storage-controller-0" already exists. Warning: the in-cluster version is different than the backed-up version
demo: could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, AuthorizationPolicy "ns-owner-access-istio" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "mlflow-server-access-minio" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "mlflow-server-minio" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Revision "wine-regressor3-predictor-00001" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Route "wine-regressor3-predictor" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ServerlessService "wine-regressor3-predictor-00001" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, Service "wine-regressor3-predictor" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, VirtualService "notebook-demo-demo" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, VirtualService "wine-regressor3-predictor-ingress" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, VirtualService "wine-regressor3-predictor-mesh" already exists. Warning: the in-cluster version is different than the backed-up version
Errors:
Velero: <none>
Cluster: <none>
Namespaces:
demo: error restoring configurations.serving.knative.dev/demo/wine-regressor3-predictor: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/service
...
As in the previous update, the Notebook was restored successfully, and the contents are present. The MLFlow UI still correctly reports experiments and models.
Issues
Kubeflow UI Frontend
The UI is now broken in some places:
E2E example
The Notebook example will now fail to run the model:
The status of the InferenceService is the following:
kubectl get inferenceservices.serving.kserve.io -n kubeflow-user-example-com wine-regressor3 -o yaml
...
status:
components:
predictor:
latestCreatedRevision: wine-regressor3-predictor-00001
conditions:
- lastTransitionTime: "2025-01-29T16:42:19Z"
reason: PredictorConfigurationReady not ready
severity: Info
status: "False"
type: LatestDeploymentReady
- lastTransitionTime: "2025-01-29T16:42:19Z"
message: 'Revision "wine-regressor3-predictor-00001" failed with message: There
is an existing PodAutoscaler "wine-regressor3-predictor-00001" that we do not
own..'
reason: RevisionFailed
severity: Info
status: "False"
type: PredictorConfigurationReady
- lastTransitionTime: "2025-01-29T16:42:19Z"
message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
reason: RevisionMissing
status: "False"
type: PredictorReady
- lastTransitionTime: "2025-01-29T16:42:19Z"
message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
reason: RevisionMissing
severity: Info
status: "False"
type: PredictorRouteReady
- lastTransitionTime: "2025-01-29T16:42:19Z"
message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
reason: RevisionMissing
status: "False"
type: Ready
- lastTransitionTime: "2025-01-29T16:42:19Z"
reason: PredictorRouteReady not ready
severity: Info
status: "False"
type: RoutesReady
modelStatus:
states:
activeModelState: ""
targetModelState: Pending
transitionStatus: InProgress
observedGeneration: 1
...
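For reference, the Knative-owned objects behind the "that we do not own" error can be listed directly in the user namespace (a sketch; the resource kinds are taken from the CRD list above):
kubectl get podautoscalers.autoscaling.internal.knative.dev -n kubeflow-user-example-com
kubectl get revisions.serving.knative.dev -n kubeflow-user-example-com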
Update
We reran the previous attempt with the E2E Notebook Example, but this time we removed the following CRDs from the second backup:
certificates.networking.internal.knative.dev
clusterdomainclaims.networking.internal.knative.dev
configurations.serving.knative.dev
domainmappings.serving.knative.dev
images.caching.internal.knative.dev
ingresses.networking.internal.knative.dev
metrics.autoscaling.internal.knative.dev
podautoscalers.autoscaling.internal.knative.dev
revisions.serving.knative.dev
routes.serving.knative.dev
serverlessservices.networking.internal.knative.dev
services.serving.knative.dev
virtualservices.networking.istio.io
Those changes helped resolve the errors during the second restore in the last run:
velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
crds-backup-20250129222819 crds-backup Completed 2025-01-29 22:28:20 +0300 MSK 2025-01-29 22:28:21 +0300 MSK 0 6 2025-01-29 22:28:20 +0300 MSK <none>
profile-backup-20250129222622 profile-backup Completed 2025-01-29 22:26:23 +0300 MSK 2025-01-29 22:26:24 +0300 MSK 0 0 2025-01-29 22:26:23 +0300 MSK <none>
There are only some warnings that seem to be harmless and might be resolved in the future:
...
Warnings:
Velero: <none>
Cluster: <none>
Namespaces:
controller-eks-controller: could not restore, PersistentVolumeClaim "storage-controller-0" already exists. Warning: the in-cluster version is different than the backed-up version
test: could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, AuthorizationPolicy "ns-owner-access-istio" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "mlflow-server-access-minio" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, PodDefault "mlflow-server-minio" already exists. Warning: the in-cluster version is different than the backed-up version
...
The ISVC was also correctly restored without any status errors. The model can be successfully rerun from the E2E Example Notebook. Unfortunately, the UI issues persisted.
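A minimal check that the InferenceService came back healthy after this restore (assuming the same namespace and ISVC name as in the earlier run):
kubectl get inferenceservices.serving.kserve.io -n kubeflow-user-example-com wine-regressor3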
After a bit of debugging with @Deezzir we got to the root cause of why the UI was complaining with the above error #1197 (comment), confirming Yuri's intuition that it's because of an empty
Root Cause Analysis
We also noticed, by looking at KFP API requests from the browser, that indeed when we were restoring the Workflow this is when the
Notes
In the code they use the term
This is something, though, that has increased the complexity of the backend, and upstream plans to remove Tekton kubeflow/pipelines#11438.
So at this point:
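For context, the Argo Workflow objects brought back by the restore can be listed in the user namespace to see what KFP ends up reading (a sketch, using the namespace from these tests):
kubectl get workflows.argoproj.io -n kubeflow-user-example-com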
Final Update
After the debugging session with @kimwnasptd, we decided to remove the following CRDs from the second backup:
workflowartifactgctasks.argoproj.io
workfloweventbindings.argoproj.io
workflows.argoproj.io
workflowtaskresults.argoproj.io
workflowtasksets.argoproj.io
workflowtemplates.argoproj.io
clusterworkflowtemplates.argoproj.io
So, the final two backup commands are the following:
Backup of the profile:
velero backup create profile-backup --include-resources profiles.kubeflow.org
Backup of the CRDs:
velero backup create crds-backup --exclude-namespaces kubeflow --include-resources persistentvolumeclaims,persistentvolumes,secrets,authcodes.dex.coreos.com,authorizationpolicies.security.istio.io,certificaterequests.cert-manager.io,certificates.cert-manager.io,challenges.acme.cert-manager.io,clusterissuers.cert-manager.io,clusterservingruntimes.serving.kserve.io,clusterstoragecontainers.serving.kserve.io,compositecontrollers.metacontroller.k8s.io,controllerrevisions.metacontroller.k8s.io,cronworkflows.argoproj.io,decoratorcontrollers.metacontroller.k8s.io,destinationrules.networking.istio.io,envoyfilters.networking.istio.io,experiments.kubeflow.org,gateways.networking.istio.io,inferencegraphs.serving.kserve.io,inferenceservices.serving.kserve.io,issuers.cert-manager.io,mpijobs.kubeflow.org,mxjobs.kubeflow.org,notebooks.kubeflow.org,orders.acme.cert-manager.io,paddlejobs.kubeflow.org,peerauthentications.security.istio.io,poddefaults.kubeflow.org,profiles.kubeflow.org,proxyconfigs.networking.istio.io,pvcviewers.kubeflow.org,pytorchjobs.kubeflow.org,requestauthentications.security.istio.io,scheduledworkflows.kubeflow.org,serviceentries.networking.istio.io,servingruntimes.serving.kserve.io,sidecars.networking.istio.io,suggestions.kubeflow.org,telemetries.telemetry.istio.io,tensorboards.tensorboard.kubeflow.org,tfjobs.kubeflow.org,trainedmodels.serving.kserve.io,trials.kubeflow.org,viewers.kubeflow.org,wasmplugins.extensions.istio.io,workloadentries.networking.istio.io,workloadgroups.networking.istio.io,xgboostjobs.kubeflow.org
Using the above CRD list and restoring with the steps described here, we successfully restored the user namespace.
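For completeness, the restore side follows the same sequence used in the earlier updates (a sketch; the namespace/profile name is the one used throughout these tests):
kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com
velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup crds-backup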
Great description @Deezzir! Marking this as done since we managed to perform the backup / restore. Great work!!
Context
This is related to #1097
In order to verify that we can successfully take a backup of Charmed Kubeflow with Velero, we'll target an EKS cluster.
For this, we'll focus on using the Velero plugin for AWS to take snapshots of the data, rather than the file system backup functionality.
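As a sketch of what that setup looks like with the AWS plugin and volume snapshots enabled (the bucket, region, credentials file, and plugin version below are placeholders, not values from this issue):
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket <backup-bucket> \
  --backup-location-config region=<aws-region> \
  --snapshot-location-config region=<aws-region> \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true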
What needs to get done
After this we need to ensure that:
Definition of Done