
Use Velero to take a backup of CKF on top of EKS #1197

Closed
kimwnasptd opened this issue Jan 17, 2025 · 8 comments
Labels
enhancement New feature or request

Comments

@kimwnasptd
Contributor

Context

This is related to #1097

In order to verify that we can successfully take a backup of Charmed Kubeflow with Velero we'll target an EKS cluster.

For this, we'll focus on using the Velero plugin for AWS to take snapshots of the data, rather than the file system backup (FSB) functionality.
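
For reference, a minimal sketch of installing Velero with the AWS plugin and snapshot support (the bucket, region, plugin version and credentials file below are placeholders, not values from this issue):

velero install \
	--provider aws \
	--plugins velero/velero-plugin-for-aws:v1.10.0 \
	--bucket $BUCKET \
	--backup-location-config region=$REGION \
	--snapshot-location-config region=$REGION \
	--secret-file ./credentials-velero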

What needs to get done

  1. Deploy CKF in EKS
  2. Create a Notebook, with a PVC, and create some files
  3. Install velero and take a backup in an S3 bucket
  4. Delete the test Profile
  5. Do a restore with Velero

After this we need to ensure that (verification commands are sketched below the list):

  1. The Notebook gets re-created
  2. The PVC gets re-created
  3. The contents of the PVC are there
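
A quick verification sketch, assuming the default kubeflow-user-example-com profile and a placeholder Notebook pod name:

# the Notebook and its PVC should be back after the restore
kubectl get notebooks.kubeflow.org -n kubeflow-user-example-com
kubectl get pvc -n kubeflow-user-example-com
# and the files should still be inside the PVC (home dir assumed to be /home/jovyan)
kubectl exec -n kubeflow-user-example-com <notebook-pod> -- ls /home/jovyan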

Definition of Done

  1. We manage to run the above steps for doing a backup and restore
  2. Document all the interim errors and debugging we did
@kimwnasptd kimwnasptd added the enhancement New feature or request label Jan 17, 2025

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6758.

This message was autogenerated

@Deezzir

Deezzir commented Jan 17, 2025

Update

We managed to do the backup/restore of the CKF cluster on EKS:

  1. The backup can be made with the following command:
BACKUP_NAME=backup-$(date '+%Y-%m-%d--%H-%M')
velero backup create \
	$BACKUP_NAME \
	--include-namespaces kubeflow-user-example-com \
	--include-cluster-scoped-resources profiles.kubeflow.org

Adding the --default-volumes-to-fs-backup flag causes the restore to hang. In any case, the AWS plugin docs recommend not using the file system backup (FSB) but relying on the plugin's volume snapshots instead.
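
For reference, this is the variant that hung on restore (the same backup as above, with the FSB flag added):

# do not combine this with the AWS plugin's volume snapshots; the subsequent restore hung in our tests
velero backup create \
	$BACKUP_NAME \
	--include-namespaces kubeflow-user-example-com \
	--include-cluster-scoped-resources profiles.kubeflow.org \
	--default-volumes-to-fs-backup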

  2. The profile and the namespace can be deleted and restored as follows:
kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com

velero restore create --from-backup $BACKUP_NAME

The notebooks, PVs and PVCs will be restored and the notebooks will be fully functional. However, the restore status will be PartiallyFailed, and the logs indicate conflicts when restoring some resources of the user profile/namespace.
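
The status and the warnings below can be inspected with Velero's describe command (a sketch; the restore name is a placeholder):

velero restore get
velero restore describe <restore-name> --details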

Warnings:
  Velero:     <none>
  Cluster:  could not restore, CustomResourceDefinition "authorizationpolicies.security.istio.io" already exists. Warning: the in-cluster version is different than the backed-up version
            could not restore, CustomResourceDefinition "notebooks.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
            could not restore, CustomResourceDefinition "poddefaults.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
            could not restore, CustomResourceDefinition "profiles.kubeflow.org" already exists. Warning: the in-cluster version is different than the backed-up version
            could not restore, CustomResourceDefinition "virtualservices.networking.istio.io" already exists. Warning: the in-cluster version is different than the backed-up version
  Namespaces:
    test:  could not restore, Secret "mlpipeline-minio-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, ConfigMap "istio-ca-root-cert" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, ConfigMap "metadata-grpc-configmap" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, ReplicaSet "ml-pipeline-ui-artifact-5fd49cc64c" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, ReplicaSet "ml-pipeline-visualizationserver-7f8545bdb9" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Endpoints "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Endpoints "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Service "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Service "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Deployment "ml-pipeline-ui-artifact" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, Deployment "ml-pipeline-visualizationserver" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, RoleBinding "default-editor" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, RoleBinding "default-viewer" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, RoleBinding "namespaceAdmin" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, StatefulSet "test" already exists. Warning: the in-cluster version is different than the backed-up version
           could not restore, VirtualService "notebook-test-test" already exists. Warning: the in-cluster version is different than the backed-up version

This might be caused by the fact that some CKF custom resources are responsible for creating others, so during the restore Velero attempts to create resources that are already managed (and re-created) by the previously restored objects.

@Deezzir

Deezzir commented Jan 20, 2025

Update

We ran another round of backup/restore tests, limiting the scope of the resources being backed up. That helps avoid the conflicts caused by objects that own/create other resources, like the Notebook controller:

The backup will now be separated into two objects:

  1. The first one will include the profiles.kubeflow.org resource:
velero backup create profile-backup --include-resources profiles.kubeflow.org
  2. The second one will include the notebooks.kubeflow.org, PersistentVolumeClaims and PersistentVolumes:
velero backup create notebooks-backup --include-resources notebooks.kubeflow.org,persistentvolumeclaims,persistentvolumes

Additionally, the --exclude-namespaces flag can be used to exclude resources from the kubeflow namespace.
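
For example, a sketch combining that flag with the second backup above:

velero backup create notebooks-backup \
	--exclude-namespaces kubeflow \
	--include-resources notebooks.kubeflow.org,persistentvolumeclaims,persistentvolumes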

Now the profile and the namespace can be deleted and restored as follows:

kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com

velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup notebooks-backup

Limiting the scope of the resources being backed up avoids the conflicts mentioned in the previous update (the only remaining overlap is the PVCs in the kubeflow namespace), leaving two successful restores and the notebooks fully restored and functional:

NAME                              BACKUP             STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
notebooks-backup-20250120191948   notebooks-backup   Completed   2025-01-20 19:19:50 +0300 MSK   2025-01-20 19:19:51 +0300 MSK   0        5          2025-01-20 19:19:50 +0300 MSK   <none>
profile-backup-20250120191918     profile-backup     Completed   2025-01-20 19:19:20 +0300 MSK   2025-01-20 19:19:20 +0300 MSK   0        0          2025-01-20 19:19:20 +0300 MSK   <none>

The warnings for the notebooks restore indicate that some PVCs couldn't be restored as they already exist. That's caused by the fact that we were snapshotting all the PVCs, in all namespaces including kubeflow, in the second backup object:

# velero restore describe notebooks-backup-20250120190417 --details
...
Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    kubeflow:                   could not restore, PersistentVolumeClaim "katib-db-database-0255af6d-katib-db-0" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PersistentVolumeClaim "kfp-db-database-44dd8a91-kfp-db-0" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PersistentVolumeClaim "minio-data-a2f0a44e-minio-0" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PersistentVolumeClaim "mlmd-mlmd-data-40242aa3-mlmd-0" already exists. Warning: the in-cluster version is different than the backed-up version
    controller-eks-controller:  could not restore, PersistentVolumeClaim "storage-controller-0" already exists. Warning: the in-cluster version is different than the backed-up version
...
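
To confirm which PVCs ended up in the second backup, the backup itself can be inspected (a sketch, using the backup name from above):

velero backup describe notebooks-backup --details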

@Deezzir

Deezzir commented Jan 29, 2025

Update

For the next round of tests, we deployed MLFlow alongside Kubeflow and ran the following Notebook, which is a good E2E example for the backup/restore capabilities of Velero.

The backup will still consist of two parts:

  1. The first one will include the profiles.kubeflow.org resource:
velero backup create profile-backup --include-resources profiles.kubeflow.org
  2. The second one will include the PersistentVolumeClaims, PersistentVolumes and all CRDs created by Kubeflow. The list can be found here. Additionally, we will exclude the kubeflow namespace from the backup to avoid copying extra PVs and PVCs:
velero backup create crds-backup --exclude-namespaces kubeflow --include-resources persistentvolumeclaims,persistentvolumes,authcodes.dex.coreos.com,authorizationpolicies.security.istio.io,certificaterequests.cert-manager.io,certificates.cert-manager.io,certificates.networking.internal.knative.dev,challenges.acme.cert-manager.io,clusterdomainclaims.networking.internal.knative.dev,clusterissuers.cert-manager.io,clusterservingruntimes.serving.kserve.io,clusterstoragecontainers.serving.kserve.io,clusterworkflowtemplates.argoproj.io,compositecontrollers.metacontroller.k8s.io,configurations.serving.knative.dev,controllerrevisions.metacontroller.k8s.io,cronworkflows.argoproj.io,decoratorcontrollers.metacontroller.k8s.io,destinationrules.networking.istio.io,domainmappings.serving.knative.dev,envoyfilters.networking.istio.io,experiments.kubeflow.org,gateways.networking.istio.io,images.caching.internal.knative.dev,inferencegraphs.serving.kserve.io,inferenceservices.serving.kserve.io,ingresses.networking.internal.knative.dev,issuers.cert-manager.io,metrics.autoscaling.internal.knative.dev,mpijobs.kubeflow.org,mxjobs.kubeflow.org,notebooks.kubeflow.org,orders.acme.cert-manager.io,paddlejobs.kubeflow.org,peerauthentications.security.istio.io,podautoscalers.autoscaling.internal.knative.dev,poddefaults.kubeflow.org,profiles.kubeflow.org,proxyconfigs.networking.istio.io,pvcviewers.kubeflow.org,pytorchjobs.kubeflow.org,requestauthentications.security.istio.io,revisions.serving.knative.dev,routes.serving.knative.dev,scheduledworkflows.kubeflow.org,serverlessservices.networking.internal.knative.dev,serviceentries.networking.istio.io,services.serving.knative.dev,servingruntimes.serving.kserve.io,sidecars.networking.istio.io,suggestions.kubeflow.org,telemetries.telemetry.istio.io,tensorboards.tensorboard.kubeflow.org,tfjobs.kubeflow.org,trainedmodels.serving.kserve.io,trials.kubeflow.org,viewers.kubeflow.org,virtualservices.networking.istio.io,wasmplugins.extensions.istio.io,workflowartifactgctasks.argoproj.io,workfloweventbindings.argoproj.io,workflows.argoproj.io,workflowtaskresults.argoproj.io,workflowtasksets.argoproj.io,workflowtemplates.argoproj.io,workloadentries.networking.istio.io,workloadgroups.networking.istio.io,xgboostjobs.kubeflow.org

Now, the profile and the namespace can be deleted and restored as follows:

kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com

velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup crds-backup

The first restore succeeds, and the second one partially fails.

velero restore get
                                          
NAME                            BACKUP           STATUS            STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
crds-backup-20250129194215      crds-backup      PartiallyFailed   2025-01-29 19:42:16 +0300 MSK   2025-01-29 19:42:18 +0300 MSK   1        13         2025-01-29 19:42:16 +0300 MSK   <none>
profile-backup-20250129194138   profile-backup   Completed         2025-01-29 19:41:40 +0300 MSK   2025-01-29 19:41:40 +0300 MSK   0        0          2025-01-29 19:41:40 +0300 MSK   <none>

The second restore will have the following warnings and errors:

velero restore describe crds-backup-20250129194215 --details

...
Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    controller-eks-controller:  could not restore, PersistentVolumeClaim "storage-controller-0" already exists. Warning: the in-cluster version is different than the backed-up version
    demo:                       could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, AuthorizationPolicy "ns-owner-access-istio" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "mlflow-server-access-minio" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "mlflow-server-minio" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, Revision "wine-regressor3-predictor-00001" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, Route "wine-regressor3-predictor" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, ServerlessService "wine-regressor3-predictor-00001" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, Service "wine-regressor3-predictor" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, VirtualService "notebook-demo-demo" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, VirtualService "wine-regressor3-predictor-ingress" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, VirtualService "wine-regressor3-predictor-mesh" already exists. Warning: the in-cluster version is different than the backed-up version

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    demo:  error restoring configurations.serving.knative.dev/demo/wine-regressor3-predictor: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/service
...

As in the previous update, the Notebook was restored successfully and its contents are present. The wine-regressor3 got re-created as well, but with some status errors.

MLFlow UI still correctly reports experiments and models.

Issues

Kubeflow UI Frontend

The UI is now broken in some places:

  1. The Experiments tab will fail to load details of the active experiments

[Image: screenshot of the browser console logs]

  2. The Run tab will fail to load similarly.

E2E example

The Notebook example will now fail to run the model:

[Image: screenshot of the cell output of the E2E Notebook example]

The status of the wine-regressor3 ISVC resource indicates that there is an error with wine-regressor3-predictor and the Autoscaler, which apparently is no longer owned by the correct resource:

kubectl get inferenceservices.serving.kserve.io -n kubeflow-user-example-com wine-regressor3 -o yaml

...
status:
  components:
    predictor:
      latestCreatedRevision: wine-regressor3-predictor-00001
  conditions:
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    reason: PredictorConfigurationReady not ready
    severity: Info
    status: "False"
    type: LatestDeploymentReady
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    message: 'Revision "wine-regressor3-predictor-00001" failed with message: There
      is an existing PodAutoscaler "wine-regressor3-predictor-00001" that we do not
      own..'
    reason: RevisionFailed
    severity: Info
    status: "False"
    type: PredictorConfigurationReady
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
    reason: RevisionMissing
    status: "False"
    type: PredictorReady
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
    reason: RevisionMissing
    severity: Info
    status: "False"
    type: PredictorRouteReady
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    message: Configuration "wine-regressor3-predictor" does not have any ready Revision.
    reason: RevisionMissing
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-01-29T16:42:19Z"
    reason: PredictorRouteReady not ready
    severity: Info
    status: "False"
    type: RoutesReady
  modelStatus:
    states:
      activeModelState: ""
      targetModelState: Pending
    transitionStatus: InProgress
  observedGeneration: 1
...

@Deezzir

Deezzir commented Jan 29, 2025

Update

We reran the previous attempt with the E2E Notebook Example, but this time we removed the following CRDs from the second backup:

certificates.networking.internal.knative.dev
clusterdomainclaims.networking.internal.knative.dev
configurations.serving.knative.dev
domainmappings.serving.knative.dev
images.caching.internal.knative.dev
ingresses.networking.internal.knative.dev
metrics.autoscaling.internal.knative.dev
podautoscalers.autoscaling.internal.knative.dev
revisions.serving.knative.dev
routes.serving.knative.dev
serverlessservices.networking.internal.knative.dev
services.serving.knative.dev
virtualservices.networking.istio.io

Those changes resolved the errors seen during the second restore in the previous run:

velero restore get
NAME                            BACKUP           STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
crds-backup-20250129222819      crds-backup      Completed   2025-01-29 22:28:20 +0300 MSK   2025-01-29 22:28:21 +0300 MSK   0        6          2025-01-29 22:28:20 +0300 MSK   <none>
profile-backup-20250129222622   profile-backup   Completed   2025-01-29 22:26:23 +0300 MSK   2025-01-29 22:26:24 +0300 MSK   0        0          2025-01-29 22:26:23 +0300 MSK   <none>

There are only some warnings that seem to be harmless and might be resolved in the future:

...
Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    controller-eks-controller:  could not restore, PersistentVolumeClaim "storage-controller-0" already exists. Warning: the in-cluster version is different than the backed-up version
    test:                       could not restore, AuthorizationPolicy "ns-owner-access-istio-charmed" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, AuthorizationPolicy "ns-owner-access-istio" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "access-ml-pipeline" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "mlflow-server-access-minio" already exists. Warning: the in-cluster version is different than the backed-up version
                                could not restore, PodDefault "mlflow-server-minio" already exists. Warning: the in-cluster version is different than the backed-up version
...

The ISVC was also correctly restored without any status errors. The model can be successfully rerun from the E2E Example Notebook. Unfortunately, the UI issues persisted.
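
A quick way to confirm the ISVC recovered (a sketch; the namespace is the one from the earlier status output):

kubectl get inferenceservices.serving.kserve.io -n kubeflow-user-example-com wine-regressor3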

@kimwnasptd
Contributor Author

kimwnasptd commented Feb 5, 2025

After a bit of debugging with @Deezzir, we got to the root cause of why the UI was complaining with the above error #1197 (comment): Error: Unexpected runtime state!

Yuri's intuition that it's because of an empty state of the Run in MySQL was right!

Root Cause Analysis

  1. We were (incorrectly) restoring the Argo Workflow that had already finished running
  2. The persistent agent (kfp-persistence charm) then:
    1. monitors the Argo Workflows
    2. saw the new, restored workflow
    3. tried to detect whether the Workflow's finishedAt is smaller than the TTL configuration
    4. because the status of the workflow is empty after the restore, the workflow had surpassed the TTL (Now - empty > TTL)
    5. so the agent calls the API Server's ReportWorkflow
  3. The KFP API's handler for ReportWorkflow then:
    1. deletes the Argo Workflow, since the restored Workflow had the PersistedFinalState() label pipeline/persistedFinalState set to true
    2. tries to UpdateRun with the calculated state in MySQL
    3. because the Workflow has an empty status, and thus no Conditions, the calculated state is empty
    4. the empty state gets stored in MySQL
  4. The KFP UI then tries to load the status of the Run, but because it has an empty state, it crashes with the above error

We also noticed, by looking at the KFP API requests from the browser, that the Run's state would indeed get corrupted at the moment we restored the Workflow, which aligns with the above understanding.
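
A hypothetical spot-check of this (not a command from the debugging session itself): the restored Workflow comes back with an empty .status, which is exactly the state the agent ends up reporting:

kubectl get workflows.argoproj.io -n kubeflow-user-example-com -o jsonpath='{.items[*].status}'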

Notes

In the code, they use the term Execution rather than Argo Workflow. This is because KFP Pipelines are an intermediate representation, and the execution engine can be either Argo or Tekton Workflows.

This, though, has increased the complexity of the backend, and upstream plans to remove Tekton support (kubeflow/pipelines#11438)


So at this point:

  1. We will not back up/restore any Argo resource
  2. We will retry our experiment and expect it to work now

@Deezzir

Deezzir commented Feb 6, 2025

Final Update

After the debugging session with @kimwnasptd, we decided to remove the following CRDs from the second backup:

workflowartifactgctasks.argoproj.io 
workfloweventbindings.argoproj.io 
workflows.argoproj.io 
workflowtaskresults.argoproj.io 
workflowtasksets.argoproj.io 
workflowtemplates.argoproj.io
clusterworkflowtemplates.argoproj.io

So, the final two backup commands are the following:

Backup of the Profiles

velero backup create profile-backup --include-resources profiles.kubeflow.org

Backup of the CRDs

velero backup create crds-backup --exclude-namespaces kubeflow --include-resources persistentvolumeclaims,persistentvolumes,secrets,authcodes.dex.coreos.com,authorizationpolicies.security.istio.io,certificaterequests.cert-manager.io,certificates.cert-manager.io,challenges.acme.cert-manager.io,clusterissuers.cert-manager.io,clusterservingruntimes.serving.kserve.io,clusterstoragecontainers.serving.kserve.io,compositecontrollers.metacontroller.k8s.io,controllerrevisions.metacontroller.k8s.io,decoratorcontrollers.metacontroller.k8s.io,destinationrules.networking.istio.io,envoyfilters.networking.istio.io,experiments.kubeflow.org,gateways.networking.istio.io,inferencegraphs.serving.kserve.io,inferenceservices.serving.kserve.io,issuers.cert-manager.io,mpijobs.kubeflow.org,mxjobs.kubeflow.org,notebooks.kubeflow.org,orders.acme.cert-manager.io,paddlejobs.kubeflow.org,peerauthentications.security.istio.io,poddefaults.kubeflow.org,profiles.kubeflow.org,proxyconfigs.networking.istio.io,pvcviewers.kubeflow.org,pytorchjobs.kubeflow.org,requestauthentications.security.istio.io,scheduledworkflows.kubeflow.org,serviceentries.networking.istio.io,servingruntimes.serving.kserve.io,sidecars.networking.istio.io,suggestions.kubeflow.org,telemetries.telemetry.istio.io,tensorboards.tensorboard.kubeflow.org,tfjobs.kubeflow.org,trainedmodels.serving.kserve.io,trials.kubeflow.org,viewers.kubeflow.org,wasmplugins.extensions.istio.io,workloadentries.networking.istio.io,workloadgroups.networking.istio.io,xgboostjobs.kubeflow.org

Using the above CRD list and restoring with the steps described here, we successfully restored the user namespace, with the Notebooks and other artifacts such as Runs correctly recovered.
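
For completeness, the restore sequence is the same as in the earlier updates (using the example profile from this issue):

kubectl delete ns kubeflow-user-example-com
kubectl delete profiles.kubeflow.org kubeflow-user-example-com

velero restore create --from-backup profile-backup
# Wait for the namespace to be created
velero restore create --from-backup crds-backup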

@kimwnasptd
Contributor Author

Great description @Deezzir! Marking this as done since we managed to perform the backup / restore.

Great work!!
