
Skyhook at AF


Configuration of separate Rook cluster @ UNL: Ceph Docker image

We need to use a Skyhook Ceph image in the Rook CRD: uccross/skyhookdm-arrow:vX.Y.Z.

Currently we are using the image uccross/skyhookdm-arrow:v0.4.0.
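
For reference, in the Rook CephCluster resource the image is set via spec.cephVersion.image. A minimal sketch of switching an existing cluster to the Skyhook image, assuming the CephCluster object in this namespace is named rook-ceph-skyhookdm (adjust the name to your deployment):

kubectl -n rook-ceph-skyhookdm patch cephcluster rook-ceph-skyhookdm \
  --type merge -p '{"spec": {"cephVersion": {"image": "uccross/skyhookdm-arrow:v0.4.0"}}}'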

More documentation is available from the Skyhook developers: https://github.com/uccross/skyhookdm-arrow-docker

Rook Ceph cluster configuration

After the cluster is updated, we need to deploy a Pod with the PyArrow library (including the SkyhookFileFormat API) installed to start interacting with the cluster. This can be achieved by following these steps:

Update the ConfigMap with the configuration options needed to load the Arrow CLS plugins:

kubectl apply -f cls.yaml

where cls.yaml is:

apiVersion: v1
kind: ConfigMap
data:
  config: |
    [global]
    debug ms = 1
    [osd]
    osd max write size = 250
    osd max object size = 256000000
    osd class load list = *
    osd class default list = *
    osd pool default size = 1
    osd pool default min size = 1
    osd crush chooseleaf type = 1
    osd pool default pg num = 128
    osd pool default pgp num = 128
    bluestore block create = true
    debug osd = 25
    debug bluestore = 30
    debug journal = 20
metadata:
  name: rook-config-override
  namespace: rook-ceph-skyhookdm
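
Rook only picks up the rook-config-override ConfigMap when the Ceph daemons (re)start, so the OSD pods may need to be restarted after changing it. A minimal sketch, assuming the default Rook pod labels:

kubectl -n rook-ceph-skyhookdm delete pod -l app=rook-ceph-osd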

Create a CephFS filesystem on the Rook cluster:

kubectl create -f filesystem.yaml

where filesystem.yaml is:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: cephfs
  namespace: rook-ceph-skyhookdm
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
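
To verify that the filesystem and its metadata servers came up, check the resources in the namespace (assuming the default Rook labels):

kubectl -n rook-ceph-skyhookdm get cephfilesystem
kubectl -n rook-ceph-skyhookdm get pod -l app=rook-ceph-mds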

Add a cephfs storage class rook-cephfs for the Rook Ceph cluster, to be used within the rook-ceph-skyhookdm namespace (with the Ceph pool named cephfs-data0 in our case):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph-skyhookdm.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph-skyhookdm

  # CephFS filesystem name into which the volume shall be created
  fsName: cephfs

  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: cephfs-data0

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph-skyhookdm
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph-skyhookdm
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph-skyhookdm

reclaimPolicy: Delete 
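
Apply the storage class and confirm that it is registered (storageclass.yaml is just an assumed file name for the manifest above):

kubectl apply -f storageclass.yaml
kubectl get storageclass rook-cephfs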

Add a PersistentVolumeClaim skyhook-pv-claim that will be mounted in the Coffea-casa Helm charts (in this example in the rook-ceph-skyhookdm namespace, with storage class rook-cephfs for the Skyhook-enabled Rook cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: skyhook-pv-claim
  namespace: rook-ceph-skyhookdm
  labels:
    app: rook-ceph-skyhookdm
spec:
  storageClassName: rook-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
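
Apply the claim and check that it reaches the Bound state (pvc.yaml is an assumed file name for the manifest above):

kubectl apply -f pvc.yaml
kubectl -n rook-ceph-skyhookdm get pvc skyhook-pv-claim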

Add Ceph-specific secrets in Kubernetes

Check the fsid and keyring values in the Ceph configuration file and keyring from any OSD/MON Pod.

  kubectl -n rook-ceph-skyhookdm exec [any-osd/mon-pod] -- cat /var/lib/rook/rook-ceph-skyhookdm/rook-ceph-skyhookdm.config
  kubectl -n rook-ceph-skyhookdm exec [any-osd/mon-pod] -- cat /var/lib/rook/rook-ceph-skyhookdm/client.admin.keyring
  

Please add the fsid and keyring values found in the previous step as a SealedSecret in the namespace where you are running the AF and plan to mount Skyhook (opendataaf-prod in our case):

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  creationTimestamp: null
  name: skyhook-secret
  namespace: opendataaf-prod
spec:
  encryptedData:
    fsid: xxxxxx
    keyring: xxxxxx
  template:
    metadata:
      creationTimestamp: null
      name: skyhook-secret
      namespace: opendataaf-prod
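
One way to generate the encryptedData fields is to build a regular Secret manifest and pipe it through kubeseal. A minimal sketch, assuming kubeseal can reach the sealed-secrets controller in the cluster and that keyring.txt holds the keyring found above:

kubectl -n opendataaf-prod create secret generic skyhook-secret \
  --from-literal=fsid=<fsid-from-previous-step> \
  --from-file=keyring=keyring.txt \
  --dry-run=client -o yaml \
  | kubeseal --format yaml > skyhook-sealedsecret.yaml
kubectl apply -f skyhook-sealedsecret.yaml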

Coffea-casa images supporting built-in Skyhook functionality

For testing we have been using two Ubuntu-based images (the CC7 images will be updated as well).

In these images, the Ceph configuration file is located at /home/cms-jovyan/.ceph/ceph.conf, and both the fsid and keyring values are populated from the values stored as Kubernetes SealedSecrets.

The images could still contain some UNL Ceph-specific settings; please report any such issues to us and we will fix them!

Integration in coffea-casa AF

In the Coffea-casa Helm charts, edit values.yaml so that the fsid from the SealedSecret is exposed as the environment variable SKYHOOK_CEPH_UUIDGEN and the keyring as SKYHOOK_CEPH_KEYRING (see the previous step if you forgot what fsid and keyring are):

    singleuser:
      image:
        pullPolicy: Always
      extraEnv:
        SERVICEX_HOST: http://opendataaf-servicex-servicex-app:8000
        LABEXTENTION_FACTORY_CLASS: LocalCluster
        LABEXTENTION_FACTORY_MODULE: dask.distributed
        SKYHOOK_CEPH_UUIDGEN:
          valueFrom:
            secretKeyRef:
              name: skyhook-secret
              key: fsid
        SKYHOOK_CEPH_KEYRING:
          valueFrom:
            secretKeyRef:
              name: skyhook-secret
              key: keyring

Also, don't forget to mount the Skyhook PVC into the singleuser pod:

    singleuser:
      storage:
        extraVolumes:
        - name: skyhook-shared
          persistentVolumeClaim:
            claimName: skyhook-pv-claim
        extraVolumeMounts:
          - name: skyhook-shared
            mountPath: /mnt/cephfs
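
Once the Helm release is updated, a quick sanity check from a terminal in the singleuser pod (assuming the mount path and variable names above) is:

ls /mnt/cephfs
env | grep SKYHOOK_CEPH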

Test and small benchmark

Check the connection status from the notebook terminal:

  $ ceph -s
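
If the connection works, you can also confirm that the Skyhook data pool (cephfs-data0 in this setup) exists:

  $ ceph osd pool ls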

Download an example dataset into /mnt/cephfs/. For example:

cd /mnt/cephfs
wget https://raw.githubusercontent.com/JayjeetAtGithub/zips/main/nyc.zip
unzip nyc.zip

Execute a mini test:

import pyarrow as pa
import pyarrow.dataset as ds

# SkyhookFileFormat pushes scan operations down into the Ceph OSDs; it takes
# the file format, the path to the Ceph configuration, and the data pool name.
format_ = ds.SkyhookFileFormat("parquet", "/home/cms-jovyan/.ceph/ceph.conf", "cephfs-data0")

# The NYC taxi example dataset is hive-partitioned by payment_type and VendorID.
partitioning_ = ds.partitioning(
    pa.schema([("payment_type", pa.int32()), ("VendorID", pa.int32())]),
    flavor="hive"
)

dataset_ = ds.dataset("file:///mnt/cephfs/nyc", partitioning=partitioning_, format=format_)

# Column projection and the filter are evaluated inside the storage layer.
print(dataset_.to_table(
        columns=['total_amount', 'DOLocationID', 'payment_type'],
        filter=(ds.field('payment_type') > 2)
).to_pandas())