
Note

This technote is not yet published.

Notes on running DRP pipelines using PanDA (Production ANd Distributed Analysis system)

1   Introduction

PanDA (Production ANd Distributed Analysis) is a workload management system for distributed (grid-based), cloud, or HPC computing. PanDA-based setups usually include a few dedicated components, such as the PanDA server with its JEDI extension, Harvester, the Pilot, and iDDS; each of these is covered in the references.

All these components have a well-defined scope of functionality, and their independent installation makes it possible to run deeply customized instances. In 2019, the ATLAS PanDA instance landed 394M computing jobs on 165 computation queues distributed all over the world, and more than 3000 users use the BigPanDA monitor. The PanDA system has also been successfully adopted by other experiments, e.g. COMPASS [compass_paper].

In this note, we describe the PanDA setup created for Rubin evaluation and test data processing.

2   Setup

We have made extensive use of the results of both the ATLAS Google PoC [atlas_google_poc] and the Rubin Google PoC [rubin_google_poc] projects. The following components of the setup were configured specifically for this exercise:

  • Harvester. A Kubernetes edge service has been configured in the DOMA Harvester instance to handle two clusters deployed on the Google Kubernetes Engine (GKE). We split all tasks in the workflow by memory requirements into two groups: high memory (16 GB per pod) and conventional memory (4 GB per pod). The queues operate in PULL mode: a few pilot instances idle continuously in the GKE pods and periodically check for new jobs. The number of running pilots, and accordingly the number of activated nodes, dynamically increases to match the number of jobs ready for immediate submission to a queue. Details of the Harvester configuration for Google Kubernetes Engine are described in [harvester_gke_manual].
  • Pilot. The Pilot is a generic middleware component for grid computing. Instead of submitting payload jobs directly to the grid gatekeepers, pilot factories are used to submit special lightweight jobs, referred to here as pilot wrappers, that are executed on the worker nodes. The pilot is responsible for pulling the actual payload from the PanDA server and any input files from the Storage Element (SE), executing the payload, uploading the output to the SE, monitoring resource usage, sending the heartbeat signal, and evaluating and sending the final job status to the PanDA server. Pilot jobs are started by pilot-starter jobs, which are provided by the Harvester. In the created setup the pilot uploads the log files produced by each job to a Google Cloud Storage bucket for long-term storage.
  • iDDS. This system synchronizes payload execution according to the Directed Acyclic Graph (DAG) generated by the BPS subsystem. Its basic role is to check the payload processing status and to resolve pseudo-inputs in successor tasks once the corresponding predecessor job has finished. The PanDA plugin uses the iDDS client to submit the workflow; once it is submitted, iDDS sends the payload to the PanDA server and manages its execution. iDDS also provides an API for managing the workflow as a whole, e.g. cancellation and retries.
  • Google Cloud configuration. As noted above, we defined two Kubernetes clusters with different machine types: n2-standard-4 with 4 cores and ~4 GB of memory per core, and n2-custom-4-43008-ext with 4 cores and ~42 GB of RAM in total. Part of each machine's CPU power and memory is reserved by GKE for management purposes, which is why each machine cannot accept more than 3 jobs. The disk size for all machines was set to 200 GB with the ordinary performance type (not SSD). All nodes were preemptible, and autoscaling was enabled for both clusters. Data is kept in Google Cloud Storage and accessed using the gs protocol. All payloads are executed in Docker containers. We have deployed a Docker-over-Docker configuration, with the outer container responsible for providing the OSG worker node software stack for landing the Pilot and performing X.509 proxy authentication to communicate with the PanDA server. The inner container contains the standard Rubin software published in the official DockerHub repository. A cluster-creation sketch is given after this list.
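For illustration, a preemptible, autoscaled cluster of the high-memory type could be created with a gcloud command along the following lines. This is a sketch only; the real clusters are provisioned through the Terraform configuration described in Section 3, and the autoscaling limits used here are placeholders.

# Sketch: preemptible GKE cluster with n2-custom-4-43008-ext nodes
# (4 vCPUs, 43008 MB of RAM) and 200 GB standard persistent disks.
gcloud container clusters create highmem \
    --region=us-central1 \
    --machine-type=n2-custom-4-43008-ext \
    --preemptible \
    --disk-type=pd-standard --disk-size=200 \
    --num-nodes=1 \
    --enable-autoscaling --min-nodes=0 --max-nodes=10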

There were additional improvements to the JEDI and iDDS core, but they were done in a workflow-agnostic way.

3   PanDA Queues on GKE Clusters and GCS Buckets

3.1   GKE Clusters

In the project panda-dev-1a74, we defined six Kubernetes (GKE) production clusters and one small GKE test cluster (developmentcluster). All clusters are deployed using the corresponding [Terraform] configuration. The repository with the deployment scripts is available in the LSST GitHub [deployemnt_project].
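The usual Terraform workflow for applying such a configuration looks roughly as follows (a sketch only; the concrete module layout and variable files inside the repository may differ):

# Sketch: deploy or update the GKE cluster definitions from the deployment repository.
git clone https://github.com/lsst/idf_deploy.git
cd idf_deploy        # change into the relevant environment/module directory
terraform init       # download providers and initialize the backend
terraform plan       # review the proposed changes
terraform apply      # create or update the clusters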

Currently there are 7 GKE clusters:

% gcloud container clusters list
NAME                       LOCATION     MASTER_VERSION    MASTER_IP       MACHINE_TYPE            NODE_VERSION        NUM_NODES  STATUS
developmentcluster         us-central1  1.22.8-gke.2200   35.239.22.197   n2-custom-6-8960        1.21.10-gke.1500 *  1          RUNNING
extra-highmem              us-central1  1.22.8-gke.2200   35.193.135.73   n2-custom-2-240640-ext  1.21.10-gke.1500 *  4          RUNNING
extra-highmem-non-preempt  us-central1  1.22.8-gke.2200   104.198.73.122  n2-custom-2-240640-ext  1.22.8-gke.200 *    2          RUNNING
highmem                    us-central1  1.21.11-gke.1100  35.224.254.34   n2-custom-4-43008-ext   1.21.11-gke.900 *   3          RUNNING
highmem-non-preempt        us-central1  1.21.11-gke.1100  35.193.45.57    n2-custom-4-43008-ext   1.21.11-gke.1100    5          RUNNING
merge                      us-central1  1.22.8-gke.2200   34.70.152.234   n2-standard-4           1.21.10-gke.1500 *  2          RUNNING
moderatemem                us-central1  1.21.11-gke.1100  34.69.213.236   n2-standard-4           1.21.5-gke.1302 *   3          RUNNING

3.2   PanDA Queues

Six production PanDA queues (plus a development queue) were configured in the [CRIC] system to match particular job requirements:

  • DOMA_LSST_GOOGLE_TEST (GKE cluster: moderatemem). This is a cluster for jobs that are not sensitive to node preemption and require no more than 3200 MB of RAM. The GKE cluster is configured to use n2-standard-4 machines, which offer 4 cores with 16 GB of total memory. These CPUs and memory are shared between the jobs assigned to a particular node and the system pods which perform machine-level health monitoring, log delivery, event collection, and other Kubernetes service functions. This is why the available computing power is reduced, and the values of 0.85 core and 3200 MB of RAM per job are experimentally proven values that allow fitting 4 jobs on every cluster node. These values are defined in the Kubernetes job definition YAML, which is used by Harvester in the job submission phase (see the resource-request sketch after this list). This cluster lands the majority of jobs in Rubin's payload.
  • DOMA_LSST_GOOGLE_TEST_HIMEM (GKE cluster: highmem). For jobs requiring more than 3200 MB but less than 18000 MB of RAM, we defined a high-memory preemptible cluster. This cluster uses n2-custom-4-43008-ext machines and can land up to 2 jobs per node. The machine choice was motivated by the following: the "ext" memory is priced higher than standard memory, and we cannot order fewer than 4 cores for such an amount of memory. Further optimization is possible.
  • DOMA_LSST_GOOGLE_TEST_EXTRA_HIMEM (GKE cluster: extra-highmem). This is a queue for extremely memory-demanding jobs and allows them to allocate 220000 MB of memory (there is some memory overhead from the Kubernetes components). If a submitted task requests more RAM than DOMA_LSST_GOOGLE_TEST_HIMEM can provide, the job is assigned to this queue.
  • DOMA_LSST_GOOGLE_MERGE (GKE cluster: merge). This is a special queue to run the merge jobs finalizing each submitted workflow. This queue has been excluded from the automatic PanDA brokerage, and tasks are assigned to it using the queue definition parameter in the Rubin BPS submission YAML. The distinguishing property of the corresponding backend cluster is that the number of concurrent jobs is very limited. This limitation allows controlling the number of active connections to the Butler Postgres DB.
  • DOMA_LSST_GOOGLE_TEST_HIMEM_NON_PREEMPT (GKE cluster: highmem-non-preempt). We have observed experimentally that jobs lasting more than 12 hours have a low probability of success due to node preemption. This significantly increases the duration of a workflow run, because it takes a few days of running and failing attempts before a retry attempt finally survives. Such long-lasting retry attempts with a low survival rate also negatively impact the cost efficiency. To increase the chances for such long jobs to finish on the first attempt, we created a special non-preemptible queue. In terms of CPU and RAM, the queue is equivalent to DOMA_LSST_GOOGLE_TEST_HIMEM.
  • DOMA_LSST_GOOGLE_TEST_EXTRA_HIMEM_NON_PREEMPT (GKE cluster: extra-highmem-non-preempt). This queue was created for the same reason as DOMA_LSST_GOOGLE_TEST_HIMEM_NON_PREEMPT: to give long-lasting jobs a chance to finish on the first attempt on non-preemptible nodes. In terms of CPU and RAM, the queue is equivalent to DOMA_LSST_GOOGLE_TEST_EXTRA_HIMEM.
  • DOMA_LSST_DEV (GKE cluster: developmentcluster). This cluster is used for testing developments before deployment into the production environment.
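For illustration, the per-job CPU and memory values discussed above would appear in a Harvester "k8s_yaml_file" job definition roughly as follows. This is a sketch only, using the DOMA_LSST_GOOGLE_TEST numbers; it is not one of the actual production YAML files described below, and the job and container names are placeholders.

# Sketch of the container resources section of a Kubernetes job definition
# for DOMA_LSST_GOOGLE_TEST (0.85 core and 3200 MB of RAM per job).
cat > kube_job_sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: grid-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pilot
        image: us.gcr.io/panda-dev-1a74/centos:7-stack-lsst_distrib-w_2021_21_osg_d3
        resources:
          requests:
            cpu: "0.85"
            memory: "3200Mi"
          limits:
            memory: "3200Mi"
EOF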

The queue configuration files are available in the panda-conf GitHub repository.

The JSON file panda_queueconfig.json defines all PanDA queues on the Harvester server. kube_job.yaml provides the Kubernetes job configuration for the DOMA_LSST_GOOGLE_TEST_HIMEM, DOMA_LSST_GOOGLE_TEST_EXTRA_HIMEM, and DOMA_LSST_GOOGLE_MERGE queues. kube_job_moderate.json defines the Kubernetes jobs on DOMA_LSST_GOOGLE_TEST, and kube_job_non_preempt.yaml those on DOMA_LSST_GOOGLE_TEST_HIMEM_NON_PREEMPT and DOMA_LSST_GOOGLE_TEST_EXTRA_HIMEM_NON_PREEMPT. The YAML file job_dev-prmon.yaml is for the test queue DOMA_LSST_DEV.

The above "k8s_yaml_file" files instruct POD:

  • what container image is used.
  • what credentials are passed.
  • what commands run in the container on the pod.

While the "k8s_config_file" files associate PanDA queues with their corresponding GKE clusters, which be explained in the next subsection.

For the production queues, the commands inside the container are passed to "bash -c":

whoami;cd /tmp;export ALRB_noGridMW=NO; wget https://storage.googleapis.com/drp-us-central1-containers/pilots_starter_d3.py; chmod 755 ./pilots_starter_d3.py; ./pilots_starter_d3.py || true

It downloads and runs the pilot starter script, which starts a new pilot job.

For debugging purposes, a pod can be created independently with a test YAML file, using a different metadata name, e.g. test-job, in the YAML file. For example:

kubectl create -f test.yaml
kubectl get pods -l job-name=test-job
kubectl exec -it $podName -- /bin/bash

This creates a pod under the job-name test-job and opens a shell in the container on that pod for debugging, where $podName is the pod name reported by "kubectl get pods".

3.2.1   Association of PanDA queues with GKE Clusters

In order to associate a new GKE cluster with the corresponding PanDA queue, a "k8s_config_file" file needs to be created. Taking the cluster "extra-highmem-non-preempt" as an example:

export KUBECONFIG=/data/idds/gcloud_config_rubin/kube_extra_large_mem_non_preempt
gcloud container clusters get-credentials --region=us-central1 extra-highmem-non-preempt
chmod og+rw $KUBECONFIG

3.2.2   GKE Authentication for PanDA Queues

The environment variable CLOUDSDK_CONFIG defines the location of the Google Cloud SDK's config files. On the Harvester server machine the environment variable is defined in the file /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi. During the creation of new workers, the Harvester server needs to run the Google Cloud authentication command:

gcloud config config-helper --format=json

To check which account is used in the Google Cloud authentication, run gcloud auth list:

% gcloud auth list
   Credentialed Accounts
ACTIVE  ACCOUNT
*       [email protected]
        [email protected]
        [email protected]
        [email protected]

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

The GKE authentication account has been changed to use the service account dev-panda-harvester.

To modify the active account, first run the Google Cloud authentication command "gcloud auth login", then run gcloud config set account `ACCOUNT`.

3.2.3   AWS Access Key for S3 Access to GCS Buckets

Rubin jobs need to access the GCS Butler bucket through the S3 (botocore) interface, hence AWS-style authentication is required. The AWS access key and secret key are stored in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

Currently the AWS access key from the service account butler-gcs-butler-gcs-data-sa@data-curation-prod-fbdb.iam.gserviceaccount.com is used, as shown on the interoperability settings page for the project data-curation-prod-fbdb.

The AWS access key is passed to the pod nodes via Kubernetes secrets, whose data fields have to be base64-encoded strings. The AWS access key is then passed as environment variables into the Rubin Docker containers.
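A minimal sketch of how such a secret could be created (the secret name is a placeholder, not necessarily the one used in the production configuration):

# Sketch: store the AWS-style key pair as a Kubernetes secret; kubectl
# base64-encodes the values into the secret's data field automatically.
kubectl create secret generic aws-butler-credentials \
    --from-literal=AWS_ACCESS_KEY_ID="<access key id>" \
    --from-literal=AWS_SECRET_ACCESS_KEY="<secret access key>"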

3.3   GCS Buckets

In the Google Cloud Storage (GCS), we defined two buckets, drp-us-central1-containers and drp-us-central1-logging, as shown below:

/_static/GCS_Buckets-in-Rubin.jpg

A third bucket, with a name of the form "us.artifacts.*", was automatically created by Google Cloud Build to store the built container images.

As the bucket names indicate, the bucket drp-us-central1-containers accommodates the container image files, the pilot-related files, and the PanDA queue configuration files, while the bucket drp-us-central1-logging stores the log files of the pilot and payload jobs.
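For reference, the contents of both buckets can be listed with gsutil, assuming the caller has the appropriate permissions on the project:

# List the container/pilot/configuration files and the job log files.
gsutil ls gs://drp-us-central1-containers/
gsutil ls gs://drp-us-central1-logging/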

The logging bucket is configured in Uniform access mode, allowing public access and granting a special service account gcs-access the roles roles/storage.legacyBucketWriter and roles/storage.legacyObjectReader. The credential JSON file of this special service account is generated with the following command:

gcloud iam service-accounts keys create gcs-access.json --iam-account=gcs-access@${projectID}.iam.gserviceaccount.com

Here, $projectID is panda-dev-1a74. The key file is then passed to the container on the pod nodes via the secret named gcs-access, with the environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to the JSON file.
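The key file can be wrapped into the gcs-access Kubernetes secret along the following lines (a sketch only; the actual secret handling may differ):

# Sketch: create the gcs-access secret from the generated key file; inside the
# container GOOGLE_APPLICATION_CREDENTIALS points to the mounted JSON file.
kubectl create secret generic gcs-access --from-file=gcs-access.json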

4   Job Run Procedure in PanDA

The PanDA system is outlined in the following diagram:

/_static/PandaSys.png

A detailed description of these components is presented in the slides of the PanDA status update talk.

4.1   Job Submission

As described in the PanDA Orchestration User Guide, jobs generated by the BPS subsystem are grouped into tasks by the PanDA plugin, using the job labels as the grouping criterion. In this way, each task performs its unique principal operation over different Data/Node ids, and each job has its own input Data/Node id. The submission YAML file is described here: configuration YAML file. Once the PanDA plugin has generated a workflow of dependent jobs united into tasks, it submits them to iDDS, performing the necessary authentication against the PanDA server. The PanDA monitoring page will then show the tasks in the status registered, as shown below:

/_static/Jobs-registered.jpg
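From the Rubin side, such a workflow is submitted with the BPS command-line tool; a minimal sketch (the YAML file name is a placeholder for the actual submission configuration):

# Sketch: submit a BPS workflow; the PanDA plugin translates it into
# iDDS/PanDA tasks as described above.
bps submit submit_panda.yaml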

4.2   Job Starting

The Harvester server, ai-idds-02.cern.ch, continuously queries the PanDA server for the number of jobs to run and then triggers the corresponding GKE cluster to start up the needed pod nodes. At this moment, the status of those tasks/jobs changes to running, as shown below:

/_static/Jobs-running.jpg

4.3   Job Running

The pod nodes run the pilot/Rubin container, for example us.gcr.io/panda-dev-1a74/centos:7-stack-lsst_distrib-w_2021_21_osg_d3, as configured in the GKE cluster. Each job on a pod node starts one pilot job inside the container. The pilot job first gets the corresponding PanDA queue configuration and the associated storage ddmendpoint (RSE) configuration from CRIC.

The pilot job uses the provided job definition in the case of PUSH mode, or retrieves the job definition itself in the case of PULL mode, and then runs the provided payload job. In PULL mode, one pilot job can get and run multiple payload jobs one by one. After the payload job finishes, the pilot uses the Python client for GCS to write the payload job log file into the Google Cloud Storage bucket, which is defined in the PanDA queue and RSE configuration. The pilot then updates the job status, including the public access URL of the log files, as shown below:

/_static/Jobs-done.jpg

If a job does not finish successfully, its status becomes failed.

The pilot communication with the PanDA server is authenticated with a valid grid proxy, which is passed to the container through the pod. Similarly, a credential JSON file of the GCS bucket access service account is passed to the container in order to read from and write to the GCS bucket with the Python client for Google Cloud Storage.

4.4   Job Monitoring

Users can visit the PanDA monitoring server, https://panda-doma.cern.ch/, to check the workflow/task/job status. The PanDA monitor fetches the payload information from the central database and provides drill-down functionality, starting from a workflow and finishing at a particular job log. Clicking on a task ID opens the details of that task, and clicking on the number under a job status such as running, finished, or failed shows the list of jobs in that status. You can check the details of each job by following its PanDA ID number.

4.5   Real-time Logging

The Rubin jobs on the PanDA queues are also provided with (near) real-time logging on Google Cloud Logging. Once the jobs are running on the PanDA queues, users can check the JSON-format job logs in the Google Logs Explorer. To access it, you need to log in with your lsst.cloud Google account and select the project "panda-dev" (full name panda-dev-1a74).

In the Google Logs Explorer, you compose a query. Include the logName Panda-RubinLog in the query:

logName="projects/panda-dev-1a74/logs/Panda-RubinLog"

For the jobs of a specific PanDA task, you can add a field condition on jsonPayload.TaskID in the query, such as:

logName="projects/panda-dev-1a74/logs/Panda-RubinLog"
jsonPayload.TaskID="6973"

For a specific individual PanDA job, you can include the field jsonPayload.PandaJobID, or search for a substring such as "Importing" in the log message:

logName="projects/panda-dev-1a74/logs/Panda-RubinLog"
jsonPayload.TaskID="6973"
jsonPayload.message:"Importing"

Or ask for logs containing the field "MDC.RUN":

logName="projects/panda-dev-1a74/logs/Panda-RubinLog"
jsonPayload.TaskID="6969"
jsonPayload.MDC.RUN:*

You will get something like:

/_static/Screenshot-GoogleLogsQuery-20211012.jpg

You can change the time period from the top panel. The default is the last hour.

You can also pull down the Configure menu (on the middle right) to change what is displayed in the SUMMARY column of the query results. By default, it shows the content of jsonPayload.message.

More fields are available in the query. As you type in the query window, it shows autocomplete options for the fields.

You can visit the page of Advanced logs queries for more details on the query syntax.
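The same queries can also be run from the command line with the gcloud CLI, which can be convenient for scripted checks (a sketch, assuming the caller is authenticated against the panda-dev-1a74 project):

# Fetch recent Panda-RubinLog entries for a given PanDA task.
gcloud logging read \
    'logName="projects/panda-dev-1a74/logs/Panda-RubinLog" AND jsonPayload.TaskID="6973"' \
    --project=panda-dev-1a74 --limit=20 --format=json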

5   Support

There are two lines of support: Rubin-specific and core PanDA components. For front-line support we established a dedicated Slack channel: #rubinobs-panda-support. If a problem goes beyond the Rubin deployment, the corresponding development team can be involved. The support channel for each subsystem of the setup is given in its particular documentation.

6   References

[PanDA_documentation] PanDA Documentation Page, https://panda-wms.readthedocs.io/en/latest/
[PanDA_paper] Evolution of the ATLAS PanDA workload management system for exascale computational science, https://www.researchgate.net/publication/274619051_Evolution_of_the_ATLAS_PanDA_workload_management_system_for_exascale_computational_science
[JEDI_twiki] JEDI Twiki Page, https://twiki.cern.ch/twiki/bin/view/PanDA/PandaJEDI
[Harvester_documentation] Harvester Documentation, https://github.com/HSF/harvester/wiki
[Harvester_slides] Harvester Slides, http://cds.cern.ch/record/2625435/files/ATL-SOFT-SLIDE-2018-400.pdf
[Harvester_paper] Harvester: an edge service harvesting heterogeneous resources for ATLAS, https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_03030.pdf
[Pilot_documentation] Pilot documentation, https://github.com/PanDAWMS/pilot2/wiki
[Pilot_paper] The next generation PanDA Pilot for and beyond the ATLAS experiment, https://cds.cern.ch/record/2648507/files/Fulltext.pdf
[iDDS_documentation] iDDS documentation, https://idds.readthedocs.io/en/latest/
[iDDS_slides] iDDS slides, https://indico.cern.ch/event/849155/contributions/3576915/attachments/1917085/3170006/idds_20100927_atlas_sc_week.pdf
[monitoring_paper] BigPanDA monitoring paper, https://inspirehep.net/files/37c79d51eadd0e8ec8e019aef8bbcfd8
[CRIC_slides] https://indico.cern.ch/event/578991/contributions/2738744/attachments/1538768/2412065/20171011_GDB_CRIC_sameNEC.pdf
[compass_paper] http://ceur-ws.org/Vol-1787/385-388-paper-67.pdf
[atlas_google_poc] https://indico.bnl.gov/event/8608/contributions/38034/attachments/28380/43694/HEP_Google_May26_2020.pdf
[rubin_google_poc] https://dmtn-157.lsst.io/
[harvester_gke_manual] https://github.com/HSF/harvester/wiki/Google-Kubernetes-Engine-setup-and-useful-commands
[Terraform] https://learn.hashicorp.com/collections/terraform/gcp-get-started
[deployemnt_project] https://github.com/lsst/idf_deploy
[CRIC] https://datalake-cric.cern.ch/