
PLATO WP36 EAS pipeline prototype

Scheduling jobs to run through the pipeline

This page describes how to schedule a job to run on the pipeline.

1. Deploy the core pipeline infrastructure within Kubernetes

Having successfully installed Kubernetes, you should make sure that the database and message queue containers are alive within the cluster:

cd ../eas_controller/worker_orchestration
./deploy.py

It doesn't matter if you run this command many times; it does nothing if these containers are already running.

2. Check what pods are running

kubectl get pods -n=plato

This shows a list of the pods running within Kubernetes. It often takes a minute or two for them to reach the Running state.
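
If you prefer to block until everything is ready rather than polling by hand, a standard kubectl command (not part of the pipeline scripts) such as the following should work:

kubectl wait --for=condition=Ready pods --all -n=plato --timeout=300s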

3. Initialise the databases

Before you can connect to the database, you need to find out the host and port on which minikube exposes the MySQL and RabbitMQ services on the host machine. minikube provides a convenience command for doing this:

minikube service --url mysql -n=plato
minikube service --url rabbitmq-service -n=plato
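
Each command prints a URL of the form http://<minikube-ip>:<node-port>. For example, the output might look like this (illustrative values, matching the example commands below):

http://192.168.49.2:30036
http://192.168.49.2:30672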

Once you know the IP address and port number for each service, you can initialise the databases using commands like this:

cd eas_controller/database_initialisation
./init_schema.py --db_port 30036 --db_host 192.168.49.2
./init_queues.py --mq_port 30672 --mq_host 192.168.49.2

Alternatively, you may wish to configure the job queue to use SQL transactions rather than RabbitMQ as its queue implementation:

./init_queues.py --queue_implementation sql

Current tests suggest that the SQL-based job queue is considerably faster and more efficient than RabbitMQ.

4. Start some worker nodes

Now that the database has been initialised, it's possible to start some workers:

cd ../eas_controller/worker_orchestration
./deploy.py --worker eas-worker-synthesis-psls-batman --worker eas-worker-tls
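
The new worker pods should appear alongside the core services after a short delay; you can check on them with the same command as before (the pod names will carry auto-generated suffixes):

kubectl get pods -n=plato | grep eas-worker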

5. Submit a job into the task database

The Python script eas_controller/job_submission/submit.py is used to submit jobs to the cluster.

Jobs themselves are defined by JSON files listing a series of pipeline tasks to be run, either in sequence or in parallel depending on their inter-dependencies. The pipeline automatically determines how the list of tasks can be parallelised, based on which tasks read input file products generated by previous tasks.

There are numerous example JSON files in the demo_jobs directory of this repository. One of them can be launched as follows:

cd eas_controller/job_submission
./submit.py --tasks ../../demo_jobs/feature_tests/qats_test.json

The format of the JSON files is explained here.

6. Schedule the job

The step above creates an entry for the job in the task database, but doesn't insert it into the job queue that the workers listen to.

Before any tasks will run, you need to run a daemon which continually scans the database for tasks that are ready to run (i.e. tasks that are not waiting for input from other tasks):

cd eas_controller/queue_management
./schedule_all_waiting_tasks.py

This script enters an infinite loop, scanning the database for jobs that can be queued for immediate execution, so you may want to run it in a screen session.
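
For example, one way of keeping it running in the background is a detached screen session (standard screen usage, nothing specific to this pipeline):

screen -dmS eas-scheduler ./schedule_all_waiting_tasks.py
screen -r eas-scheduler     # re-attach later to check on it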

7. When things go wrong...

Various scripts in eas_controller can be used to recover from bad situations (a short example follows the list):

  • If you want to empty out the job queue, first kill the script schedule_all_waiting_tasks.py, and then run eas_controller/queue_management/empty_queue.py
  • If you want to re-run all the tasks which failed, then run eas_controller/queue_management/reschedule_all_unfinished_jobs.py
  • If you want to kill all the workers right away, then run eas_controller/worker_orchestration/stop.py
  • If you want to restart all the workers, so that they pick up a new build of their Docker containers, run eas_controller/worker_orchestration/restart_workers.py
  • If you want to restart the web interface, then run eas_controller/worker_orchestration/restart_web_interface.sh
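
For example, a typical recovery after a batch of failed tasks might look like the sketch below, run from the repository root after killing the scheduling daemon. This is only a sketch; the scripts may need the same database and queue connection options as the initialisation scripts above.

cd eas_controller/queue_management
./empty_queue.py                       # drain any stale entries from the job queue
./reschedule_all_unfinished_jobs.py    # put every unfinished task back in the queue
./schedule_all_waiting_tasks.py        # start the scheduling daemon again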

8. Diagnostics

If you want to monitor the cluster's progress, then the best place to start is the web interface, but there are also numerous diagnostic scripts to help.
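
Standard Kubernetes tooling is also useful for inspecting individual pods. For example (the pod name below is a placeholder for one of the names reported by kubectl get pods):

kubectl logs -n=plato <worker-pod-name>              # view a worker's log output
kubectl describe pod -n=plato <worker-pod-name>      # inspect events if a pod fails to start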


Author

This code is developed and maintained by Dominic Ford, at the Institute of Astronomy, Cambridge.