How the scheduler works
This document contains links to source code. It is recommended to follow along in the code while reading.
The JobScheduler runs in a while loop. At the end of each iteration it sleeps for one HEART_BEAT, which is 5 seconds by default and configurable through scheduler.heart.beat.
The loop can be interrupted by calling the REST API endpoint /admin/stopManager and restarted by calling /admin/startManager.
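The loop with its heartbeat sleep and the stop/start controls can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the class name `SchedulerLoop` and the methods `start_manager` and `stop_manager` are hypothetical stand-ins for whatever the REST handlers call, and the real work of an iteration is replaced by a counter.

```python
import threading


class SchedulerLoop:
    """Hypothetical stand-in for the JobScheduler loop (names are assumptions)."""

    def __init__(self, heart_beat_seconds=5.0):
        # 5 seconds mirrors the documented default of scheduler.heart.beat.
        self.heart_beat_seconds = heart_beat_seconds
        self._stop_requested = threading.Event()
        self._thread = None
        self.iterations = 0

    def _run(self):
        while not self._stop_requested.is_set():
            self.iterations += 1  # placeholder for the real work of one iteration
            # Sleep for one heartbeat, but wake up early if a stop is requested.
            self._stop_requested.wait(self.heart_beat_seconds)

    def start_manager(self):
        """What a handler for /admin/startManager might do."""
        if self.is_running():
            return  # already running, nothing to do
        self._stop_requested.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop_manager(self):
        """What a handler for /admin/stopManager might do."""
        self._stop_requested.set()
        if self._thread is not None:
            self._thread.join()

    def is_running(self):
        return self._thread is not None and self._thread.is_alive()
```

Waiting on an event instead of calling a plain sleep lets a stop request take effect immediately rather than after the remainder of the heartbeat interval.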
The different instances of Hyperdrive-Trigger coordinate through a central Postgres database. Coordination is needed because each instance has to know how many instances are currently running. Work is distributed at the granularity of Workflows, i.e. one instance is responsible both for running the sensor of a workflow and for executing the job submissions of that workflow. To that end, the workflow table contains a column scheduler_instance_id, which shows which instance a workflow belongs to. If scheduler_instance_id is NULL, the workflow has not been assigned to an instance yet.
The workflows are distributed evenly across the scheduler instances. The advantage of this distribution logic is that it can be computed independently on every scheduler instance without the need for a central distributor. The disadvantage is that distributing by workflow is coarse-grained: the actual workload (i.e. the number of job submissions) may be unevenly distributed, because some workflows run very often or contain many more steps than others.
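One way such a distribution can be computed without a central distributor is sketched below. The rule used here (rank the active instance ids, then assign each workflow by its id modulo the number of instances) is an assumption chosen for illustration; the text above only states that the distribution is even and independently computable. The function names are hypothetical.

```python
def owns_workflow(workflow_id, my_instance_id, active_instance_ids):
    """Return True if this instance should own the workflow.

    Assumed rule: sort the active instance ids, take this instance's
    position in that order as its rank, and claim every workflow whose
    id modulo the instance count equals that rank.
    """
    ranked = sorted(active_instance_ids)
    my_rank = ranked.index(my_instance_id)
    return workflow_id % len(ranked) == my_rank


def workflows_for_instance(workflow_ids, my_instance_id, active_instance_ids):
    """Select the workflows that this instance would claim."""
    return [
        w for w in workflow_ids
        if owns_workflow(w, my_instance_id, active_instance_ids)
    ]
```

Every instance that sees the same set of active instance ids reaches the same assignment, so no two instances claim the same workflow. The rule balances the number of workflows only, which is exactly the limitation described above: a workflow that submits many jobs counts the same as one that rarely runs.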
scheduler_instance_id is a foreign key to the table scheduler_instance. The table contains the following fields: id, status and last_heartbeat. It is used to keep track of the instances that are currently online. For a 3-instance setup it contains 3 rows. If an additional instance is added, it contains 4 rows; if instead one of the 3 instances is killed, it contains 2 rows, and so on.
At the beginning of each iteration, an instance inserts its row if it does not exist yet and updates the last_heartbeat field. The clock of the database is used for this, which avoids issues caused by unsynchronized clocks on different instances (see #708).
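The register-then-heartbeat step might look like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs without a server; the column types, the function names and the status value 'Active' are assumptions (the text names only the Deactivated status). The point it illustrates is that the timestamp comes from the database (`CURRENT_TIMESTAMP`) rather than from the instance's own clock.

```python
import sqlite3


def create_heartbeat_schema(conn):
    # Assumed layout: only the three documented columns.
    conn.execute(
        "CREATE TABLE scheduler_instance ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " status TEXT NOT NULL,"
        " last_heartbeat TEXT NOT NULL)"
    )


def register_instance(conn):
    """Insert a row for this instance and return its new id."""
    cur = conn.execute(
        "INSERT INTO scheduler_instance (status, last_heartbeat) "
        "VALUES ('Active', CURRENT_TIMESTAMP)"  # database clock, not local clock
    )
    conn.commit()
    return cur.lastrowid


def update_heartbeat(conn, instance_id):
    """Refresh the heartbeat; return False if no active row was found."""
    cur = conn.execute(
        "UPDATE scheduler_instance SET last_heartbeat = CURRENT_TIMESTAMP "
        "WHERE id = ? AND status = 'Active'",
        (instance_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Returning whether a row was actually updated gives the caller a way to notice that its row has been deactivated or deleted by another instance.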
If a scheduler instance does not update its heartbeat for 20 seconds (by default, configurable through scheduler.lag.threshold), its status is set to Deactivated by the first instance that successfully executes the update query. In a separate query, the scheduler_instance_id of all its associated workflows is set to NULL, and finally its row is deleted from the scheduler_instance table.
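The three steps above (deactivate, release workflows, delete) can be sketched as follows, again with sqlite3 standing in for Postgres. Several things here are assumptions made for the sake of a runnable example: the status value 'Active', heartbeats stored as plain seconds, and the current time passed in as `now` instead of being read from the database clock. The function and schema names are hypothetical.

```python
import sqlite3

LAG_THRESHOLD_SECONDS = 20  # documented default of scheduler.lag.threshold


def create_coordination_schema(conn):
    # Assumed minimal layout of the two tables involved.
    conn.executescript(
        """
        CREATE TABLE scheduler_instance (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL,
            last_heartbeat REAL NOT NULL
        );
        CREATE TABLE workflow (
            id INTEGER PRIMARY KEY,
            scheduler_instance_id INTEGER REFERENCES scheduler_instance(id)
        );
        """
    )


def remove_lagging_instances(conn, now, threshold=LAG_THRESHOLD_SECONDS):
    """Deactivate stale instances, release their workflows, delete their rows.

    Returns the number of instance rows that were deleted.
    """
    # Step 1: mark every instance whose heartbeat is too old as Deactivated.
    conn.execute(
        "UPDATE scheduler_instance SET status = 'Deactivated' "
        "WHERE status = 'Active' AND last_heartbeat < ?",
        (now - threshold,),
    )
    # Step 2: release the workflows that belonged to deactivated instances.
    conn.execute(
        "UPDATE workflow SET scheduler_instance_id = NULL "
        "WHERE scheduler_instance_id IN "
        "(SELECT id FROM scheduler_instance WHERE status = 'Deactivated')"
    )
    # Step 3: remove the deactivated rows.
    cur = conn.execute(
        "DELETE FROM scheduler_instance WHERE status = 'Deactivated'"
    )
    conn.commit()
    return cur.rowcount
```

Because the first update only matches rows that are still 'Active', a second instance running the same statement afterwards finds nothing left to deactivate.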
If a scheduler instance is wrongly deactivated by another instance, its next attempt to update its heartbeat fails, and it cleans up its sensors. After that, it can resume the loop by inserting a new row into the scheduler_instance table with a new id, as if it had been manually restarted. See also #738.
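The recovery path can be illustrated with a small in-memory simulation. Everything in it is hypothetical: the exception `InstanceDeactivatedError`, the `InMemoryInstanceTable` that replaces the database, and the `Scheduler` class are illustration-only names, and the sensors are reduced to a set of strings.

```python
class InstanceDeactivatedError(Exception):
    """Hypothetical error raised when the heartbeat update finds no active row."""


class InMemoryInstanceTable:
    """Stand-in for the scheduler_instance table."""

    def __init__(self):
        self.rows = {}
        self._next_id = 1

    def register(self):
        instance_id = self._next_id
        self._next_id += 1
        self.rows[instance_id] = "Active"
        return instance_id

    def update_heartbeat(self, instance_id):
        if self.rows.get(instance_id) != "Active":
            raise InstanceDeactivatedError(instance_id)


class Scheduler:
    def __init__(self, table):
        self.table = table
        self.instance_id = None
        self.sensors = set()

    def run_iteration(self):
        """Run one loop iteration; return False if this instance was deactivated."""
        if self.instance_id is None:
            # No row yet (first start, or after a deactivation): register anew.
            self.instance_id = self.table.register()
        try:
            self.table.update_heartbeat(self.instance_id)
        except InstanceDeactivatedError:
            # Another instance removed our row: drop local state and let the
            # next iteration register with a fresh id.
            self.sensors.clear()
            self.instance_id = None
            return False
        return True
```

For example, deleting the scheduler's row from the table between two iterations makes the next `run_iteration()` return False with an empty sensor set, and the iteration after that registers under a new id.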