v1.0.0-beta.1: Add "babysitter" and `datum_tries` support

Pre-release
@emk released this 24 Nov 17:59

This release adds a "babysitter" process inside each falconerid. We use it to monitor jobs and datums, and to detect and/or recover from various types of errors. Updating an existing cluster should be fine, but the babysitter is likely to spend a minute or two detecting and marking problems with old jobs, so please exercise appropriate caution.

We plan to stabilize a falconeri 1.0 with approximately this feature set. falconeri has been in production for years, and the babysitter was the last missing critical feature.

Added

  • If a worker pod disappears off the cluster while processing a datum, detect this and set the datum to `status = Status::Error`. This is handled automatically by a "babysitter" thread in falconerid.
  • Add support for `datum_tries` in the pipeline JSON. Set this to 2, 3, etc., to automatically retry failed datums. This is also handled by the babysitter.
  • Periodically check to see whether a job has finished without being correctly marked as such. This is mostly intended to clean up existing clusters.
  • Periodically check to see whether a Kubernetes job has unexpectedly disappeared, and mark the corresponding falconeri job as having failed.
  • Add trace spans for most low-level database access.
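
The datum handling described above can be pictured with a minimal Rust sketch. This is not falconeri's actual code: the `Datum` struct, the `attempts` field, the `babysit_pass` function, and every status other than `Status::Error` are hypothetical names chosen for illustration. It also assumes that `datum_tries` counts total attempts, so that a value of 2 allows one retry.

```rust
/// Hypothetical status values. Only `Status::Error` is named in these
/// release notes; the other variants are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ready,
    Running,
    Done,
    Error,
}

/// Hypothetical, simplified view of a datum.
#[derive(Debug, Clone)]
struct Datum {
    status: Status,
    /// Number of attempts started so far (assumed bookkeeping field).
    attempts: u32,
}

/// One illustrative babysitter pass over a single datum.
///
/// `pod_alive` says whether the worker pod processing the datum still
/// exists. `datum_tries` is assumed to be the total number of attempts
/// allowed.
fn babysit_pass(datum: &mut Datum, pod_alive: bool, datum_tries: u32) {
    // A running datum whose worker pod has vanished is marked as failed.
    if datum.status == Status::Running && !pod_alive {
        datum.status = Status::Error;
    }
    // A failed datum with attempts remaining is made available again.
    if datum.status == Status::Error && datum.attempts < datum_tries {
        datum.status = Status::Ready;
    }
}
```

Under these assumptions, a datum on its first attempt whose pod disappears goes back to `Ready` when `datum_tries` is 2, while the same failure on the second attempt leaves it at `Status::Error`.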

Fixed

  • We now correctly update `updated_at` on all tables that have it.