Release v1.0.0-beta.1: Add "babysitter" and `datum_tries` support · faradayio/falconeri

This release adds a "babysitter" process inside each falconerid. We use it to monitor jobs and datums, and to detect and/or recover from various types of errors. Updating an existing cluster should be fine, but the babysitter is likely to spend a minute or two detecting and marking problems with old jobs, so please exercise appropriate caution.

We plan to stabilize a falconeri 1.0 with approximately this feature set. It has been in production for years, and the babysitter was the last missing critical feature.

Added

If a worker pod disappears off the cluster while processing a datum, detect this and set the datum to status = Status::Error. This is handled automatically by a "babysitter" thread in falconerid.
Add support for datum_tries in the pipeline JSON. Set this to 2, 3, etc., to automatically retry failed datums. This is also handled by the babysitter.
Periodically check to see whether a job has finished without being correctly marked as such. This is mostly intended to clean up existing clusters.
Periodically check to see whether a Kubernetes job has unexpectedly disappeared, and mark the corresponding falconeri job as having failed.
Add trace spans for most low-level database access.
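
As a rough illustration of how the datum_tries retry described above might behave, here is a minimal sketch in Rust. It is not falconeri's actual code: the `Datum` struct, its `attempts` field, the `Ready`, `Running`, and `Done` variants, and the `requeue_failed_datums` function are all hypothetical names chosen for this example. Only `Status::Error` and `datum_tries` come from these notes. The sketch assumes that `datum_tries` counts total attempts, so a value of 3 allows up to two retries.

```rust
/// Hypothetical datum states. Only `Error` is named in the release notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ready,
    Running,
    Done,
    Error,
}

/// Hypothetical in-memory view of a datum (the real data lives in a database).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Datum {
    pub status: Status,
    /// Number of times a worker has attempted this datum so far.
    pub attempts: u32,
}

/// One babysitter-style pass: any datum in `Status::Error` that has been
/// attempted fewer than `datum_tries` times is set back to `Status::Ready`
/// so a worker can pick it up again. Returns how many datums were requeued.
pub fn requeue_failed_datums(datums: &mut [Datum], datum_tries: u32) -> usize {
    let mut requeued = 0;
    for datum in datums.iter_mut() {
        if datum.status == Status::Error && datum.attempts < datum_tries {
            datum.status = Status::Ready;
            requeued += 1;
        }
    }
    requeued
}
```

Under these assumptions, a datum that has failed once with datum_tries set to 3 would be requeued, while a datum that has already been attempted three times would stay in the error state.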

Fixed

We now correctly update updated_at on all tables that have it.
