Recovery mechanism on update #15

Open
nielsdenissen opened this issue Sep 3, 2018 · 1 comment

Comments

@nielsdenissen

When an update fails for whatever reason, redeploy the old job.
For this we have to monitor how the new job behaves for a little while after startup, to catch issues that only surface just after deployment (e.g. inconsistent state).
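
A minimal sketch of what that monitoring could look like, assuming we poll Flink's GET /jobs/[JOBID] endpoint (which returns the job's "state") for a grace period after the update; the grace period, polling interval and function name are assumptions, not an agreed design:

```go
package deployer

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// monitorNewJob polls the Flink REST API for gracePeriod after an update
// and returns an error when the new job dies, signalling that the old job
// should be redeployed. Only the /jobs/[JOBID] endpoint and its "state"
// field come from Flink; everything else here is an assumption.
func monitorNewJob(baseURL, jobID string, gracePeriod time.Duration) error {
	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		var job struct {
			State string `json:"state"`
		}
		resp, err := http.Get(fmt.Sprintf("%s/jobs/%s", baseURL, jobID))
		if err != nil {
			return err
		}
		err = json.NewDecoder(resp.Body).Decode(&job)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if job.State == "FAILED" || job.State == "CANCELED" {
			return fmt.Errorf("new job entered state %s during grace period", job.State)
		}
		time.Sleep(5 * time.Second)
	}
	return nil // the new job survived the grace period
}
```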


nielsdenissen commented Sep 24, 2018

Initial investigation results

We (@mrooding and I) investigated this and found some issues. Essentially it boils down to the following: in order to recover, we need to:

  1. Know which JAR is currently deployed, so we can redeploy it in case of a failure
  2. Know which arguments were given to the currently running job

The first is difficult since the Flink API doesn't return this information anywhere, but it could be solved by, for instance, storing all JARs uploaded to the Flink cluster and looking for a matching job name (see Option 1 below).
The second is the bigger issue, as these arguments cannot be found anywhere in the API. This leads us to believe that to implement this feature properly, we'd have to maintain state in our deployer (the last deployed JAR with its run arguments, etc.), e.g. something like the record sketched below.
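
For illustration, the state we'd have to persist could be as small as the following record (a hypothetical sketch; all field names are assumptions):

```go
// DeploymentState is the per-job record the deployer would have to
// persist in order to roll back a failed update. Hypothetical sketch;
// all field names are assumptions.
type DeploymentState struct {
	JobName       string   // name of the currently running job
	JarPath       string   // JAR that was last deployed successfully
	ProgramArgs   []string // run arguments the job was started with
	SavepointPath string   // optional savepoint to restore from on rollback
}
```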

Solutions to Problem 1

Currently our deployer is making the following assumptions:

  • Your job name consists of 2 parts (e.g. "Aggregate All The Data 1.0.2"):
    1. A job basename that defines what job this is (e.g. "Aggregate All The Data")
    2. A version number that changes across versions for this job (e.g. "1.0.2")

We use this assumption to determine which of the potentially multiple running jobs needs to be updated.
Currently we've found 2 solutions to Problem 1:

Option 1: JARs on Flink

All JARs uploaded to Flink remain available on the Flink cluster (this requires persisting web.upload.dir). This allows us to determine the running JAR as follows (a sketch of the lookup follows the steps):

  1. Get http://localhost:8081/jobs/[JOBID] (get "name")
  2. Find the JAR that is running:
    a. List all JARs (http://localhost:8081/jars, get "files.id")
    b. For each JAR, find its name (http://localhost:8081/jars/[JAR_ID]/plan, get "plan.name")
    c. Take the JAR_ID whose plan name matches the job name (the JAR_ID is also the name of the file stored by Flink)
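
A sketch of that lookup against the Flink REST API (the three endpoints and the "name"/"files.id"/"plan.name" fields are the ones listed above; the Go helpers themselves are hypothetical):

```go
package deployer

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal views of the Flink REST API responses used above.
type jobDetails struct {
	Name string `json:"name"`
}

type jarList struct {
	Files []struct {
		ID string `json:"id"`
	} `json:"files"`
}

type jarPlan struct {
	Plan struct {
		Name string `json:"name"`
	} `json:"plan"`
}

// getJSON fetches a URL and decodes the JSON response into out.
func getJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// findRunningJar implements steps 1 and 2: resolve the running job's
// name, then scan all uploaded JARs for one whose plan name matches.
// The returned JAR_ID is also the filename Flink stores in web.upload.dir.
func findRunningJar(baseURL, jobID string) (string, error) {
	var job jobDetails
	if err := getJSON(fmt.Sprintf("%s/jobs/%s", baseURL, jobID), &job); err != nil {
		return "", err
	}
	var jars jarList
	if err := getJSON(baseURL+"/jars", &jars); err != nil {
		return "", err
	}
	for _, f := range jars.Files {
		var plan jarPlan
		if err := getJSON(fmt.Sprintf("%s/jars/%s/plan", baseURL, f.ID), &plan); err != nil {
			continue // skip JARs whose plan cannot be retrieved
		}
		if plan.Plan.Name == job.Name {
			return f.ID, nil
		}
	}
	return "", fmt.Errorf("no uploaded JAR matches job name %q", job.Name)
}
```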

Option 2: JARs remote

All JARs built are stored in a remote repository (e.g. Artifactory). The user has to provide the JAR they want to use as a fallback in case the update fails.
To ensure that the fallback JAR is the same as the one currently running, we could check that the JARs match by comparing the job names they produce, and abort the update when they don't match (see the sketch below).
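
A sketch of that check, reusing the getJSON/jobDetails/jarPlan helpers from the Option 1 snippet and assuming the fallback JAR has already been uploaded to the cluster (its id coming back from POST /jars/upload); the function name is hypothetical:

```go
// verifyFallbackJar aborts the update when the user-supplied fallback
// JAR does not produce the same job name as the job currently running.
// Hypothetical sketch building on the Option 1 helpers above.
func verifyFallbackJar(baseURL, jobID, fallbackJarID string) error {
	var job jobDetails
	if err := getJSON(fmt.Sprintf("%s/jobs/%s", baseURL, jobID), &job); err != nil {
		return err
	}
	var plan jarPlan
	if err := getJSON(fmt.Sprintf("%s/jars/%s/plan", baseURL, fallbackJarID), &plan); err != nil {
		return err
	}
	if plan.Plan.Name != job.Name {
		return fmt.Errorf("fallback JAR job name %q does not match running job %q",
			plan.Plan.Name, job.Name)
	}
	return nil
}
```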
