Recovery mechanism on update #15

Open
nielsdenissen opened this issue Sep 3, 2018 · 1 comment

Comments

@nielsdenissen

When an update fails for whatever reason, redeploy the old job.
For this we have to monitor how the new job behaves for a little while after startup, to catch issues that only surface just after deployment (e.g. inconsistent state).
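
A minimal sketch of what that monitoring could look like, assuming we poll Flink's GET /jobs/[JOBID] endpoint (which returns the job's "state") for a grace period after the update; the grace period, polling interval and function name are assumptions, not an agreed design:

```go
package deployer

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// monitorNewJob polls the Flink REST API for gracePeriod after an update
// and returns an error when the new job dies, signalling that the old job
// should be redeployed. Only the /jobs/[JOBID] endpoint and its "state"
// field come from Flink; everything else here is an assumption.
func monitorNewJob(baseURL, jobID string, gracePeriod time.Duration) error {
	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		var job struct {
			State string `json:"state"`
		}
		resp, err := http.Get(fmt.Sprintf("%s/jobs/%s", baseURL, jobID))
		if err != nil {
			return err
		}
		err = json.NewDecoder(resp.Body).Decode(&job)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if job.State == "FAILED" || job.State == "CANCELED" {
			return fmt.Errorf("new job entered state %s during grace period", job.State)
		}
		time.Sleep(5 * time.Second)
	}
	return nil // the new job survived the grace period
}
```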


nielsdenissen commented Sep 24, 2018

Initial investigation results

We (@mrooding and I) investigated this and found some issues. Essentially it boils down to the following: in order to recover, we need to:

  1. Know which JAR is currently deployed, so we can redeploy it in case of a failure
  2. Know which arguments were given to the currently running job

The first is difficult since the Flink API doesn't return this information anywhere, but it could be solved by, for instance, storing all JARs uploaded to the Flink cluster and looking for a matching job name (see Option 1 below).
The second is the bigger issue, as these arguments cannot be found anywhere in the API. This leads us to believe that to implement this feature properly, we'd have to maintain state in our deployer (the last deployed JAR with its run arguments, etc.), e.g. something like the record sketched below.
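
For illustration, the state we'd have to persist could be as small as the following record (a hypothetical sketch; all field names are assumptions):

```go
// DeploymentState is the per-job record the deployer would have to
// persist in order to roll back a failed update. Hypothetical sketch;
// all field names are assumptions.
type DeploymentState struct {
	JobName       string   // name of the currently running job
	JarPath       string   // JAR that was last deployed successfully
	ProgramArgs   []string // run arguments the job was started with
	SavepointPath string   // optional savepoint to restore from on rollback
}
```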

Solutions to Problem 1

Currently our deployer is making the following assumptions:

  • Your job name consists of 2 parts (e.g. "Aggregate All The Data 1.0.2"):
    1. A job basename that defines what job this is (e.g. "Aggregate All The Data")
    2. A version number that changes across versions for this job (e.g. "1.0.2")

We use this assumption to determine which of the potentially multiple running jobs needs to be updated.
Currently we've found 2 solutions to Problem 1:

Option 1: JARs on Flink

All JARs uploaded to Flink remain available on the Flink cluster (this requires persisting web.upload.dir). This allows us to determine the running JAR as follows (a sketch of the lookup follows the steps):

  1. Get http://localhost:8081/jobs/[JOBID] (get "name")
  2. Find the JAR that is running:
    a. List all JARs (http://localhost:8081/jars, get "files.id")
    b. For each JAR, find its name (http://localhost:8081/jars/[JAR_ID]/plan, get "plan.name")
    c. Take the JAR_ID whose plan name matches the job name (the JAR_ID is also the name of the file stored by Flink)
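
A sketch of that lookup against the Flink REST API (the three endpoints and the "name"/"files.id"/"plan.name" fields are the ones listed above; the Go helpers themselves are hypothetical):

```go
package deployer

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal views of the Flink REST API responses used above.
type jobDetails struct {
	Name string `json:"name"`
}

type jarList struct {
	Files []struct {
		ID string `json:"id"`
	} `json:"files"`
}

type jarPlan struct {
	Plan struct {
		Name string `json:"name"`
	} `json:"plan"`
}

// getJSON fetches a URL and decodes the JSON response into out.
func getJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// findRunningJar implements steps 1 and 2: resolve the running job's
// name, then scan all uploaded JARs for one whose plan name matches.
// The returned JAR_ID is also the filename Flink stores in web.upload.dir.
func findRunningJar(baseURL, jobID string) (string, error) {
	var job jobDetails
	if err := getJSON(fmt.Sprintf("%s/jobs/%s", baseURL, jobID), &job); err != nil {
		return "", err
	}
	var jars jarList
	if err := getJSON(baseURL+"/jars", &jars); err != nil {
		return "", err
	}
	for _, f := range jars.Files {
		var plan jarPlan
		if err := getJSON(fmt.Sprintf("%s/jars/%s/plan", baseURL, f.ID), &plan); err != nil {
			continue // skip JARs whose plan cannot be retrieved
		}
		if plan.Plan.Name == job.Name {
			return f.ID, nil
		}
	}
	return "", fmt.Errorf("no uploaded JAR matches job name %q", job.Name)
}
```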

Option 2: JARs remote

All JARs built are stored in a remote repository (e.g. Artifactory). The user has to provide the JAR they want to use as a fallback in case the update fails.
To ensure that the fallback JAR is the same as the one currently running, we could check that the JARs match by comparing the job names they produce, and abort the update when they don't match (see the sketch below).
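
A sketch of that check, reusing the getJSON/jobDetails/jarPlan helpers from the Option 1 snippet and assuming the fallback JAR has already been uploaded to the cluster (its id coming back from POST /jars/upload); the function name is hypothetical:

```go
// verifyFallbackJar aborts the update when the user-supplied fallback
// JAR does not produce the same job name as the job currently running.
// Hypothetical sketch building on the Option 1 helpers above.
func verifyFallbackJar(baseURL, jobID, fallbackJarID string) error {
	var job jobDetails
	if err := getJSON(fmt.Sprintf("%s/jobs/%s", baseURL, jobID), &job); err != nil {
		return err
	}
	var plan jarPlan
	if err := getJSON(fmt.Sprintf("%s/jars/%s/plan", baseURL, fallbackJarID), &plan); err != nil {
		return err
	}
	if plan.Plan.Name != job.Name {
		return fmt.Errorf("fallback JAR job name %q does not match running job %q",
			plan.Plan.Name, job.Name)
	}
	return nil
}
```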
