Recovery mechanism on update #15
**Initial investigation results**

We (@mrooding and I) found some issues with this. Essentially it boils down to the following: in order to recover, we first need to know which JAR the old job was started from.

That is difficult since the Flink API doesn't return this information anywhere, but it could be done by, for instance, storing all JARs uploaded to the Flink cluster and looking for a matching job name (see option 1 below).

**Solution to Problem 1**

Currently our deployer makes the following assumptions:
We use this assumption to determine which of the potentially multiple running jobs needs to be updated.

**Option 1: JARs on Flink**

All JARs uploaded to Flink stay available on the Flink cluster (this requires persisting the uploaded JARs).

**Option 2: JARs remote**

All JARs that are built are stored in a remote repository (e.g. Artifactory). The user has to provide the JAR that they want to use as a fallback in case the update fails.
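For option 1, the deployer could ask the Flink REST API for all uploaded JARs and pick the one whose file name matches the job name. Below is a minimal sketch in Go, assuming the job manager is reachable at a `baseURL` such as `http://jobmanager:8081`, that uploaded JAR file names contain the job name, and that the most recently uploaded match is the right fallback; the function and matching heuristic are illustrative, not part of the current deployer.

```go
// A sketch of option 1: find a fallback JAR on the Flink cluster by job name.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// jarList mirrors the relevant part of Flink's GET /jars response.
type jarList struct {
	Files []struct {
		ID       string `json:"id"`
		Name     string `json:"name"`
		Uploaded int64  `json:"uploaded"`
	} `json:"files"`
}

// findFallbackJar returns the id of the most recently uploaded JAR whose file
// name contains jobName. The name-based match is the assumption behind option 1.
func findFallbackJar(baseURL, jobName string) (string, error) {
	resp, err := http.Get(baseURL + "/jars")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var jars jarList
	if err := json.NewDecoder(resp.Body).Decode(&jars); err != nil {
		return "", err
	}

	bestID := ""
	bestUploaded := int64(-1)
	for _, f := range jars.Files {
		if strings.Contains(f.Name, jobName) && f.Uploaded > bestUploaded {
			bestID = f.ID
			bestUploaded = f.Uploaded
		}
	}
	if bestID == "" {
		return "", fmt.Errorf("no uploaded JAR matches job name %q", jobName)
	}
	return bestID, nil
}

func main() {
	// Hypothetical usage: look up the JAR to fall back to for job "my-job".
	id, err := findFallbackJar("http://jobmanager:8081", "my-job")
	if err != nil {
		fmt.Println("fallback lookup failed:", err)
		return
	}
	// The old job could then be resubmitted via POST /jars/<id>/run.
	fmt.Println("fallback jar id:", id)
}
```

For option 2 the lookup would be the same in spirit, only against the remote repository instead of the Flink cluster, with the user supplying the fallback coordinates.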
When an update fails for whatever reason, redeploy the old job. For this we have to monitor how the new job behaves for a little while, to catch issues that appear just after startup (e.g. inconsistent state).
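Here is a minimal sketch of that watch period in Go, assuming the Flink REST API at a `baseURL`, a fixed grace period, a polling interval of a few seconds, and a particular set of "bad" job states; all of these are placeholder choices, not behaviour the deployer has today.

```go
// A sketch of the post-update watch period: poll the new job and report early failure.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// jobDetails mirrors the "state" field of Flink's GET /jobs/<jobid> response.
type jobDetails struct {
	State string `json:"state"`
}

// watchNewJob polls the newly started job for gracePeriod. It returns an error
// as soon as the job reaches a failure state, signalling the caller to redeploy
// the previous JAR (found via option 1 or 2 above).
func watchNewJob(baseURL, jobID string, gracePeriod time.Duration) error {
	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		resp, err := http.Get(fmt.Sprintf("%s/jobs/%s", baseURL, jobID))
		if err != nil {
			return fmt.Errorf("cannot reach Flink API: %w", err)
		}
		var details jobDetails
		err = json.NewDecoder(resp.Body).Decode(&details)
		resp.Body.Close()
		if err != nil {
			return fmt.Errorf("cannot parse job details: %w", err)
		}
		switch details.State {
		case "FAILING", "FAILED", "CANCELED":
			return fmt.Errorf("job %s entered state %s shortly after the update", jobID, details.State)
		}
		time.Sleep(5 * time.Second)
	}
	// The job survived the grace period; consider the update successful.
	return nil
}

func main() {
	// Hypothetical usage: watch the new job for two minutes after deployment.
	if err := watchNewJob("http://jobmanager:8081", "<new-job-id>", 2*time.Minute); err != nil {
		fmt.Println("update looks broken, redeploying the old JAR:", err)
		// Recovery would resubmit the fallback JAR here.
	}
}
```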