Hi,
I've read the CRD docs but can't understand what it means when the job fails, and thus can't understand the purpose of restartPolicy. Can you kindly explain why the built-in checkpointing mechanism + HA is not enough to recover from failure, and why we need restartPolicy=FromSavepointOnFailure? If this property covers a completely different case, could you please explain it with an example?
Thanks!
Hi Ilia, I just want to share my thoughts. I believe the two operate at different levels:
checkpointing + HA (for a standalone cluster) is managed by Flink itself.
restartPolicy is managed by this flink-on-k8s-operator, i.e. at the k8s level (the code can be found here).
AFAIK, option 1 (checkpointing + HA) should be enough if it is configured correctly, for example by running 2 JMs and a ZK service. Option 2 (restartPolicy) is a good attempt to make use of what k8s offers. Judging by the git history, it may have been implemented quite early, when Flink's HA was not as good.
Besides, it is worth mentioning that the Flink community has also done some work on k8s HA, like this. And since 1.12, Flink even supports native k8s HA. I am also interested in whether this operator can support such usage.
Thanks for sharing!
I'm running Flink 1.14 with this operator, in per-job mode with 1 JM and k8s HA. I deleted the JM pod, k8s created a new one, and the job continued from where it had stopped. That led me to ask about the use cases for restartPolicy. Maybe you are right and it applies to older versions of Flink, but it would be great to know for sure.