Skip to content
This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Got error 'Framework has been removed' on restart #44

Open
fil1o opened this issue Oct 10, 2017 · 14 comments
Open

Got error 'Framework has been removed' on restart #44

fil1o opened this issue Oct 10, 2017 · 14 comments

Comments

@fil1o
Copy link

fil1o commented Oct 10, 2017

I'm using DC/OS Version 1.10.0 with dcos-flink-2-11:1.3.1-1.1
Flink is configured to run in HA mode.

high-availability=zookeeper
high-availability.zookeeper.quorum=ctrl1.filio.bg:2181,ctrl2.filio.bg:2181,ctrl3.filio:2181 high-availability.zookeeper.storageDir=/mnt/glusterfs/flink
zookeeper.sasl.disable=true
state.checkpoints.dir=file:///mnt/glusterfs/flink/checkpoints
state.savepoints.dir=file:///mnt/glusterfs/flink/savepoints

Flink runs fine until Marathon tries to restart it (after a crash or a manual restart).

This is the error.

I1010 10:37:57.033221 199 sched.cpp:1187] Got error 'Framework has been removed'
I1010 10:37:57.033287 199 sched.cpp:2055] Asked to abort the driver
I1010 10:37:57.033898 199 sched.cpp:1233] Aborting framework 33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002

33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002 is the ID of the previous instance of Flink found in Zookeeper/flink/default/mesos-workers/frameworkId
If I delete the key from Zookeeper Flink starts normally and all previously running tasks are restored.

@joerg84
Copy link
Contributor

joerg84 commented Oct 10, 2017 via email

@todormazgalov
Copy link

Yeah, I've been running into the same problem recently.

@rradnev
Copy link

rradnev commented Nov 3, 2017

We got the same exception in our setup. :(

@fil1o
Copy link
Author

fil1o commented Dec 15, 2017

We still get this error after upgrading to version 1.4.0.
We cannot go into production until this issue has been resolved. There is no high availability this way. We have to manually delete entries from Zookeeper to get Flink running after a crash.

@EronWright
Copy link

The error message means that failover timeout has been exceeded. When a framework fails, Mesos keeps that framework's tasks running to allow it time to recover. Eventually a timeout is reached and Mesos kills the tasks. If the framework later reconnects, Mesos responds with "Framework has been removed" - a problem that can be solved by cleaning out the ZK state.

The framework timeout value is configured with the Flink config option: mesos.failover-timeout (default: 10 minutes).

A possible improvement would be to have Flink automatically clean the ZK state and retry in this situation.

@fil1o, @todormazgalov, and @rradnev - does this characterization agree with your observations? Thanks.

@fil1o
Copy link
Author

fil1o commented Dec 20, 2017

It seem about right. I will make one clarification.
Even if I suspend Flink manually
-Suspend (set instances to 0)
-Wait for wait for all task managers to shut down
-Resume Flink
I still get "Framework has been removed" unless I clean up ZK .

@EronWright
Copy link

@fil1o what you describe is consistent with my explanation. Mesos shuts down the task managers once the failover timeout is reached. When the job manager is later resumed, it receives the 'removed' (due to timeout) message.

The remaining question is whether the failover timeout value is being configured properly.

@asicoe
Copy link

asicoe commented Jan 28, 2018

Yep we're hitting this as well with DCOS 1.10 and dcos-flink-2-11:1.3.1-1.1 and Flink is configured to run in HA mode.
So, to understand things:

  1. There is no current workaround except for increasing the mesos.failover-timeout to something extremely high? Any other ways people solved this?
  2. Is there work done to solve this issue? PR etc?

@fil1o
Copy link
Author

fil1o commented Jan 29, 2018

We implemented our own workaround. We use zkCli.sh to clean up frameworkId from zookeeper on every start. You will lose all task managers on job restart or job manager failure, but at least Flink can start up and resume the job.

#! /bin/bash
[[ ":$PATH:" != *":/opt/mesosphere/bin:"* ]] && PATH="${PATH}:/opt/mesosphere/bin"
/usr/share/zookeeper/bin/zkCli.sh -server zookeeper.address.here:2181 <<EOF
delete /flink/default/mesos-workers/frameworkId
quit
EOF

Hint. zkCli.sh can be found inside Flink container.

@asicoe
Copy link

asicoe commented Jan 29, 2018

Thanks @fil1o. I have a few questions:

  1. When do you call the script?
    I mean in order to make sure it gets executed when a job manager fails?
    Are u using the vanilla dcos service?
  2. Above you are implying you need to call the script on job restart. You mean automatic restart? Doesn't flink handle that for us? In our case, we only get the problem mentioned in this issue if the job manager gets restarted. Once the JobManager is up and the task managers are up and the jobs are deployed it seems it recovers from any job automatic restart or task manager failure.
  3. When u say it can start-up and resume the job, it will be the same job with the same state as before right?

@fil1o
Copy link
Author

fil1o commented Jan 29, 2018

  1. We use the original dcos service, but we edit the configuration.
    Edit->Service->Command set to rm_framework_id.sh && /sbin/init.sh
    rm_framework_id.sh need to be on the same path on every worker. Best use a destributed file system and mount the folder to flink container.
  2. The above will call the script every time a new job manager is started.
  3. Right. As long as you have a checkpoint to resume from.

@EronWright
Copy link

It is obviously quite unfortunate to lose the TMs in the case of JM failover, but that workaround does make sense. The ultimate fix is to have Flink respond to this error by clearing out its own state such that it registers as a new framework. Mesos isn't prescriptive about how the framework should react to this situation. I will open a ticket and contribute a fix for 1.5.0 (will update this comment with a bug #).

@EronWright
Copy link

Opened FLINK-8541.

@asicoe
Copy link

asicoe commented Feb 2, 2018

Just following up after we've done more digging around.

It seems we were getting this error when a task manager died, as opposed to what I have written in my previous message. What happened then was that the job manager was also killed (Flink Mesos framework was removed) because the minimum number of task managers available was no longer met. When the job manager is restarted it cannot re-submit the framework under the same id (which it recovered from zookeeper) since it was removed previously.

Setting mesos.maximum-failed-tasks to -1 and increasing mesos.failover-timeout to 1hr seems to have fixed things for us in case of both job manager container loss or task manager container loss or service restart, without having to add the above delete script.

By the way, the above delete script is leaking task managers (with mesos.failover-timeout set to 1hr and mesos.maximum-failed-tasks left as the default). In case the job manager container gets killed when it comes back up it will create a new framework with a new id and thus a new task managers but the old ones are still around and reconnect to the cluster.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants