From a53832abb3d6d6373a4164b875db1f9d49beb9f8 Mon Sep 17 00:00:00 2001
From: David Xia
Date: Thu, 19 Sep 2019 12:23:24 -0400
Subject: [PATCH] Set cluster ID when ZK connection state is connected

The master and agent processes check that they can connect to ZooKeeper
with a cluster ID value. This cluster ID is passed in as a CLI switch on
startup. The code also checks whether the cluster ID ZK node exists and
stores the result in an `AtomicBoolean`. The `AtomicBoolean` is updated by
an `org.apache.zookeeper.Watcher` on the node itself and by a
`ConnectionStateListener` that fires when the connection state to ZK
changes.

My hypothesis is that when agents lose their connection to ZK entirely and
the connection then comes back, the `ConnectionState` is `CONNECTED`
instead of `RECONNECTED`. This would skip the conditional, and the
`AtomicBoolean` wouldn't get updated.

Helios agents that do not automatically recover have these logs for
`ConnectionState`. Notice that the last state is `CONNECTED`.

```
dxia@bad-host:~$ grep 'DefaultZooKeeperClient connection state change' /path/to/helios/info.log
2019-09-19T01:53:14.391+00:00 bad-host helios[21939]: DefaultZooKeeperClient connection state change - SUSPENDED
2019-09-19T01:53:33.511+00:00 bad-host helios[21939]: DefaultZooKeeperClient connection state change - LOST
2019-09-19T01:54:15.333+00:00 bad-host helios[21939]: DefaultZooKeeperClient connection state change - RECONNECTED
2019-09-19T03:15:16.298+00:00 bad-host helios[21939]: DefaultZooKeeperClient connection state change - SUSPENDED
2019-09-19T03:15:39.592+00:00 bad-host helios[21939]: DefaultZooKeeperClient connection state change - LOST
2019-09-19T08:11:09.721+00:00 bad-host helios[15902]: DefaultZooKeeperClient connection state change - CONNECTED
```

Some Helios agents that were fine also had `CONNECTED` as the last state
in their logs, though. Still, it seems like we should update the
`AtomicBoolean` in cases of `CONNECTED`, `RECONNECTED`, and `READ_ONLY`.
---
 .../servicescommon/coordination/DefaultZooKeeperClient.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/helios-services/src/main/java/com/spotify/helios/servicescommon/coordination/DefaultZooKeeperClient.java b/helios-services/src/main/java/com/spotify/helios/servicescommon/coordination/DefaultZooKeeperClient.java
index ec71541dc..1bdecc7f9 100644
--- a/helios-services/src/main/java/com/spotify/helios/servicescommon/coordination/DefaultZooKeeperClient.java
+++ b/helios-services/src/main/java/com/spotify/helios/servicescommon/coordination/DefaultZooKeeperClient.java
@@ -96,7 +96,7 @@ public void process(WatchedEvent event) {
       @Override
       public void stateChanged(CuratorFramework client, ConnectionState newState) {
         log.info("DefaultZooKeeperClient connection state change - {}", newState);
-        if (newState == ConnectionState.RECONNECTED) {
+        if (newState.isConnected()) {
           checkClusterIdExists(clusterId, "connectionStateListener");
         }
       }
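
The effect of the changed conditional can be sketched without Curator on the classpath. In the sketch below, the `State` enum is a hypothetical stand-in that is assumed to mirror Curator's `ConnectionState` and its `isConnected()` results (true for `CONNECTED`, `RECONNECTED`, and `READ_ONLY`). The names `ConnectionStateSketch`, `rechecksBefore`, and `rechecksAfter` are illustrative only and are not part of Helios or Curator.

```java
// Standalone sketch; it does not use Curator or Helios classes.
class ConnectionStateSketch {

  // Hypothetical stand-in, assumed to mirror Curator's ConnectionState.isConnected().
  enum State {
    CONNECTED(true),
    SUSPENDED(false),
    RECONNECTED(true),
    LOST(false),
    READ_ONLY(true);

    private final boolean connected;

    State(boolean connected) {
      this.connected = connected;
    }

    boolean isConnected() {
      return connected;
    }
  }

  // Condition before the patch: only RECONNECTED triggers the cluster ID check.
  static boolean rechecksBefore(State newState) {
    return newState == State.RECONNECTED;
  }

  // Condition after the patch: any connected state triggers the cluster ID check.
  static boolean rechecksAfter(State newState) {
    return newState.isConnected();
  }

  public static void main(String[] args) {
    // Print, for each state, whether the check would run before and after the patch.
    for (State s : State.values()) {
      System.out.printf("%-12s before=%-5b after=%b%n",
          s, rechecksBefore(s), rechecksAfter(s));
    }
  }
}
```

Under these assumptions, `CONNECTED` and `READ_ONLY` are the two states that newly trigger the check, while `SUSPENDED` and `LOST` still skip it.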