This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

Document high availability for Lita #159

Open
t33chong opened this issue Dec 2, 2015 · 5 comments

@t33chong
Contributor

t33chong commented Dec 2, 2015

@esigler @ranjib and I have been talking about how we can ensure that our Lita instance keeps running in the event of a host failure. It would be very helpful to have some information regarding a recommended deployment scenario for a high availability setup. I'll let them chime in with specific issues so that we can keep track of this discussion publicly.

@brodock
Contributor

brodock commented Dec 3, 2015

Redis HA can be handled by using Sentinel. For Lita itself, I have no idea :/

@ranjib

ranjib commented Dec 3, 2015

@brodock We have the same understanding. In the short term, Redis HA can be done using Sentinel. For the Lita server itself, I was thinking it would be nice to add Consul/etcd-based leader election support. That would let us run multiple Lita servers in different DCs with only one active at any given time, and the others could take over if the leader dies.
In the long term it would be nice to have some other kind of key-value store (CockroachDB, Cassandra, and Vitess all look good), because Redis clustering is limited for HA purposes. That might be overkill, though, since most of the Lita handlers we currently use store little state (and most of it can be recalculated).

@jimmycuadra
Collaborator

I mentioned this briefly to Tristan the other day, but the way I would approach this is the way the podmaster program works for the Kubernetes scheduler and controller manager. It uses a distributed lock via etcd to ensure that only one instance of the application is running at a time and that another one starts up if the one that was running stops. In short:

Each host periodically attempts to set the value of some key K to its own hostname with a TTL of T. It does the set via an atomic compare-and-swap to avoid race conditions. For each host H, there are three possible results:

  1. H is not running Lita. The key is successfully set because it did not exist. That means that H is the master and should start Lita.
  2. H is currently running Lita and the key already exists with the correct value. The TTL (T) is simply refreshed. Lita continues running.
  3. H is not running Lita and the key already exists with another host as the value. H does nothing.

This could probably be implemented with Redis instead of etcd; Redis also supports the necessary atomic operations (e.g. SET with NX and a TTL).
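
A minimal sketch of that loop using redis-py, assuming a hypothetical key name, TTL, and start/stop hooks (this is not actual Lita or podmaster code):

```python
# Leader-election sketch; "lita:leader", the TTL, and the
# start_lita/stop_lita callbacks are all placeholders.
import socket
import time

import redis

KEY = "lita:leader"   # hypothetical lock key
TTL = 30              # seconds before the lock expires if not refreshed
HOST = socket.gethostname()

r = redis.Redis()

# Refresh the TTL only if we still own the key (kept atomic via Lua).
refresh = r.register_script("""
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('expire', KEYS[1], ARGV[2])
end
return 0
""")

def election_loop(start_lita, stop_lita):
    leader = False
    while True:
        if not leader:
            # Cases 1 and 3: try to claim the key. SET ... NX EX is atomic,
            # so only one host can succeed; everyone else does nothing.
            if r.set(KEY, HOST, nx=True, ex=TTL):
                leader = True
                start_lita()
        else:
            # Case 2: we hold the key, so just refresh its TTL.
            if not refresh(keys=[KEY], args=[HOST, TTL]):
                # We lost the key (e.g. after a long pause); step down.
                leader = False
                stop_lita()
        time.sleep(TTL / 3)
```

The same loop maps directly onto etcd's compare-and-swap writes with TTLs.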

As far as it being part of Lita itself, it would be possible, but I'd rather see it prototyped as either another program or a Lita plugin first.

For reference, documentation on the Kubernetes podmaster which does this can be found here: https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/admin/high-availability.md#master-elected-components. The code for the podmaster program itself is here: https://github.com/kubernetes/contrib/tree/be436560df6fa839fb92a2f88ae4c4b7da4e58e4/pod-master

@sciurus

sciurus commented May 1, 2016

If you want to do this with Consul, you can use the consul lock command to wrap starting Lita.

https://www.consul.io/docs/commands/lock.html
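
For example, `consul lock locks/lita lita` would hold a lock on the (hypothetical) `locks/lita` prefix and keep Lita running only while the lock is held, assuming `lita` is how you start the bot in your deployment. A rough sketch of the same idea using the python-consul client (the library, the prefix, and the command are assumptions, not anything from the linked docs):

```python
# Rough equivalent of wrapping Lita with `consul lock`, via python-consul.
# The "locks/lita" prefix and the `lita` command are hypothetical.
import socket
import subprocess
import time

import consul

c = consul.Consul()  # talks to the local Consul agent
session = c.session.create(name="lita-leader", ttl=15, lock_delay=0)

try:
    # Block until this host acquires the lock, renewing the session
    # so it doesn't expire while we wait.
    while not c.kv.put("locks/lita/.lock", socket.gethostname(), acquire=session):
        c.session.renew(session)
        time.sleep(5)
    bot = subprocess.Popen(["lita"])  # assumes `lita` starts the bot
    # Keep the session alive while Lita runs; if this process dies, the
    # session's TTL lapses and another host can take the lock.
    while bot.poll() is None:
        c.session.renew(session)
        time.sleep(5)
finally:
    c.session.destroy(session)
```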

@indirect

Probably relevant to this discussion: after solving the problem of making Lita itself HA, you'll also need a data store that offers replication and failover. Sentinel is probably okay as long as you don't care about losing data during failover, or possibly ending up with multiple Redis masters: https://aphyr.com/posts/287-asynchronous-replication-with-failover. You'll need a different data store if you need consistent data while the underlying stores fail over.
