Add support for ha, cleanup, refactor

Squashed commit of the following:

commit 52a9c113b6cfe8c2ada17213f96148855dfcefda
Author: Germano Eichenberg <[email protected]>
Date:   Mon Feb 14 10:42:47 2022 -0300

    Remove dead code

commit 02400a8
Author: Germano Eichenberg <[email protected]>
Date:   Mon Feb 14 10:38:47 2022 -0300

    Add call to increment requests_routed_received

commit d8b7f16
Author: Germano Eichenberg <[email protected]>
Date:   Sun Feb 13 01:26:02 2022 -0300

    Fix context cancelled errors, add logging to missing places

commit b509fea
Author: Germano Eichenberg <[email protected]>
Date:   Sun Feb 13 01:00:52 2022 -0300

    Remove context cancellation from globals

commit df5f912
Author: Germano Eichenberg <[email protected]>
Date:   Sat Feb 12 15:54:37 2022 -0300

    Implement graceful shutdowns

commit b430cb9
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:27:48 2022 -0300

    Re-add context cancellation on FireGlobalRequest

commit b4e753e
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:08:54 2022 -0300

    Remove timeout from routeRequest

commit 87a2ce9
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:01:54 2022 -0300

    Remove context call from FireGlobalRequest

commit f557448
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 14:38:45 2022 -0300

    Add more context around log errors for main handlers

commit d3650c5
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 12:34:23 2022 -0300

    Attempt to fix context canceled errors

commit 4da3dcd
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 12:01:12 2022 -0300

    Tweak client config to be more permissive

commit 9a89bd3
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 11:01:01 2022 -0300

    More readme work

commit fdf8d31
Author: Germano Eichenberg <[email protected]>
Date:   Thu Feb 10 18:01:40 2022 -0300

    Fix small typo in readme

commit 1d09b86
Author: Germano Eichenberg <[email protected]>
Date:   Thu Feb 10 17:57:23 2022 -0300

    Move HTTP logic into QueueManager, add support for routing requests to other nodes, HA mvp

commit ae897b3
Author: Germano Eichenberg <[email protected]>
Date:   Wed Feb 9 20:14:56 2022 -0300

    Working implementation of memberlist

commit 4cf9d37
Author: Germano Eichenberg <[email protected]>
Date:   Tue Feb 8 19:30:19 2022 -0300

    Cleanup, move responsibilities of queue creation from main to queue, handle 401s and invalid tokens in one place
germanoeich · Feb 14, 2022 · commit aa0bcc8 · 1 parent ccc6fac
Showing 17 changed files with 719 additions and 562 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,6 @@
 .idea/
+nirn-proxy.*
 nirn-proxy
-nirn-proxy.exe
 .env
 *.txt
 *.log
diff --git a/CONFIG.md b/CONFIG.md
@@ -1,15 +1,17 @@
 # Config
+All variables are optional unless stated otherwise
+
 ##### LOG_LEVEL
 Logrus log level. Passed directly to [ParseLevel](https://github.com/sirupsen/logrus/blob/master/logrus.go#L25-L45)
 
 ##### PORT
 The port to listen for requests on
 
 ##### METRICS_PORT 
-The port for to listen on for metrics
+The port to listen for metrics requests on
 
 ##### ENABLE_METRICS
-Wether to enable and register metrics. Disabling may improve resource usage
+Toggle to enable and register metrics. Disabling may improve resource usage
 
 ##### ENABLE_PPROF
 Enables the performance profiling handler. Read more [here](https://github.com/google/pprof/blob/master/doc/README.md)
@@ -21,12 +23,27 @@ Decreasing this will improve memory usage, but beware that once a channel buffer
 ##### OUTBOUND_IP
 The local address to use when firing requests to discord.
 
-Example: `"120.121.122.123"`
+Example: `120.121.122.123`
 
 ##### BIND_IP
 The IP to bind the HTTP server on (both for requests and metrics). 127.0.0.1 will only allow requests coming from the loopback interface. Useful for preventing the proxy from being accessed from outside of LAN, for example.
 
-Example: `"10.0.0.42"` - Would only listen on LAN
+Example: `10.0.0.42` - Would only listen on LAN
 
 ##### REQUEST_TIMEOUT
-Defines the amount of time the proxy will wait for a response from discord. Does not include time waiting for ratelimits to clear.
+Defines the amount of time the proxy will wait for a response from discord. Does not include time waiting for ratelimits to clear.
+
+##### CLUSTER_PORT
+Sets the port that's going to be used to communicate with other cluster members. Default 7946
+
+##### CLUSTER_MEMBERS
+Comma separated list of stable/known members of the cluster. Does not need to include all members, a gossip protocol is used for discovery. You may include a port along with the address and if you don't, CLUSTER_PORT is used. This variable overrides CLUSTER_DNS.
+
+Example: `10.0.0.2,10.0.0.3:7244`
+
+##### CLUSTER_DNS
+DNS address that will resolve to multiple members of the cluster. Does not need to include all members, a gossip protocol is used for discovery. While this is the recommended method of discovery for Kubernetes or similar, it does come with a limitation, which is that all nodes must use the same port for communication since DNS can't return port information. The port used by the proxy for requests is broadcast automatically and doesn't need to be the same across nodes.
+
+If using Kubernetes, create a headless service and use it here for easy clustering.
+
+Example: `nirn-headless.default.svc.cluster.local` or `nirn.mydomain.com`
diff --git a/README.md b/README.md
@@ -1,7 +1,11 @@
 # Nirn-proxy
-Nirn is a transparent & dynamic HTTP proxy that handles Discord ratelimits for you and exports meaningful prometheus metrics. It is considered beta software but is being used in production by [Dyno](https://dyno.gg) on the scale of hundreds of requests per second.
+Nirn-proxy is a highly available, transparent & dynamic HTTP proxy that handles Discord ratelimits for you and exports meaningful prometheus metrics. It is considered beta software but is being used in production by [Dyno](https://dyno.gg) on the scale of hundreds of requests per second.
+
+It is designed to be minimally invasive and exploits common library patterns to make the adoption as simple as a URL change.
 
 #### Features
+
+- Highly available, horizontally scalable
 - Transparent ratelimit handling, per-route and global
 - Multi-bot support with automatic detection for elevated REST limits (big bot sharding)
 - Works with any API version (Also supports using two or more versions for the same bot)
@@ -17,17 +21,20 @@ The proxy sits between the client and discord. Essentially, instead of pointing
 
 Configuration options are
 
-| Variable        | Value  | Default |
-|-----------------|--------|---------|
-| LOG_LEVEL       | panic, fatal, error, warn, info, debug, trace | info |
-| PORT            | number | 8080    |
-| METRICS_PORT    | number | 9000    |
-| ENABLE_METRICS  | boolean| true    |
-| ENABLE_PPROF    | boolean| false   |
-| BUFFER_SIZE     | number | 50      |
-| OUTBOUND_IP     | string | ""      |
-| BIND_IP         | string | 0.0.0.0 |
-| REQUEST_TIMEOUT | number (milliseconds) | 5000    |
+| Variable        | Value                                         | Default                 |
+|-----------------|-----------------------------------------------|-------------------------|
+| LOG_LEVEL       | panic, fatal, error, warn, info, debug, trace | info                    |
+| PORT            | number                                        | 8080                    |
+| METRICS_PORT    | number                                        | 9000                    |
+| ENABLE_METRICS  | boolean                                       | true                    |
+| ENABLE_PPROF    | boolean                                       | false                   |
+| BUFFER_SIZE     | number                                        | 50                      |
+| OUTBOUND_IP     | string                                        | ""                      |
+| BIND_IP         | string                                        | 0.0.0.0                 |
+| REQUEST_TIMEOUT | number (milliseconds)                         | 5000                    |
+| CLUSTER_PORT    | number                                        | 7946                    |
+| CLUSTER_MEMBERS | string list (comma separated)                 | ""                      |
+| CLUSTER_DNS     | string                                        | ""                      |
 
 Information on each config var can be found [here](https://github.com/germanoeich/nirn-proxy/blob/main/CONFIG.md)
 
@@ -61,14 +68,37 @@ This will vary depending on your usage, how many unique routes you see, etc. For
 
 ### Metrics
 
-| Key               | Labels                                 | Description                                    |
-|-------------------|----------------------------------------|------------------------------------------------|
-|nirn_proxy_error   | none                                   | Counter for errors                             |
-|nirn_proxy_requests| method, status, route, clientId        | Histogram that keeps track of all request metrics|
-|nirn_proxy_open_connections| none                           | Gauge for open client connections with the proxy|
+| Key                                | Labels                                 | Description                                                |
+|------------------------------------|----------------------------------------|------------------------------------------------------------|
+|nirn_proxy_error                    | none                                   | Counter for errors                                         |
+|nirn_proxy_requests                 | method, status, route, clientId        | Histogram that keeps track of all request metrics          |
+|nirn_proxy_open_connections         | none                                   | Gauge for open client connections with the proxy           |
+|nirn_proxy_requests_routed_sent     | none                                   | Counter for requests routed to other nodes                 |
+|nirn_proxy_requests_routed_received | none                                   | Counter for requests received from other nodes             |
+|nirn_proxy_requests_routed_error    | none                                   | Counter for requests routed that failed                    |
 
 Note: 429s can produce two status: 429 Too Many Requests or 429 Shared. The latter is only produced for requests that return with the x-ratelimit-scope header set to "shared", which means they don't count towards the cloudflare firewall limit and thus should not be used for alerts, etc.
 
+### High availability
+
+The proxy can be run in a cluster by setting either `CLUSTER_MEMBERS` or `CLUSTER_DNS` env vars. When in cluster mode, all nodes are a suitable gateway for all requests and the proxy will route requests consistently using the bucket hash. 
+
+It's recommended that all nodes are reachable through LAN. Please reach out if a WAN cluster is desired for your use case.
+
+If a node fails, there is a brief period where it will be unhealthy but requests will still be routed to it. When these requests fail, the proxy will mock a 429 to send back to the user. The 429 will signal the client to wait 1s and will have a custom header `generated-by-proxy`. This is done in order to allow seamless retries when a member fails. If you want to backoff, use the custom header to override your lib retry logic.
+
+The cluster uses [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf), which is an [AP protocol](https://en.wikipedia.org/wiki/CAP_theorem), and is powered by HashiCorp's excellent [memberlist](https://github.com/hashicorp/memberlist) implementation.
+
+Being an AP system means that the cluster will tolerate a network partition and needs no quorum to function. In case a network partition occurs, you'll have two clusters running independently, which may or may not be desirable. Configure your network accordingly.
+
+In case you want to specifically target a node (e.g., for troubleshooting), set the `nirn-routed-to` header on the request. The value doesn't matter. This will prevent the node from routing the request to another node.
+
+During recovery periods or when nodes join/leave the cluster, you might notice increased 429s. This is expected since the hashing table is changing as members change. Once the cluster settles into a stable state, it'll go back to normal.
+
+Global ratelimits are handled by a single node on the cluster, however this affinity is soft. There is no concept of leader or elections and if this node leaves, the cluster will simply pick a new one. This is a bottleneck and might increase tail latency, but the other options were either too complex, required an external storage, or would require quorum for the proxy to function. Webhooks and other requests with no token bypass this mechanism completely.
+
+The best deployment strategy for the cluster is to kill nodes one at a time, preferably with the replacement node already up.
+
 ### Profiling
 
 The proxy can be profiled at runtime by enabling the ENABLE_PPROF flag and browsing to `http://ip:7654/debug/pprof/`