Add support for ha, cleanup, refactor
Squashed commit of the following:

commit 52a9c113b6cfe8c2ada17213f96148855dfcefda
Author: Germano Eichenberg <[email protected]>
Date:   Mon Feb 14 10:42:47 2022 -0300

    Remove dead code

commit 02400a8
Author: Germano Eichenberg <[email protected]>
Date:   Mon Feb 14 10:38:47 2022 -0300

    Add call to increment requests_routed_received

commit d8b7f16
Author: Germano Eichenberg <[email protected]>
Date:   Sun Feb 13 01:26:02 2022 -0300

    Fix context cancelled errors, add logging to missing places

commit b509fea
Author: Germano Eichenberg <[email protected]>
Date:   Sun Feb 13 01:00:52 2022 -0300

    Remove context cancellation from globals

commit df5f912
Author: Germano Eichenberg <[email protected]>
Date:   Sat Feb 12 15:54:37 2022 -0300

    Implement graceful shutdowns

commit b430cb9
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:27:48 2022 -0300

    Re-add context cancellation on FireGlobalRequest

commit b4e753e
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:08:54 2022 -0300

    Remove timeout from routeRequest

commit 87a2ce9
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 15:01:54 2022 -0300

    Remove context call from FireGlobalRequest

commit f557448
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 14:38:45 2022 -0300

    Add more context around log errors for main handlers

commit d3650c5
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 12:34:23 2022 -0300

    Attempt to fix context canceled errors

commit 4da3dcd
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 12:01:12 2022 -0300

    Tweak client config to be more permissive

commit 9a89bd3
Author: Germano Eichenberg <[email protected]>
Date:   Fri Feb 11 11:01:01 2022 -0300

    More readme work

commit fdf8d31
Author: Germano Eichenberg <[email protected]>
Date:   Thu Feb 10 18:01:40 2022 -0300

    Fix small typo in readme

commit 1d09b86
Author: Germano Eichenberg <[email protected]>
Date:   Thu Feb 10 17:57:23 2022 -0300

    Move HTTP logic into QueueManager, add support for routing requests to other nodes, HA mvp

commit ae897b3
Author: Germano Eichenberg <[email protected]>
Date:   Wed Feb 9 20:14:56 2022 -0300

    Working implementation of memberlist

commit 4cf9d37
Author: Germano Eichenberg <[email protected]>
Date:   Tue Feb 8 19:30:19 2022 -0300

    Cleanup, move responsibilities of queue creation from main to queue, handle 401s and invalid tokens in one place
germanoeich committed Feb 14, 2022
1 parent ccc6fac commit aa0bcc8
Showing 17 changed files with 719 additions and 562 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,6 +1,6 @@
.idea/
nirn-proxy.*
nirn-proxy
nirn-proxy.exe
.env
*.txt
*.log
27 changes: 22 additions & 5 deletions CONFIG.md
@@ -1,15 +1,17 @@
# Config
All variables are optional unless stated otherwise

##### LOG_LEVEL
Logrus log level. Passed directly to [ParseLevel](https://github.com/sirupsen/logrus/blob/master/logrus.go#L25-L45)

##### PORT
The port to listen for requests on

##### METRICS_PORT
The port for to listen on for metrics
The port to listen for metrics requests on

##### ENABLE_METRICS
Wether to enable and register metrics. Disabling may improve resource usage
Toggle to enable and register metrics. Disabling may improve resource usage

##### ENABLE_PPROF
Enables the performance profiling handler. Read more [here](https://github.com/google/pprof/blob/master/doc/README.md)
@@ -21,12 +23,27 @@ Decreasing this will improve memory usage, but beware that once a channel buffer
##### OUTBOUND_IP
The local address to use when firing requests to discord.

Example: `"120.121.122.123"`
Example: `120.121.122.123`

##### BIND_IP
The IP to bind the HTTP server on (both for requests and metrics). 127.0.0.1 will only allow requests coming from the loopback interface. Useful for preventing the proxy from being accessed from outside of LAN, for example.

Example: `"10.0.0.42"` - Would only listen on LAN
Example: `10.0.0.42` - Would only listen on LAN

##### REQUEST_TIMEOUT
Defines the amount of time the proxy will wait for a response from discord. Does not include time waiting for ratelimits to clear.
Defines the amount of time the proxy will wait for a response from discord. Does not include time waiting for ratelimits to clear.

##### CLUSTER_PORT
Sets the port used to communicate with other cluster members. Defaults to 7946.

##### CLUSTER_MEMBERS
Comma-separated list of stable/known members of the cluster. It does not need to include all members, since a gossip protocol is used for discovery. Each entry may include a port along with the address; entries without one use CLUSTER_PORT. This variable overrides CLUSTER_DNS.

Example: `10.0.0.2,10.0.0.3:7244`
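
The port fallback described above could be sketched as follows. This is a minimal illustration, not the proxy's actual parsing code: `parseClusterMembers` is a hypothetical helper, and the default port is passed in as a plain string.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseClusterMembers is a hypothetical helper (not taken from the proxy's source).
// It splits a CLUSTER_MEMBERS-style value on commas and appends defaultPort
// to every entry that does not carry its own port.
func parseClusterMembers(raw string, defaultPort string) []string {
	var members []string
	for _, entry := range strings.Split(raw, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // tolerate trailing commas and empty entries
		}
		// SplitHostPort returns an error when the entry has no port,
		// in which case the default cluster port is used.
		if _, _, err := net.SplitHostPort(entry); err != nil {
			entry = net.JoinHostPort(entry, defaultPort)
		}
		members = append(members, entry)
	}
	return members
}

func main() {
	// The first entry falls back to the default port, the second keeps its own.
	fmt.Println(parseClusterMembers("10.0.0.2,10.0.0.3:7244", "7946"))
}
```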

##### CLUSTER_DNS
DNS address that resolves to multiple members of the cluster. It does not need to include all members, since a gossip protocol is used for discovery. While this is the recommended method of discovery for Kubernetes or similar, it comes with one limitation: all nodes must use the same port for cluster communication, since DNS can't return port information. The port used by the proxy for requests is broadcast automatically and doesn't need to be the same across nodes.

If using Kubernetes, create a headless service and use it here for easy clustering.

Example: `nirn-headless.default.svc.cluster.local` or `nirn.mydomain.com`
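
To illustrate why DNS discovery forces a single cluster port, here is a rough sketch of turning resolved addresses into member addresses. `membersFromDNS` is a hypothetical name, and the lookup function is injected only to keep the example testable; the proxy's real discovery code may look quite different.

```go
package main

import (
	"fmt"
	"net"
)

// membersFromDNS is a hypothetical helper (an assumption, not the proxy's code).
// A DNS lookup only yields addresses, so the same clusterPort has to be
// attached to every resolved member.
func membersFromDNS(lookup func(host string) ([]string, error), name, clusterPort string) ([]string, error) {
	addrs, err := lookup(name)
	if err != nil {
		return nil, fmt.Errorf("resolving %s: %w", name, err)
	}
	members := make([]string, 0, len(addrs))
	for _, addr := range addrs {
		members = append(members, net.JoinHostPort(addr, clusterPort))
	}
	return members, nil
}

func main() {
	// net.LookupHost is the standard library resolver; "localhost" is used
	// here only so the example runs without a cluster.
	members, err := membersFromDNS(net.LookupHost, "localhost", "7946")
	fmt.Println(members, err)
}
```
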
64 changes: 47 additions & 17 deletions README.md
@@ -1,7 +1,11 @@
# Nirn-proxy
Nirn is a transparent & dynamic HTTP proxy that handles Discord ratelimits for you and exports meaningful prometheus metrics. It is considered beta software but is being used in production by [Dyno](https://dyno.gg) on the scale of hundreds of requests per second.
Nirn-proxy is a highly available, transparent & dynamic HTTP proxy that handles Discord ratelimits for you and exports meaningful prometheus metrics. It is considered beta software but is being used in production by [Dyno](https://dyno.gg) on the scale of hundreds of requests per second.

It is designed to be minimally invasive and exploits common library patterns to make the adoption as simple as a URL change.

#### Features

- Highly available, horizontally scalable
- Transparent ratelimit handling, per-route and global
- Multi-bot support with automatic detection for elevated REST limits (big bot sharding)
- Works with any API version (Also supports using two or more versions for the same bot)
@@ -17,17 +21,20 @@ The proxy sits between the client and discord. Essentially, instead of pointing

Configuration options are

| Variable | Value | Default |
|-----------------|--------|---------|
| LOG_LEVEL | panic, fatal, error, warn, info, debug, trace | info |
| PORT | number | 8080 |
| METRICS_PORT | number | 9000 |
| ENABLE_METRICS | boolean| true |
| ENABLE_PPROF | boolean| false |
| BUFFER_SIZE | number | 50 |
| OUTBOUND_IP | string | "" |
| BIND_IP | string | 0.0.0.0 |
| REQUEST_TIMEOUT | number (milliseconds) | 5000 |
| Variable | Value | Default |
|-----------------|-----------------------------------------------|-------------------------|
| LOG_LEVEL | panic, fatal, error, warn, info, debug, trace | info |
| PORT | number | 8080 |
| METRICS_PORT | number | 9000 |
| ENABLE_METRICS | boolean | true |
| ENABLE_PPROF | boolean | false |
| BUFFER_SIZE | number | 50 |
| OUTBOUND_IP | string | "" |
| BIND_IP | string | 0.0.0.0 |
| REQUEST_TIMEOUT | number (milliseconds) | 5000 |
| CLUSTER_PORT | number | 7946 |
| CLUSTER_MEMBERS | string list (comma separated) | "" |
| CLUSTER_DNS | string | "" |

Information on each config var can be found [here](https://github.com/germanoeich/nirn-proxy/blob/main/CONFIG.md)

@@ -61,14 +68,37 @@ This will vary depending on your usage, how many unique routes you see, etc. For

### Metrics

| Key | Labels | Description |
|-------------------|----------------------------------------|------------------------------------------------|
|nirn_proxy_error | none | Counter for errors |
|nirn_proxy_requests| method, status, route, clientId | Histogram that keeps track of all request metrics|
|nirn_proxy_open_connections| none | Gauge for open client connections with the proxy|
| Key | Labels | Description |
|------------------------------------|----------------------------------------|------------------------------------------------------------|
|nirn_proxy_error | none | Counter for errors |
|nirn_proxy_requests | method, status, route, clientId | Histogram that keeps track of all request metrics |
|nirn_proxy_open_connections | none | Gauge for open client connections with the proxy |
|nirn_proxy_requests_routed_sent | none | Counter for requests routed to other nodes |
|nirn_proxy_requests_routed_received | none | Counter for requests received from other nodes |
|nirn_proxy_requests_routed_error | none | Counter for requests routed that failed |

Note: 429s can produce two statuses: 429 Too Many Requests or 429 Shared. The latter is only produced for requests that return with the x-ratelimit-scope header set to "shared", which means they don't count towards the Cloudflare firewall limit and thus should not be used for alerts, etc.

### High availability

The proxy can be run in a cluster by setting either `CLUSTER_MEMBERS` or `CLUSTER_DNS` env vars. When in cluster mode, all nodes are a suitable gateway for all requests and the proxy will route requests consistently using the bucket hash.
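
As a rough illustration of routing a bucket consistently to one node, the sketch below uses rendezvous (highest random weight) hashing. This technique is swapped in purely for illustration; the source does not say which hashing scheme the proxy uses, and `score`, `pickNode`, and the bucket key format are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score combines a node address and a bucket key into one hash value.
// Both names are hypothetical; the proxy's real hashing may differ.
func score(node, bucket string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(node))
	h.Write([]byte{0}) // separator so "ab"+"c" and "a"+"bc" hash differently
	h.Write([]byte(bucket))
	return h.Sum64()
}

// pickNode returns the node with the highest score for the bucket.
// Every node that sees the same member list picks the same owner, and
// removing a node that is not the owner leaves the owner unchanged.
func pickNode(nodes []string, bucket string) string {
	best, bestScore := "", uint64(0)
	for _, n := range nodes {
		s := score(n, bucket)
		if best == "" || s > bestScore || (s == bestScore && n < best) {
			best, bestScore = n, s
		}
	}
	return best
}

func main() {
	nodes := []string{"10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.4:8080"}
	fmt.Println(pickNode(nodes, "/channels/123/messages"))
}
```

With a scheme like this, only the buckets owned by a departing node move elsewhere, while the rest of the mapping stays stable as members change.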

It's recommended that all nodes are reachable through LAN. Please reach out if a WAN cluster is desired for your use case.

If a node fails, there is a brief period where it will be unhealthy but requests will still be routed to it. When these requests fail, the proxy will mock a 429 to send back to the user. The 429 will signal the client to wait 1s and will have a custom header `generated-by-proxy`. This is done in order to allow seamless retries when a member fails. If you want to back off instead, use the custom header to override your library's retry logic.
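
On the client side, detecting the mocked 429 could look roughly like the sketch below. Only the `generated-by-proxy` header name comes from the text above; the header's value is not specified, so the check tests for presence only. `isProxyGenerated429` and `proxyBackoff` are hypothetical names, and the exponential policy is just one possible choice.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// proxyHeader is the custom header the proxy sets on mocked 429s.
const proxyHeader = "generated-by-proxy"

// isProxyGenerated429 reports whether a 429 was mocked by the proxy rather
// than returned by Discord. The header value is unspecified, so only its
// presence is checked.
func isProxyGenerated429(status int, header http.Header) bool {
	return status == http.StatusTooManyRequests && len(header.Values(proxyHeader)) > 0
}

// proxyBackoff is a hypothetical policy: 1s, 2s, 4s, then capped at 8s.
func proxyBackoff(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 3 {
		attempt = 3
	}
	return time.Second << uint(attempt)
}

func main() {
	header := http.Header{}
	header.Set(proxyHeader, "true") // value chosen arbitrarily for the example
	if isProxyGenerated429(http.StatusTooManyRequests, header) {
		fmt.Println("proxy-generated 429, backing off for", proxyBackoff(2))
	}
}
```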

The cluster uses [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf), which is an [AP protocol](https://en.wikipedia.org/wiki/CAP_theorem), and is powered by HashiCorp's excellent [memberlist](https://github.com/hashicorp/memberlist) implementation.

Being an AP system means that the cluster will tolerate a network partition and needs no quorum to function. In case a network partition occurs, you'll have two clusters running independently, which may or may not be desirable. Configure your network accordingly.

In case you want to specifically target a node (e.g., for troubleshooting), set the `nirn-routed-to` header on the request. The value doesn't matter. This will prevent the node from routing the request to another node.
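
A request pinned to a specific node might be built like this. The node address and path are made-up examples, and `newPinnedRequest` is a hypothetical helper; only the `nirn-routed-to` header name comes from the text above.

```go
package main

import (
	"fmt"
	"net/http"
)

// newPinnedRequest is a hypothetical helper that builds a request which the
// receiving node should handle itself instead of routing it onwards.
func newPinnedRequest(method, url string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	// The value is irrelevant according to the README; "1" is arbitrary.
	req.Header.Set("nirn-routed-to", "1")
	return req, nil
}

func main() {
	// Assumed node address and API path, for illustration only.
	req, err := newPinnedRequest(http.MethodGet, "http://10.0.0.2:8080/api/v9/gateway")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("nirn-routed-to"))
}
```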

During recovery periods or when nodes join/leave the cluster, you might notice increased 429s. This is expected since the hashing table is changing as members change. Once the cluster settles into a stable state, it'll go back to normal.

Global ratelimits are handled by a single node on the cluster, but this affinity is soft. There is no concept of a leader or elections, and if this node leaves, the cluster will simply pick a new one. This is a bottleneck and might increase tail latency, but the other options were either too complex, required external storage, or would require quorum for the proxy to function. Webhooks and other requests with no token bypass this mechanism completely.

The best deployment strategy for the cluster is to kill nodes one at a time, preferably with the replacement node already up.

### Profiling

The proxy can be profiled at runtime by enabling the ENABLE_PPROF flag and browsing to `http://ip:7654/debug/pprof/`