RFC: 1-click deploy #68
Comments
Hey @xphoniex, nice work 👍
I can imagine that there could be an essentials package, and then some optional ones.
Yeah, it's implied here that it would be […]. The issue with the monorepo state already exists currently, since the client services all read from the same state.
@xphoniex could you describe the topology of the cluster(s)? One small issue I could see is that we have both UDP and TCP services, and I know there is some limitation in k8s with having both on the same instance.
I'm also wondering where we keep track of the mappings between orgs and physical instances, i.e. IP addresses. Do we use DNS? For instance, right now, each org points to a DNS name via its ENS records. This DNS name in turn points to a physical address. How do you imagine this could work in the above scenario?
I think we're going to have to expose unique TCP/UDP ports, as HTTP traffic would be handled by nginx and routed according to subdomain. E.g. we'll have a wildcard DNS record for […]. Makes sense?
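A rough sketch of what the HTTP side of this could look like, assuming an nginx ingress controller and a hypothetical `*.orgs.example.com` wildcard record pointing at the load balancer (the domain, service name and port are placeholders):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical org; in practice the controller would create one Ingress per org.
const org = "org1";

// *.orgs.example.com (placeholder domain) is a wildcard DNS record pointing at
// the nginx ingress load balancer; each subdomain routes to that org's Service.
const ingress = new k8s.networking.v1.Ingress(`${org}-http`, {
    spec: {
        ingressClassName: "nginx",
        rules: [{
            host: `${org}.orgs.example.com`,
            http: {
                paths: [{
                    path: "/",
                    pathType: "Prefix",
                    backend: { service: { name: `${org}-node`, port: { number: 8080 } } },
                }],
            },
        }],
    },
});
```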
Hi @xphoniex 👋. Nice work on the RFC. I anticipate some problems with org-nodes living outside of Kubernetes trying to form a cluster with the ones living inside it due to NATs.
Regarding state storage, am I correct in assuming the plan is to use persistent volumes?
Hi @adaszko 👋 QUIC uses UDP under the hood, but it's not listed here; however, it is supported on Google Load Balancer. We might need to test it quickly before proceeding, I guess. (@cloudhead) Would NAT still be an issue if each […]?
Yes.
Such a test would be nice 👍
Let's take the case of an outside org-node making 2 requests to an inside one. Even if we have distinct ports for every org-node, the load balancer can direct the 2nd request to a different node than the 1st one unless we set some session affinity (respective functionalities in other clouds need to be researched). Out of the session affinity types listed on the linked website, the most apt seems to be the one based on HTTP headers. The target […]

It's not a trivial issue, unless of course the librad protocol implementation is so resilient that it can handle (1) violation of the assumption that 2 consecutive requests of any type addressed to the same (DNS name, port) pair actually reach the same node (that's the load balancer issue) and (2) IP addresses of nodes changing at unpredictable times due to Kubernetes rebalancing to different pods and/or self-healing (DNS names and ports stay the same, though). If the implementation is in fact so resilient that we have (1) and (2) covered, we can basically go ahead full steam with the K8s deployment. These 2 are a pretty tall order though, and we need to be aware that the whole cluster will perform worse (more errors, retries, higher latency) even if the implementation handles (1) and (2) 100% correctly.

@kim @FintanH I'm curious what you guys think, especially regarding the last paragraph from the protocol implementation standpoint.
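For reference, a minimal sketch of what client-IP session affinity could look like on a plain Kubernetes Service, sketched here with Pulumi (the name, selector and port are placeholders). Header-based affinity would instead be configured on the cloud load balancer itself, and neither type addresses (2) by itself:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch only: pin a client's connections to the same backend pod by source IP.
// The service name, selector labels and port are placeholders.
const svc = new k8s.core.v1.Service("org1-node-http", {
    spec: {
        selector: { app: "org1-node" },
        sessionAffinity: "ClientIP",
        sessionAffinityConfig: { clientIP: { timeoutSeconds: 10800 } },
        ports: [{ name: "http", port: 8080, targetPort: 8080 }],
    },
});
```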
I'm not so sure about this. We're gonna have a single […]
I'm still talking about the case of org-nodes living outside of Kubernetes trying to connect to the ones living inside it. Radicle is a peer-to-peer app (i.e. not a SaaS), so I think it's fair to say the nodes can connect from anywhere.
It is a bit unclear what you're trying to achieve here. I am assuming that you are aware that […].

You can surely use k8s to spin up singleton instances of a node (like a database, iirc […]). If your goal is to only cluster the HTTP/git interfaces of an org-node for availability reasons, you may be able to do that by mounting a shared volume containing the state. Since network-attached storage is both slow and does not necessarily exhibit POSIX semantics ([…]) …
For the latter to work, you would obviously need to be able to run those endpoints standalone, i.e. without spinning up the p2p stack.
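If we went the shared-volume route, here is a minimal sketch of the shape it could take, using Pulumi's Kubernetes provider (the storage size, image and labels are placeholders, and it assumes a volume plugin that actually supports `ReadOnlyMany` and a hypothetical image that serves only the HTTP/git endpoints):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Claim backed by network-attached storage; mounted read-only by the HTTP replicas.
const state = new k8s.core.v1.PersistentVolumeClaim("org1-state", {
    spec: {
        accessModes: ["ReadOnlyMany"],
        resources: { requests: { storage: "10Gi" } },
    },
});

// Several read-only HTTP/git replicas sharing the same monorepo state.
const http = new k8s.apps.v1.Deployment("org1-http", {
    spec: {
        replicas: 3,
        selector: { matchLabels: { app: "org1-http" } },
        template: {
            metadata: { labels: { app: "org1-http" } },
            spec: {
                containers: [{
                    name: "http",
                    image: "example/org-node-http:latest", // placeholder image
                    volumeMounts: [{ name: "state", mountPath: "/state", readOnly: true }],
                }],
                volumes: [{
                    name: "state",
                    persistentVolumeClaim: { claimName: state.metadata.name },
                }],
            },
        },
    },
});
```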
So as I understand the discussion, the LB is not really used to "balance load" between homogeneous backends, but rather just to route traffic? Or did I get this wrong?

In the simplest case, each org/user has 1 replica, and the cluster is heterogeneous, i.e. the nodes are not interchangeable. In the more advanced case, an org may want to deploy multiple replicas. Using a load balancer in front of those nodes would make sense in case we're worried that one of the nodes goes down, but I think we may find that having the clients directly connect to the individual replicas simplifies things, since this is supported by the protocol.
Just to be clear, the reason for choosing k8s here is that it standardizes our lifecycle management.
Sounds like the second option would be more appropriate at this point. Any comments?

Note: there are some limitations that apply to the VPS route; DigitalOcean, for example, doesn't allow more than 10 droplets unless you talk to support, and AWS is at 20. We also need to be careful we don't hit a hard limit on our DNS provider, as we're setting a new record per server.
You can use k8s' […]. You cannot expect any kind of transparent mapping of a single address to multiple, independent nodes to work as you'd expect. Even if you do that for just the HTTP endpoints and use session affinity, it will probably not yield the web experience you're after, because replication is by design asynchronous.

I'm not sure how important this is, though, as long as the node gets restarted automatically if something goes wrong. I get that what you want is essentially "virtual hosts", but I'm afraid this won't be possible on the p2p layer until HTTP/3 gets standardised. You could consider creating extra […]
We can bypass […].

The issue with this is that I'll end up writing some glue scripts to tie everything together, and we can't use the idle resources anymore, as kim mentioned. If no one has any objection to the design, I can start prototyping with Pulumi.
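As a starting point for that prototype, the per-org resources could be as simple as something like the following sketch (image, ports and labels are placeholders; the fixed per-org NodePort stands in for "a separate port per instance/org"):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical helper: the Deployment and Service for a single org.
// `p2pPort` is the unique, per-org port that would get published via DNS/ENS.
function orgNode(org: string, p2pPort: number) {
    const labels = { app: `${org}-node` };

    const deployment = new k8s.apps.v1.Deployment(`${org}-node`, {
        spec: {
            replicas: 1,
            selector: { matchLabels: labels },
            template: {
                metadata: { labels },
                spec: {
                    containers: [{
                        name: "org-node",
                        image: "example/org-node:latest", // placeholder image
                        ports: [
                            { name: "p2p", containerPort: 12345, protocol: "UDP" },
                            { name: "http", containerPort: 8080 },
                        ],
                    }],
                },
            },
        },
    });

    // The p2p (QUIC/UDP) port is exposed on a fixed, per-org NodePort so peers
    // can dial it directly; HTTP goes through the shared nginx ingress instead.
    const service = new k8s.core.v1.Service(`${org}-node`, {
        spec: {
            type: "NodePort",
            selector: labels,
            ports: [{ name: "p2p", protocol: "UDP", port: 12345, targetPort: 12345, nodePort: p2pPort }],
        },
    });

    return { deployment, service };
}

// e.g. orgNode("org1", 30001);
```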
The issue is not the NATing per se (well, fingers crossed), but that you need one LB (with one public IP) per instance if you want a fixed port (i.e. resolve using only the A record). Or maybe two, to support both TCP and UDP traffic.

If you have a way to assign a fixed port per instance, and clients can discover that, you could potentially have a single LB route all HTTP traffic, and another one route all QUIC traffic.
For that case we won't be using a fixed port; each instance/org will have a separate port.
> For that case we won't be using a fixed port; each instance/org will have a separate port.

Then how do you make that work in the browser?
HTTP traffic is simple: it's not p2p and would come through the LB. We just need to update the ingress rules with our controller.
Ok, I guess I don't know what you're talking about then. Good luck! :)
Does this help?
That makes sense, if the […]. There is no […]
Problem
We want to give interested parties a chance to try out Radicle without having to take care of their own infrastructure. The goal is to introduce a low-friction solution which is also reliable.
Proposal
This is not Radicle's core offering and we'd even encourage competition in this space. Thus our design should be transferable and as plug-and-play as possible.
We'll have an entrance contract that lists all contracts offering their service. Decisions would be made based on the number of subscribers to each contract and the price each one is asking.
Upon deciding to purchase, the user sends money to either `topUp(address org)` or `registerOrgThenTopUp()`. We might need a conversion from ether to stablecoin here, to simplify financing for service providers who have obligations in fiat. This will eventually emit a `NewTopUp` event containing the org address and probably more info like the expiry block. (After talking with Alexis, we decided to keep accounting in block terms on the contracts.)

Inside each k8s cluster, which ideally lives on a different cloud, we'll have a controller watching `NewTopUp` events for their respective contract. On new events, we create a Deployment and a Service for this new org, with the needed containers inside. If the org already exists, we simply change the expiry block, without affecting anything else.

We'll use IaC (Infrastructure as Code), with Terraform managing the cloud resources for us, so a potential third party can offer an alternative once they clone our infra code and fill in their own cloud keys.
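To make the controller part concrete, here is a minimal sketch of the event-watching loop, assuming ethers v5 and the older positional API of @kubernetes/client-node; the contract address, ABI, event fields, namespace and annotation key are all placeholders, not the final design:

```typescript
import { ethers } from "ethers";
import * as k8s from "@kubernetes/client-node";

// Placeholder ABI: assumes the contract emits something like NewTopUp(org, expiryBlock).
const abi = ["event NewTopUp(address indexed org, uint256 expiryBlock)"];

const provider = new ethers.providers.JsonRpcProvider(process.env.ETH_RPC_URL);
const contract = new ethers.Contract(process.env.TOPUP_CONTRACT!, abi, provider);

const kc = new k8s.KubeConfig();
kc.loadFromCluster(); // the controller itself runs inside the cluster
const apps = kc.makeApiClient(k8s.AppsV1Api);

contract.on("NewTopUp", async (org: string, expiryBlock: ethers.BigNumber) => {
    const name = `org-${org.toLowerCase()}`;
    const namespace = "orgs"; // placeholder namespace

    try {
        // Org already deployed: only bump the expiry annotation, leave everything else alone.
        const existing = await apps.readNamespacedDeployment(name, namespace);
        existing.body.metadata!.annotations = {
            ...existing.body.metadata!.annotations,
            "radicle/expiry-block": expiryBlock.toString(),
        };
        await apps.replaceNamespacedDeployment(name, namespace, existing.body);
    } catch {
        // New org: create its Deployment (the matching Service would be created alongside it).
        await apps.createNamespacedDeployment(namespace, {
            metadata: { name, annotations: { "radicle/expiry-block": expiryBlock.toString() } },
            spec: {
                replicas: 1,
                selector: { matchLabels: { app: name } },
                template: {
                    metadata: { labels: { app: name } },
                    spec: { containers: [{ name: "org-node", image: "example/org-node:latest" }] },
                },
            },
        });
    }
});
```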
Issues
- We are relying on the major clouds (AWS, GCP and Azure), which are all in the same jurisdiction. Others lack support in our automation tooling because of poor APIs or not enough interest from the community/maintainers.
- GeoDNS. Our p2p system, as is, can't optimize for latency-based routing. I think this needs to be solved at the protocol level, so we can ideally have two machines representing the same `org-node`, ideally in a write-write capacity but, if not, write-read.
- High availability. Same as above.
- Durability. Data can get lost. While in the worst-case scenario data can be partially or fully recovered by connecting to users' p2p nodes, having an HA solution would make our system more robust.