design doc for network policies

MaterializeInc · Oct 3, 2024 · dd0e74a
1 parent c80b754
commit dd0e74a
Showing 1 changed file with 148 additions and 0 deletions.
diff --git a/doc/developer/design/20240925_network_policies.md b/doc/developer/design/20240925_network_policies.md
@@ -0,0 +1,148 @@
+# Network Policies
+
+- Associated:
+  - https://github.com/MaterializeInc/database-issues/issues/7062
+  - https://github.com/MaterializeInc/database-issues/issues/4637
+  - https://github.com/MaterializeInc/materialize/pull/29739
+  - https://github.com/MaterializeInc/materialize/pull/29179
+
+## The Problem
+Customers would like to restrict access to Materialize by IP address.
+https://github.com/MaterializeInc/database-issues/issues/4637
+
+## Success Criteria
+- Customers can define a global policy that restricts access to their Materialize environments based on the IP address of the client attempting to connect.
+- Materialize support can unlock an environment where policies prevent access.
+- The console is aware of whether an environment's network policies are blocking a connection it is trying to make.
+- Users can adjust network policies via SQL and in the console.
+
+Nice to haves:
+- Per-role network policies.
+- Policies for sources/sinks.
+- Mitigations to prevent user lockouts, such as ensuring the user is not blocking their current IP.
+- Termination of active connections based on newly applied policies.
+- Policies for egress traffic.
+
+## Out of Scope
+- Preventing out-of-policy traffic from reaching `environmentd`.
+ (Note: this means network policies will not prevent DDoS attacks.)
+- Policy inheritance from associated roles, i.e., if 'bob' is a member of role 'eng',
+we will not apply policies from role 'eng' to 'bob'.
+- Restricting global API access.
+- Restricting access to Frontegg.
+
+## Solution Proposal
+
+#### Overview
+The proposed solution is to use role-based policies, with a default network policy that applies to any role without a policy of its own. This can initially be implemented as a single global default policy and later be extended to per-user and per-source/sink policies. The policy will be applied whenever a client attempts to establish a new connection with the coordinator.
+
+#### New Resources
+A new `NetworkPolicy` resource will be added to the catalog.
+```rust
+struct NetworkPolicy {
+    id: NetworkPolicyId,
+    name: String,
+    rules: Vec<NetworkPolicyRule>,
+}
+
+enum NetworkPolicyRule {
+    Ingress {
+        action: NetworkPolicyRuleAction,
+        source: IpNet,
+        comment: String,
+    },
+}
+
+enum NetworkPolicyRuleAction {
+    Allow,
+    // Deny - may be added later
+}
+```
+
+Users will be able to create `NetworkPolicy` resources directly. A user must have the `CREATENETWORKPOLICY` privilege to create, modify, or destroy network policies. Each network policy will be limited to 25 rules; this limit will be controlled by an LD flag. A `NetworkPolicyRule` can only be created as part of a policy. `NetworkPolicyRuleAction` will initially contain only an `Allow` variant, but it should be an enum so that a `Deny` variant can be added in the future. `NetworkPolicyRule` will likewise be an enum so that it can cover both ingress and egress rules; only ingress rules will be implemented initially. `NetworkPolicyRule::Ingress` will also contain a single `IpNet` and a free-text comment field. Comments have become standard for such rules and greatly increase the manageability and auditability of policies.
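The rule limit described above could be enforced when a policy is created or altered. The following is a minimal sketch under stated assumptions, not the actual catalog code: `Source`, `IngressRule`, `PolicyValidationError`, and `validate_rules` are hypothetical names, `Source` is a std-only stand-in for `IpNet`, and the limit is passed in as a parameter because the text says it is flag-controlled.

```rust
use std::net::IpAddr;

/// Rule cap named in the text. Assumption: the real value comes from a flag,
/// so callers pass the limit explicitly instead of hard-coding this constant.
const DEFAULT_MAX_RULES: usize = 25;

/// Hypothetical std-only stand-in for `IpNet`: an address plus a prefix length.
#[derive(Debug, Clone, PartialEq)]
struct Source {
    addr: IpAddr,
    prefix_len: u8,
}

/// Hypothetical flattened form of `NetworkPolicyRule::Ingress`.
#[derive(Debug, Clone, PartialEq)]
struct IngressRule {
    source: Source,
    comment: String,
}

#[derive(Debug, PartialEq)]
enum PolicyValidationError {
    TooManyRules { found: usize, max: usize },
    InvalidPrefix { rule_index: usize, prefix_len: u8 },
}

/// Checks the rule-count limit, then checks that every prefix length fits
/// its address family (at most 32 bits for IPv4, 128 bits for IPv6).
fn validate_rules(rules: &[IngressRule], max_rules: usize) -> Result<(), PolicyValidationError> {
    if rules.len() > max_rules {
        return Err(PolicyValidationError::TooManyRules {
            found: rules.len(),
            max: max_rules,
        });
    }
    for (rule_index, rule) in rules.iter().enumerate() {
        let width: u8 = match rule.source.addr {
            IpAddr::V4(_) => 32,
            IpAddr::V6(_) => 128,
        };
        if rule.source.prefix_len > width {
            return Err(PolicyValidationError::InvalidPrefix {
                rule_index,
                prefix_len: rule.source.prefix_len,
            });
        }
    }
    Ok(())
}
```

Returning a structured error rather than a boolean lets the SQL layer report which rule was rejected and what the current limit is.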
+
+Example syntax for creating a network policy:
+```sql
+CREATE NETWORK POLICY OFFICE_01 (
+ RULE ( ACTION=ALLOW, SOURCE='10.0.0.0/32', COMMENT='OFFICE IP - 2024-9-28' )
+);
+```
+
+Network policies will be assignable to roles, and eventually to sources and sinks as well. To assign a policy, a user must have usage privileges on the network policy as well as the privileges needed to modify the role.
+
+Example syntax for assigning a network policy to a role:
+```sql
+ALTER ROLE BOB SET network_policy = OFFICE_01;
+```
+* Policies can only be applied to login roles. There is no policy inheritance. Only the policy assigned to the role the user logged in as will be checked.
+
+Along with network policies, a new `SystemVar` (`default_network_policy`) will be added that points to a specific `NetworkPolicy`. This system var will only be modifiable by `mz_system` and superusers. If a resource does not have a network policy of its own, this default policy will be applied.
+
+Example syntax for updating `default_network_policy`:
+```sql
+ALTER SYSTEM SET default_network_policy = OFFICE_01;
+```
+
+### Policy Enforcement
+In `coord::handle_startup`, the connecting user will be inspected to see whether they have a network policy. If the user does not have a policy, the policy specified by `default_network_policy` will be applied to the user. If the user's `client_ip` is allowed by the policy, the connection will continue normally. If the `client_ip` is not allowed by the policy, `handle_startup` will return an `AdapterError::UserSessionsDenied`. This error will be handled by the protocol layer (`HTTP` or `pgwire`) to give the user an L7 response. In the case of `HTTP` this will be a `403 Forbidden`. Additionally, the response body will contain JSON data describing the failure, for example:
+```json
+{
+   "message": "session denied",
+   "code": "MZ011",
+   "detail": "Access denied for address 1.2.3.4"
+}
+```
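The startup check described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the coordinator's code: `Cidr4` is a hypothetical IPv4-only stand-in for `IpNet`, `check_startup` is a hypothetical helper rather than the real `handle_startup` signature, the `AdapterError` enum here only mirrors the variant named in the text, and role policies are modelled as a plain map keyed by role name.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Hypothetical IPv4-only stand-in for `IpNet`.
#[derive(Debug, Clone, Copy)]
struct Cidr4 {
    network: u32,
    prefix_len: u8,
}

impl Cidr4 {
    fn new(addr: Ipv4Addr, prefix_len: u8) -> Option<Self> {
        if prefix_len > 32 {
            return None;
        }
        // Normalize by clearing host bits so `contains` is a plain comparison.
        let network = u32::from(addr) & Self::mask(prefix_len);
        Some(Cidr4 { network, prefix_len })
    }

    fn mask(prefix_len: u8) -> u32 {
        // A /0 prefix needs an all-zero mask; shifting a u32 by 32 would overflow.
        if prefix_len == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(prefix_len))
        }
    }

    fn contains(&self, ip: Ipv4Addr) -> bool {
        (u32::from(ip) & Self::mask(self.prefix_len)) == self.network
    }
}

/// Simplified policy: only the allow-listed networks matter for this check.
struct NetworkPolicy {
    allow: Vec<Cidr4>,
}

/// Stand-in for the error variant named in the text.
#[derive(Debug, PartialEq)]
enum AdapterError {
    UserSessionsDenied { client_ip: Ipv4Addr },
}

/// Uses the role's own policy if it has one, otherwise the default policy,
/// and denies the session unless some allow rule contains the client IP.
fn check_startup(
    role_policies: &HashMap<String, NetworkPolicy>,
    default_policy: &NetworkPolicy,
    role: &str,
    client_ip: Ipv4Addr,
) -> Result<(), AdapterError> {
    let policy = role_policies.get(role).unwrap_or(default_policy);
    if policy.allow.iter().any(|net| net.contains(client_ip)) {
        Ok(())
    } else {
        Err(AdapterError::UserSessionsDenied { client_ip })
    }
}
```

Because only `Allow` rules exist in this sketch, a policy with no matching rule denies the connection; rule ordering would only start to matter if a `Deny` action were added.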
+
+When a 403 is returned with a `session denied` message, the console should report that network policies are blocking the user's access to their environment. Access restrictions will not be applied to the Global API or Frontegg; as such, their UI components may still load.
+
+### Handling lockout.
+To mitigate user lockouts, we will prevent users from altering their network policy in a way that would block their current `client_ip`. In the case of a lockout, we would need to modify an admin role using the `mz_system` user and temporarily set a network policy that either allows global access for that user or allows access from a particular IP they provide.
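The lockout mitigation amounts to re-running the allow check against the proposed rules before accepting a change. A minimal sketch follows, assuming hypothetical names (`Cidr`, `WouldLockOut`, `validate_policy_change`) and a std-only stand-in for `IpNet` that handles both IPv4 and IPv6:

```rust
use std::net::IpAddr;

/// Hypothetical stand-in for `IpNet` covering both address families.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Cidr {
    addr: IpAddr,
    prefix_len: u8,
}

impl Cidr {
    /// True if `ip` is in the same family and shares the network prefix.
    fn contains(&self, ip: IpAddr) -> bool {
        match (self.addr, ip) {
            (IpAddr::V4(net), IpAddr::V4(ip)) => prefix_eq(
                u128::from(u32::from(net)),
                u128::from(u32::from(ip)),
                self.prefix_len,
                32,
            ),
            (IpAddr::V6(net), IpAddr::V6(ip)) => {
                prefix_eq(u128::from(net), u128::from(ip), self.prefix_len, 128)
            }
            // An IPv4 rule never matches an IPv6 client, and vice versa.
            _ => false,
        }
    }
}

/// Compares the top `prefix_len` bits of two addresses that are `width` bits wide.
fn prefix_eq(a: u128, b: u128, prefix_len: u8, width: u8) -> bool {
    if prefix_len == 0 {
        return true;
    }
    if prefix_len > width {
        return false;
    }
    let shift = u32::from(width - prefix_len);
    (a >> shift) == (b >> shift)
}

/// Error returned when a proposed policy would block the caller's own address.
#[derive(Debug, PartialEq)]
struct WouldLockOut {
    client_ip: IpAddr,
}

/// Rejects a policy change unless the caller's current IP stays allowed.
fn validate_policy_change(new_allow_list: &[Cidr], client_ip: IpAddr) -> Result<(), WouldLockOut> {
    if new_allow_list.iter().any(|net| net.contains(client_ip)) {
        Ok(())
    } else {
        Err(WouldLockOut { client_ip })
    }
}
```

This only protects the session that issues the change; it would not stop a user from locking out other roles or other addresses they connect from.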
+
+### Possible downsides
+This design is highly configurable, guarantees that out-of-policy clients get no access to data, and is likely the easiest mechanism to implement. However, it does have some downsides. The largest downside is in the strength of the guarantee it provides. The strongest network restriction we could provide is that no out-of-policy network traffic reaches the database. The proposed solution only guarantees that no connection can be established with the data plane (the coordinator). This has implications for DoS attacks, which must be handled outside the scope of these policies.
+
+## Minimal Viable Prototype
+
+Minimally, this feature can be implemented with a single `SystemVar` (`default_network_policy_allow_list`), configurable by `superusers`, that contains an allow-list of CIDRs (`Vec<IpNet>`).
+
+```sql
+ALTER SYSTEM SET default_network_policy_allow_list = '100.10.0.0/28,100.10.128.0/28';
+```
+
+The policy will be checked in `coord::handle_startup`, respond with L7 errors on denial, and apply to all users' connections to the system.
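For the prototype, the system variable's string value has to be turned into a list of networks before it can be checked. The sketch below assumes an IPv4-only, std-only parser in place of `IpNet`; `Cidr4`, `parse_cidr4`, `parse_allow_list`, and `is_allowed` are hypothetical names, not the functions used in the linked PR.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4-only stand-in for `IpNet`, stored as network and mask.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Cidr4 {
    network: u32,
    mask: u32,
}

/// Parses a single `a.b.c.d/len` entry.
fn parse_cidr4(s: &str) -> Result<Cidr4, String> {
    let (addr, prefix) = s
        .trim()
        .split_once('/')
        .ok_or_else(|| format!("missing '/' in {s:?}"))?;
    let addr: Ipv4Addr = addr
        .parse()
        .map_err(|e| format!("bad address in {s:?}: {e}"))?;
    let prefix_len: u32 = prefix
        .parse()
        .map_err(|e| format!("bad prefix length in {s:?}: {e}"))?;
    if prefix_len > 32 {
        return Err(format!("prefix length out of range in {s:?}"));
    }
    // A /0 prefix needs an all-zero mask; shifting a u32 by 32 would overflow.
    let mask = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    Ok(Cidr4 { network: u32::from(addr) & mask, mask })
}

/// Parses the comma-separated value of the allow-list variable.
/// Any malformed entry rejects the whole value.
fn parse_allow_list(value: &str) -> Result<Vec<Cidr4>, String> {
    value.split(',').map(parse_cidr4).collect()
}

/// True if any allow-listed network contains the client address.
fn is_allowed(allow_list: &[Cidr4], client_ip: Ipv4Addr) -> bool {
    let ip = u32::from(client_ip);
    allow_list.iter().any(|net| (ip & net.mask) == net.network)
}
```

With the example value above, each `/28` covers 16 addresses, so `100.10.0.7` would be accepted and `100.10.0.16` rejected.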
+
+PR for minimum prototype: https://github.com/MaterializeInc/materialize/pull/29739
+
+This minimal prototype can be built upon to achieve the proposed solution by:
+- Introducing network policy resources, and SQL to create them
+- Moving the `SystemVar` from a `Vec<IpNet>` to an `Ident` pointing to a `NetworkPolicy` resource
+- Modifying roles to have an `Option<NetworkPolicy>` and adding the SQL to set this policy
+- Adding validations to prevent lockout
+- Following up with sinks/sources
+
+## Alternatives
+
+### What policies apply to.
+A common alternative approach is to have a global allow-list. This approach was considered and will be the initially delivered solution; however, adding per-user policies with a default had similar complexity and added clarity around the scope of the policies, i.e., they only impact user connections, not sources or sinks.
+
+### Where network policies get applied.
+
+Network policies could be applied at many layers of our stack: network firewalls or security groups that intercept traffic before it hits application subnets, Kubernetes or Cilium network policies, the balancers, or the database itself. The solution above implements policies within the database itself. This comes with some disadvantages. For instance, this layer does not have auto-scaling, and the database must do some work for each denied request. For this reason, it makes sense to shift the policies left.
+
+One possible shift is to the balancer layer. In this scenario, balancers would support both HTTP and pgwire load balancing as well as network policy enforcement. Balancers have auto-scaling and are relatively stateless, so a large number of out-of-policy requests to a balancer would likely not impact any ongoing connections. The biggest challenge with implementing network policies in the balancers is that they do not have access to the policies or roles, which are stored in the database. To move network policies to the balancers, we would need some way of sharing all the policies and roles for all the environments a balancer is proxying.
+
+Another place we could shift these policies is to a WAF or network firewall. Neither of these seems reasonable to implement for both pgwire and HTTP in a multi-tenant ingress layer, but this could be revisited for private ingress. It would still have the same issue as the balancers of keeping policies up to date.
+
+## Open questions
+
+#### Single default policy for all resources?
+Should there be different default policies for users, sources, and sinks, or should a single default policy be applied to all resources once those resources start supporting policies? It may be difficult to roll out new resources if we only have one default, but it does seem nicer in the long run.
+
+#### How do we handle Webhook Sources?
+Sources and sinks are planned as a follow-up to user-based policies, but it remains an open question how we provide a user-friendly mechanism for webhook sources, where it may be hard to find a list of IPs if the webhook request is coming from a third party.
+
+#### The story on lockout is a bit weak.
+We may want to provide a more programmatic way to handle this, but we can wait and see if this becomes a problem.
+
+#### Should we support per-database or per-cluster policies?
+The answer to this is just no, at least not right now. These sorts of restrictions can be enforced with RBAC for the foreseeable future. It may be worth revisiting if we get per-cluster use-case isolation where `environmentd` isn't the single access point.
+
+#### How do we terminate active connections on policy change?
+Once we get `pg_terminate_backend`, we should be able to look at all existing connections and terminate those that the new policy would deny. Until then, policies will not affect existing connections.
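Selecting which connections to terminate after a policy change could look roughly like the sketch below. It is an illustration only: `ActiveSession` and `connections_to_terminate` are hypothetical names, the policy check is passed in as a closure so the sketch does not depend on any particular policy representation, and the actual termination step is left out because it depends on a mechanism that does not exist yet.

```rust
use std::net::IpAddr;

/// Hypothetical view of an open connection: its id, login role, and client address.
#[derive(Debug, Clone)]
struct ActiveSession {
    conn_id: u32,
    role: String,
    client_ip: IpAddr,
}

/// Returns the ids of sessions that the new policy would deny.
/// `is_allowed(role, ip)` is assumed to evaluate the updated policy for that role.
fn connections_to_terminate<F>(sessions: &[ActiveSession], is_allowed: F) -> Vec<u32>
where
    F: Fn(&str, IpAddr) -> bool,
{
    sessions
        .iter()
        .filter(|s| !is_allowed(s.role.as_str(), s.client_ip))
        .map(|s| s.conn_id)
        .collect()
}
```

The returned ids preserve the order of the input slice, which keeps the result deterministic for logging or auditing.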