diff --git a/docs/use-cases/lb_traffic_no_affinity.svg b/docs/use-cases/lb_traffic_no_affinity.svg new file mode 100644 index 0000000..df35c5c --- /dev/null +++ b/docs/use-cases/lb_traffic_no_affinity.svg @@ -0,0 +1,4 @@
[SVG diagram: routers R1..RN send BGP TCP/179 and IPFIX UDP/4739 through a Network Load Balancer to nfacctd Pods (nfacctd_1_1, nfacctd_1_2, ... nfacctd_1_M, ... nfacctd_T_1, ...) on Worker Node 1 ... Worker Node T; traffic from the same router lands on different Pods (no affinity).]
\ No newline at end of file
diff --git a/docs/use-cases/network-telemetry-nfacctd.md b/docs/use-cases/network-telemetry-nfacctd.md new file mode 100644 index 0000000..babc46b --- /dev/null +++ b/docs/use-cases/network-telemetry-nfacctd.md @@ -0,0 +1,127 @@

# Multi-flow K8s affinity: BGP and IPFIX/NetFlow network telemetry with `pmacct`'s `nfacctd`

This is the original use case that motivated this project. The original
discussion can be found in [this thread in Cilium's #general Slack channel](https://cilium.slack.com/archives/C1MATJ5U5/p1723579808788789).

## Context

### pmacct and datahangar

[pmacct](https://github.com/pmacct/pmacct) is probably _the_ most widely
used Open Source project for passive network monitoring. `nfacctd`, the
Network Flow ACCounting Daemon, collects flowlogs ([IPFIX](https://en.wikipedia.org/wiki/IP_Flow_Information_Export)/
[NetFlow](https://en.wikipedia.org/wiki/NetFlow)/[sFlow](https://en.wikipedia.org/wiki/SFlow)),
enriches them, normalizes values, etc., and later exports them (e.g. to a DB or
a message bus).

One of the main features of `nfacctd` is enriching flowlogs with [BGP](https://en.wikipedia.org/wiki/Border_Gateway_Protocol)
information, e.g. `AS_PATH` or `DST_AS`. To do so, `nfacctd` acts as both
a flowlog collector _and_ a passive BGP peer. Routers, therefore, connect to
`nfacctd` on - typically - `TCP/179` for BGP, and send e.g. IPFIX datagrams to
`UDP/4739`:

![A network router connecting to nfacctd](single_router_nfacctd.svg)

[datahangar](https://github.com/datahangar/) was originally designed as an E2E
test framework for pmacct in the context of Kubernetes. While it still serves
this [purpose](), `datahangar` has evolved into a reference architecture for a
network data pipeline using off-the-shelf OSS components.

While `nfacctd` mostly runs outside of K8s contexts, close to the routers, the
objective here was to make `nfacctd` _as cloud native as possible_, allowing
rescheduling on failure, autoscaling, etc.

### Connectivity requirements

BGP and flowlog traffic coming from the same router must:

* Preserve the source IP address, which is used to deduce the router's identity.
* End up in the same Pod.

## First attempt: `sessionAffinity: ClientIP` and `externalTrafficPolicy: Local`

The initial attempt was to define a `LoadBalancer` Service:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: nfacctd
spec:
  selector:
    app: nfacctd
  ports:
    - name: netflow
      protocol: UDP
      port: 2055
      targetPort: 2055
    - name: bgp
      protocol: TCP
      port: 179
      targetPort: 179
  type: LoadBalancer
  sessionAffinity: ClientIP      # Stick each client IP to a single Pod
  externalTrafficPolicy: Local   # Do not SNAT to the Service; preserve the client source IP
```

The mockup test quickly showed that source IP preservation worked in all cases,
but that affinity didn't work with multiple replicas or multiple worker nodes...
:disappointed:. Flows from the same router were hitting different Pods, or even
different worker nodes.

![BGP and flowlog traffic end up in different Pods](lb_traffic_no_affinity.svg)

## :bulb: What if...

But what if we were able to modify the traffic _before_ it hits the Network
Load Balancer (NLB), and pretend it's BGP (`TCP/179`), so that `sessionAffinity: ClientIP`
works, and then "undo" that trick in the Pod, right before the traffic is
delivered to `nfacctd`? Hmm, that _might_ work.

Adding a new feature to Kubernetes and to every NLB and CNI out there wasn't
quite an option :sweat_smile:, so it was obvious that the solution would have to
be a bit of a hack.

Time to go back to the drawing board...
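
To make the goal concrete: if everything a router sends could be made to reach
the NLB looking like a single `TCP/179` flow, and only be "un-disguised" inside
the Pod, the Service itself could in principle collapse to a single protocol and
port. The sketch below is illustrative only; the single-port layout and the
`bgp-funnel` port name are assumptions, not the project's actual manifests:

```yaml
# Illustrative sketch (assumption): with flowlog traffic disguised as TCP/179
# before it reaches the NLB, a single-port Service with ClientIP affinity would
# keep BGP and (disguised) IPFIX from the same router on the same Pod.
kind: Service
apiVersion: v1
metadata:
  name: nfacctd
spec:
  selector:
    app: nfacctd
  ports:
    - name: bgp-funnel        # Carries real BGP plus disguised flowlog traffic
      protocol: TCP
      port: 179
      targetPort: 179
  type: LoadBalancer
  sessionAffinity: ClientIP
  externalTrafficPolicy: Local   # Still needed to preserve the router's source IP
```

How to actually disguise the flowlog packets on their way in, and how to undo
the disguise inside the Pod, is what the rest of this page is about.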
## :honeybee: eBPF to the rescue!

### Funneling traffic through a single protocol and port

Let's call it [funneling](../funneling.md), so as not to confuse it with a real tunnel.

The diagram would show:

XXXX

Only the NetFlow/IPFIX traffic would have to be intercepted, mangled, and then
sent to the NLB. This could be done either by modifying the NetFlow/IPFIX
traffic _while_ routing it through intermediate nodes, e.g.:


or by pointing routers to one or more "funnelers" that mangle the packet and
DNAT it to the NLB. E.g.:

XXX

### Time to eBPF it :honeybee:!

The original prototype was as easy as this:

#### The code

```
```

The initial

#### Using an `initContainer`

## Conclusion and limitations

XXX
Works, MTU, need for funnelers.

## Acknowledgments

Thank you to Martynas Pumputis, Chance Zibolski and Daniel Borkmann for their
support in the Cilium community.

diff --git a/docs/use-cases/single_router_nfacctd.svg b/docs/use-cases/single_router_nfacctd.svg new file mode 100644 index 0000000..156b526 --- /dev/null +++ b/docs/use-cases/single_router_nfacctd.svg @@ -0,0 +1,4 @@
[SVG diagram: a network router (R1) connecting to nfacctd over BGP TCP/179 and IPFIX UDP/4739.]
\ No newline at end of file