Feature Release Plan 2021 05 30

Vinay Kulkarni edited this page Jun 3, 2021 · 47 revisions

Goals

  1. Label based network policies – Security Groups for IaaS network
    • Owner: Hong Chang
    • Summary: Insert a label (uint/uint64) into the packet's GENEVE header options / encap IP options to speed up ingress policy enforcement
    • Status:
      • Requirement clear, 5/30 scope identified.
      • Design doc complete and reviewed - See Label-based Network Policies.
      • Coding complete and checked-in.
      • Integration testing in progress.
  2. Bandwidth QoS monitor & control (for container networking)
    • Owner: Vinay
    • Summary: Guarantee network bandwidth for high priority containers
    • Status:
      • Requirement clear, 5/30 scope OK.
      • Design doc complete. See Bandwidth QoS for Kubernetes Pods
      • Coding complete.
        • Scoped down deliverable to the core feature.
          • High priority pods bw usage and low priority quota adjustment moved to next release.
      • PR in review. Integration testing in progress.
      • Implemented deployment of K8s cluster in AWS with Mizar as CNI
        • Single yaml Mizar deployment does not work in my case (works with kubeadm manual cluster). Investigating..
  3. Mizar Stability and Robustness
    • Owner: Phu
    • Summary: Improvements to Mizar so that it works in a stable, robust manner for Arktos and upstream K8s.
      • Investigate and fix issues to get Mizar single yaml deployment to work.
      • Investigate and fix the broken Mizar CNI.
      • Fix reliability issues affecting 'kind-setup.sh' & 'kind-setup.sh dev'.
      • Investigate and fix issues causing coredns and lpp pods to not come up and run successfully.
    • Status:
      • Coding complete and checked in.
      • Investigating corner-case issues.
  4. Network Policy Support for Arktos
    • Owner: Cathy
    • Summary: Network Controller changes to add network policy support in Arktos. Design & implement.
    • Status:
  5. Mizar <--> Arktos improvements

  1. Mizar Performance Test & eBPF/XDP offloading to NIC card - DE-PRIORITIZED FOR 5/30 DUE TO BLOCKING ISSUES

    • Owner: Phu
    • Summary: Compare Mizar perf (non-offloaded) with peer CNI solutions, e.g., Cilium, ovn-k8s
      • Metrics: Scalability, Latency, (TODO: What else..)
      • Status:
        • 5/30 scope OK.
        • Test setup identified.
        • Driving collaboration with USTC team, engaging with them during weekly meetings and on slack.
          • Potential contributions:
            • Move egress traffic handling from veth-pair host Rx to host NIC to reduce XDP memory footprint of Mizar - USTC started working on it
            • Add/enrich statistics gathering to Mizar - USTC team starting to work on this.
    • Summary: Evaluate Netronome NIC support for Mizar eBPF/XDP offload. If feasible, design & implement. Compare with DPDK.
      • Status:
        • 5/30 likely risk - external dependency, large task
        • NICs arrived, installed and switch configured.
        • Working on offloading Mizar eBPF to NIC.
  2. Stretch goal: Make Mizar work for scaleout architecture

    • Owner: TBD

Scrum meeting notes

03/12/2021

Vinay:

  • Investigating Google EDT for Bandwidth QoS project
  • Set up a talk to go over k8s components in arktos-up.
  • Set up a session to understand Mizar - how things flow.

Phu:

  • Working on understanding data-plane and DPDK.

Cathy:

  • Working on Arktos<-->Mizar bugs

Hong:

  • Design doc for Net policy WIP, ETA 03/15

03/17/2021

Vinay:

  • Upload the arktos-up & kube-up overview and share.
  • Design doc for QoS and colocation of high/low pri pods.
    • Use mutating webhook to add anti-affinity to high-network-priority annotated pods.
    • Investigating how to use EDT.
  • Change sync meetings to Tue & Thu @ 2pm instead of M-W-F
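
The mutating-webhook idea above can be sketched as follows. This is a minimal illustration, not the design-doc implementation: the annotation key `example.io/network-priority`, the matching pod label, and the function name are all hypothetical. It also assumes that "anti-affinity" means keeping high-priority pods off the same node, and that those pods carry a label the selector can match (label selectors cannot match annotations).

```python
import base64
import json

# Hypothetical names, chosen only for this sketch.
PRIORITY_ANNOTATION = "example.io/network-priority"
PRIORITY_LABEL = "example.io/network-priority"


def build_admission_response(review: dict) -> dict:
    """Build an AdmissionReview response that patches in pod anti-affinity."""
    request = review["request"]
    pod = request["object"]
    annotations = pod.get("metadata", {}).get("annotations") or {}
    response = {"uid": request["uid"], "allowed": True}

    if annotations.get(PRIORITY_ANNOTATION) == "high":
        anti_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {PRIORITY_LABEL: "high"}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
        if pod.get("spec", {}).get("affinity"):
            # Note: this overwrites any podAntiAffinity the pod already has.
            patch = [{"op": "add", "path": "/spec/affinity/podAntiAffinity",
                      "value": anti_affinity}]
        else:
            patch = [{"op": "add", "path": "/spec/affinity",
                      "value": {"podAntiAffinity": anti_affinity}}]
        # The API server expects the JSONPatch document base64-encoded.
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}
```

A real webhook would serve this function over HTTPS and be registered through a MutatingWebhookConfiguration; both are omitted here.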

Phu:

  • Investigating how to load XDP program into offload NIC. Add CLI option for it.

Cathy:

  • Working on binary searching Arktos master CLs to narrow down where the Mizar CNI integration broke.
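
For reference, the binary search over CLs can be sketched as below. It assumes a list of commits ordered oldest to newest, a known-good first entry, a known-bad last entry, and a breakage that persists once introduced. The `is_good` callback is a hypothetical stand-in for "deploy this CL and check whether the Mizar CNI integration works".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_good() is False.

    Assumes commits[0] is good and commits[-1] is bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

Each round halves the candidate range, so about log2(N) deployments are needed for N CLs. The search only works if some known-good CL exists to start from.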

Hong:

  • Working on adding E2E workflow diagram.
  • Need pointers with how GENEVE is used today in Mizar.
    • Check if Phu's video, Sherif's KC talk has this info.

03/19/2021

Vinay:

  • Upload the arktos-up & kube-up overview and share.
  • Design doc for QoS and colocation of high/low priority pods.
    • Going through workflow of CNI network interface addition in Mizar.
      • Team uses kind. Check k8s version used
      • Try k8s in GCE with Mizar with kind k8s version that is known to work.
    • Investigating how to use EDT.
  • Change sync meetings to Tue & Thu @ 2pm instead of M-W-F

Phu:

  • Out sick today.

Cathy:

  • The issue exists in July version of arktos as well.
  • Create issue with all details of CLs we tried and loop in me and XiaoNing.

Hong:

  • Hong to schedule design review meeting for Monday.
  • Need pointers with how GENEVE is used today in Mizar.
    • Find and send a video on overlay networks.
    • Check if Phu's video, Sherif's KC talk has this info.

03/23/2021

Vinay:

  • Design doc for QoS and colocation of high/low priority pods.
    • Going through workflow of CNI network interface addition in Mizar.
      • Team uses kind. Check k8s version used
    • Investigating how to use EDT.
      • Trying Cilium EDT code.

Phu:

  • CLI update to install XDP program in kernel vs offload
  • NICs will be here tomorrow maybe.

Cathy:

  • Hongwei fixed containerd version, it works now.
  • Network controller changes to add network policy to arktos.
    • Working on design doc, need to identify changes in Arktos to work with network policy in Mizar.

Hong:

  • Update design doc with details of E2E flow.

03/25/2021

Vinay:

  • Design doc for QoS in progress.
    • Trying out XDP tutorial.
    • Trying out tc prototype to limit bandwidth.

Phu:

  • CLI update done. PR out.
  • NICs received, dropped them off at the office.
  • Start looking into how to do perf-test.
  • Vinay to look into what switch to buy. Prefer to keep it local. Check with David on our own switch install.

Cathy:

  • Arktos keeps sending pod create requests without reason - investigating.
  • Send Cathy arktos PR submit pre-checks.

Hong:

  • Update design doc with details of E2E flow.
  • Set up follow-up design doc review.

03/30/2021

Vinay:

  • Design doc for QoS in progress.
    • Investigating how to leverage linux TC after encapsulating outgoing packet at veth Rx.
    • Working with David to get the cabling needed to hookup Netronome NICs.

Phu:

  • Investigating DPDK setup for comparison with XDP.
  • Proposed test-setup:
    • 1 master, 2 workers [physical servers with 1 netronome NIC each] = total of 3 baremetal servers
    • Identify upstream version of k8s to use for cluster.
    • Mizar vs ovn vs Cilium vs DPDK

Cathy:

  • Arktos keeps sending pod create requests without reason - still investigating.
  • Arktos PR - hit CI issues. No progress yet. Not urgent, postponed until later.
  • Working on design doc for Network policy in Arktos. Review ETA tomorrow afternoon. #1 priority.

Hong:

  • Update design doc with details of E2E flow.
    • Struct change details identified.

04/01/2021

Vinay:

  • Going to office tomorrow to install NICs

Phu:

  • Investigating DPDK setup for comparison with XDP.
  • Working on perf test setup and plan.

Cathy:

  • Network Policy Design Doc in review, looks promising.

Hong:

  • Out today.

04/13/2021

Vinay:

  • Switch config for Phu - working now.
  • Attending NSDI talks.
  • Single-node deployment debugging - Pods stuck in container creating.
  • VMware fusion deployment (1 master 2 worker cluster Ubuntu 20.04 latest kernel) of Hongwei mizar yaml .. hit several issues.
  • Arktos AWS deployment with Mizar support. (kube-up).. arktos aws-kube-up has regressed.
  • Upstream k8s v1.19.2 + Ubuntu 20.10.

Phu:

  • kubeadm setup worked -- v1.21 + Ubuntu 18.04 + updated kernel.
  • Mizar deploy.mizar.components.yaml - did not work. Working on a fix..
  • Also hitting interface not found in mizarcni.log
  • Loading XDP in offload mode hitting issues..

Cathy:

  • Prototyping the network policy changes.
    • Blocked on GRPC error. Hong Chang will help.
  • Review Amit's change to simplify single-node arktos deployment for mizar.

Hong:

  • Good progress implementing label plumbing to XDP. PR needs review.
    • Vinay to review PR.

04/15/2021

Vinay:

  • Sent PR to fix docker image in kind-setup - fixes the issue USTC folks hit.
  • Submitted PR (in my k8s fork) to deploy k8s with Mizar in AWS (Flannel works, mizar single-yaml currently does not)
  • Investigating single-YAML in AWS & arktos-up. Phu also hit this issue.
    • arktos-up seems currently broken.
  • Reviewed Hong's PR, have some questions.

Phu:

  • Out sick today.

Cathy:

  • GRPC issue resolved. Progress is being made.
  • Reviewed Amit's changes, will provide more comments to update user-guide.

Hong:

  • Good progress implementing label plumbing to XDP. PR is in review.
    • Discussed the idea of creating a new generic struct to hold Geneve options data rather than using endpoint struct.

04/20/2021

Vinay:

  • Working on fixing the single-yaml deployment. Pod-to-pod pings don't work.
    • Mizar currently broken - regression. Need CI tests. Looking for commit that caused regression.
  • Next: pick back up on Bandwidth QoS project.

Phu:

  • Found issue with droplets not coming up in single Yaml. Sending PR today / tomorrow.
    • This is blocking Phu as well.
  • Will be mentoring Wei with kind-setup CI test task next week.

Cathy:

  • Two PRs out for review.
  • Working on converting arktos n/w policy to mizar for py.

Hong:

  • Investigating creating a separate BPF map to store packet options info.

04/22/2021

Vinay:

  • PR to cleanup kind-setup merged.
  • Plan to start working on QoS project soon.

Phu:

  • Found issue with droplets not coming up in single Yaml.
  • Fixed issue with droplet not being main interface.
  • Looking into CNI issue.
  • Will be mentoring Wei with kind-setup CI test task next week.

Cathy:

  • PRs are merged.
  • Working on converting arktos n/w policy to mizar for py.

Hong:

  • Investigating creating a separate BPF map to store packet options.
    • Need help with creating a new BPF map

04/27/2021

Vinay:

  • Started working on QoS project.
    • Blocked on trying to find the underlying sk_buff in XDP code (xdp_md / xdp_buff)
  • Project proposal for summer of code.

Phu:

  • Netronome does not support XDP_REDIRECT. This is a blocker for us.
  • Found issue with droplets not coming up in single Yaml.
  • Working on pods not being created. CNI ADD call is not handled correctly.
  • Will be mentoring Wei with kind-setup CI test task next week.

Cathy:

  • Out sick - getting COVID vaccine, maybe out tomorrow as well.

Hong:

  • Reviewed PR. Fixing issues, adding unit tests.

04/29/2021

Vinay:

  • Started working on QoS project.
    • sk_buff is not created at the point where we intercept the egress packet. Cannot use the EDT mechanism as is done by Cilium or Google (at the TC hook).
    • Need a different approach - investigating.
  • Project proposal for summer of code written up and reviewed.

Phu:

  • Netronome does not support XDP_REDIRECT. This is a blocker for us.
  • Found issue with droplets not coming up in single Yaml.
  • Continuing to investigate why the CNI ADD call is not handled correctly.

Cathy:

  • Working on JSON conversion from mizar <--> arktos.

Hong:

  • Updated PR needs review.
  • Vinay will try and review it today.

05/04/2021

Vinay:

  • Continuing work on QoS project. Going SLOW...

Phu:

  • Node not ready issue fixed.
  • Investigating another issue in pod create - CNI issue.

Cathy:

  • Working on JSON conversion from mizar <--> arktos.
    • Ingress rules fixed. Working on egress & podSelector rules.
    • PR by EOW.

Hong:

  • Updated PR for plumbing labels for egress processing.
  • Working on XDP code to build Geneve frame with label options.
  • Vinay will try and review it today.

Wei:

  • Ramping up.

05/06/2021

Vinay:

  • Continuing work on QoS project.

Phu:

  • Out today

Cathy:

  • PR by EoW.
    • Code for Arktos side is getting done.
    • Minor changes needed for Mizar side.

Hong:

  • PR for plumbing labels for egress processing done and submitted.
  • XDP code to build Geneve frame with label options is working.
  • Vinay will review XDP PR.
  • Working on extracting labels from ingress packet and applying ingress policy processing.
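
As a rough illustration of carrying a label in a Geneve option, the sketch below packs and parses the header layout from RFC 8926 in user space. It is not the Mizar XDP code: the option class `0xFF00` (taken from the experimental range), the option type `0x01`, the 32-bit label width, and all function names are assumptions made for this example.

```python
import struct
from typing import Optional

LABEL_OPT_CLASS = 0xFF00        # hypothetical; experimental option-class range
LABEL_OPT_TYPE = 0x01           # hypothetical; high (critical) bit left clear
GENEVE_PROTO_ETHERNET = 0x6558  # Transparent Ethernet Bridging


def pack_label_option(label: int) -> bytes:
    """One option TLV: class (16 bits), type (8), rsvd+length (8), 4-byte label."""
    data_words = 1  # option data length in 4-byte words, header excluded
    return struct.pack("!HBBI", LABEL_OPT_CLASS, LABEL_OPT_TYPE, data_words, label)


def pack_geneve_header(vni: int, options: bytes) -> bytes:
    """8-byte Geneve base header followed by the options."""
    if len(options) % 4 or len(options) // 4 > 0x3F:
        raise ValueError("options must be a multiple of 4 bytes, at most 252")
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI is a 24-bit value")
    ver_optlen = (0 << 6) | (len(options) // 4)  # version 0, Opt Len in words
    flags = 0                                    # O and C bits clear
    return struct.pack("!BBHI", ver_optlen, flags, GENEVE_PROTO_ETHERNET,
                       vni << 8) + options


def unpack_label(header: bytes) -> Optional[int]:
    """Walk the options and return the label, or None if it is absent."""
    end = 8 + (header[0] & 0x3F) * 4
    offset = 8
    while offset + 4 <= end:
        opt_class, opt_type, length = struct.unpack_from("!HBB", header, offset)
        data_len = (length & 0x1F) * 4
        if (opt_class, opt_type, data_len) == (LABEL_OPT_CLASS, LABEL_OPT_TYPE, 4):
            return struct.unpack_from("!I", header, offset + 4)[0]
        offset += 4 + data_len
    return None
```

On ingress, the receiver can read the label straight from the header and match it against policy without a separate endpoint lookup, which is the speed-up the goal summary describes.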

Wei:

  • Ramping up.

05/11/2021

Vinay:

  • Continuing work on QoS project. Added code to XDP_PASS for low-priority pods and apply EDT.
    • Facing various issues, investigating.
  • Working on SoW metrics for C2C

Phu:

  • Mentoring & ramping up Wei.
  • Fixing the yaml. Targeting a PR by tomorrow.

Cathy:

  • PR in review. Fixing feedback.
  • Troubleshooting Mizar gRPC issue.

Hong:

  • PR for plumbing labels for egress processing done and submitted.
  • XDP code to build Geneve frame with label options is working.
  • Vinay will review XDP PR. Review is done. Will merge today.
  • Working on extracting labels from ingress packet and applying ingress policy processing.

Wei:

  • Ramping up, no blocking issues.

05/14/2021

Vinay:

  • Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
    • Facing issue with bpf_debug causing agent XDP program to fail to load. Debugging..

Phu:

  • Mentoring & ramping up Wei.
  • UW presentation and zeta meeting presentation went very well.
  • PR out for fixing yaml. Needs review. Fixing it for kind-setup.

Cathy:

  • PR merged.
  • Mizar code changes and then testing.

Hong:

  • PR for plumbing labels for egress processing done and submitted.
  • Plumb the labels data to ingress XDP maps and do e2e tests.

Wei:

  • Ramping up. Tried out arktos-mizar with Cathy's help in AWS.

05/18/2021

Vinay:

  • Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
    • Filed issue for benchmark=True causing _agent XDP program to fail to load. We can look at this after 5/30.
    • Ping drop in the TC slow path was due to bridge calling into iptables. Disabled iptables and things work.
    • Writing EDT program for slow path.
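
The arithmetic behind EDT (Earliest Departure Time) rate limiting can be sketched in user space as below. This is a simplified model of the general technique, not the Mizar TC program: the class name, the drop horizon default, and the single shared pacing state are assumptions. In a kernel implementation the computed timestamp would be written to the packet's `skb->tstamp` and a qdisc such as `fq` would hold the packet until that time.

```python
from typing import Optional

NSEC_PER_SEC = 1_000_000_000


class EdtPacer:
    """Assigns each packet a departure time so the flow stays at or below a rate."""

    def __init__(self, rate_bytes_per_sec: int,
                 horizon_ns: int = 2 * NSEC_PER_SEC) -> None:
        self.rate = rate_bytes_per_sec
        self.horizon_ns = horizon_ns  # drop packets scheduled this far ahead
        self.t_last = 0               # departure time given to the previous packet

    def schedule(self, pkt_len: int, now_ns: int) -> Optional[int]:
        """Return the departure timestamp in ns, or None if the packet is dropped."""
        delay = pkt_len * NSEC_PER_SEC // self.rate  # wire time at the target rate
        t_next = self.t_last + delay
        if t_next <= now_ns:
            # Under the rate: send immediately and restart pacing from now.
            self.t_last = now_ns
            return now_ns
        if t_next - now_ns >= self.horizon_ns:
            return None  # queue would grow too far ahead, so drop
        self.t_last = t_next
        return t_next
```

For example, at 1,000,000 bytes per second, back-to-back 1000-byte packets are spaced 1 ms apart.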

Phu:

  • Mentoring & ramping up Wei.
  • PR out for fixing yaml. Needs review. Fixed it for kind-setup.
    • Resolve conflicts and update PR.
    • Wei will help test the change in kind & real-cluster in AWS.
  • Working on USTC tasks (tail-call to offloaded XDP program) & CI.

Cathy:

  • PR merged.
  • Mizar code changes done.
  • Deployment scripts need changes to test the mizar and arktos changes together.
    • Phu can help with it once done with his work.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e tests.
    • Running into issues with plumbing data to maps. Investigating..

Wei:

  • Ramping up. Tried out arktos-mizar with Cathy's help in AWS.
  • Will work with Phu to test kind-setup & deploy.yaml in AWS.

05/20/2021

Vinay:

  • Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
    • Filed issue for benchmark=True causing _agent XDP program to fail to load. We can look at this after 5/30.
    • Ping drop in the TC slow path was due to bridge calling into iptables. Disabled iptables and things work.
    • Writing EDT program for slow path.
    • Send docker credentials to Phu for image update.

Phu:

  • Mentoring & ramping up Wei.
  • PR out for fixing yaml. Needs review. Fixed it for kind-setup.
    • Resolve conflicts and update PR.
    • Wei will help test the change in kind - this works.
    • Wei to try it in real-cluster (latest k8s) in AWS.
  • Working on USTC tasks (tail-call to offloaded XDP program) & CI.
  • Working on CI test framework.
  • Meeting with Peng Du.

Cathy:

  • PR merged.
  • Mizar code changes done.
  • Deployment scripts need changes to test the mizar and arktos changes together.
    • Phu can help with it once done with his work.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e tests.
    • PR ready to review. E2E testing completed.

Wei:

  • Ramping up.
  • Will work with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster).

05/25/2021

Vinay:

  • Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
    • Working on code changes to add mizar bridge, attach tc edt program to eth0.
      • Planning to send out a draft-PR tomorrow.
    • Send docker credentials to Phu for image update.

Phu:

  • Mentoring & ramping up Wei.
    • Wei to try it in real-cluster (latest k8s) in AWS.
  • Working on USTC tasks (tail-call to offloaded XDP program) & CI.
  • Working on CI test framework.
  • Peng Du gave us an overview.
  • Phu out Thu/Fri.

Cathy:

  • Arktos deployment PR needs review.
  • PR for param-diff issues needs review.
  • PRs merged in arktos repo.
  • Deployment scripts need changes to test the mizar and arktos changes together.
  • Issue with Pod from store does not have pod IP. Hong Chang suggested trying to find endpoint for pod and taking the IP from there.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e tests.
    • PR merged.
    • Will demo simple policy (calico example) on Friday.
    • Review next PR that decides whether to pass or block traffic.

Wei:

  • Working with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster).
    • Recommend using t2.2xlarge, and t3.2xlarge for future iterations.
    • Experimenting with XDP offload in Netronome.

05/27/2021

Vinay:

  • Sent out PR 492 for review covering the low priority b/w limiting feature.
  • Send docker credentials to Phu for image update.

Phu:

  • Mentoring & ramping up Wei.
    • Wei tried Phu's change (latest k8s) in AWS. It works now.
  • Working on USTC tasks (tail-call to offloaded XDP program).
  • Working on CI test framework (Travis is going paid, so using GitHub Actions).
  • Phu out Thu/Fri.

Cathy:

  • Arktos deployment PR needs review.
  • PR for param-diff issues needs review.
  • Deployment scripts need changes to test mizar and arktos changes together.
  • Issue with Pod from store does not have pod IP. Hong Chang suggested trying to find endpoint for pod and taking the IP from there.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e tests. Found and fixed an issue.
    • Will send updated PR.
    • Review updated PR that decides whether to pass or block traffic.
  • Plan to demo simple policy (calico example) on Friday is still on.

Wei:

  • Working with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster) - DONE.
    • Recommend using t2.2xlarge, and t3.2xlarge for future iterations.
    • Experimenting with XDP offload in Netronome.
  • Help Vinay with TC EDT program bandwidth rate-limit configuration.

05/29/2021

Vinay:

  • Sent out PR 492 for review covering the low priority b/w limiting feature.
  • Working on automated unit tests and e2e tests.
  • I will schedule time for reviewing PR 492.

Phu:

  • Working on USTC tasks (tail-call to offloaded XDP program).
  • Working on CI test framework (Travis is going paid, so using GitHub Actions).

Cathy:

  • Arktos deployment PR needs review. Vinay will review by Thursday.
  • PR for param-diff issues needs review. (This can wait post 5/30 release)
  • (Post 5/30) Deployment scripts need changes to test mizar and arktos changes together.
  • (Post 5/30) Issue with Pod from store does not have pod IP. Hong Chang suggested trying to find endpoint for pod and taking the IP from there.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e test.
    • PR in review. Cathy/Phu will review today.
  • Demo on 5/28 went well. Thanks!
  • Working on another round of E2E tests. Automate possible? "make teste2e"

Wei:

  • Helping Vinay with TC EDT program bandwidth rate-limit configuration.
    • Clean kind-setup works. Able to try out my code and verify iperf works, ping works.
      • Working on modifying trn_edt_tc.c to figure out iperf bandwidth - rate limit BPS disconnect.
      • Add code to configure egress bandwidth bps from usermode instead of default/fixed kernel mode.
  • Translating slides from Tencent.

06/03/2021

Vinay:

  • Sent out PR 496 - code cleanup.
  • Working on e2e tests.
  • I will schedule time for reviewing PR 492.

Phu:

  • PR for single yaml and docker image updated
  • PR for the bug fixes
  • Working on CI test framework with github actions.
  • USTC collab (tail-call to offloaded XDP program): will do this next week.

Cathy:

  • Arktos deployment PR needs review. Vinay will review by Thursday.
  • PR for param-diff issues needs review. (This can wait post 5/30 release)
  • Fixed failing network policy unit tests. PR ready for review.
  • (Post 5/30) Deployment scripts need changes to test mizar and arktos changes together.
  • (Post 5/30) Issue with Pod from store does not have pod IP. Hong Chang suggested trying to find endpoint for pod and taking the IP from there.

Hong:

  • Plumb the labels data to ingress XDP maps and do e2e test.
    • PR in review. Cathy/Phu will review today.
  • Need to do another round of E2E test after Vinay's PR is merged.

Wei:

  • Discussed Tencent slides
  • Overview of QoS project I have worked on. Potentially work on the next phase.
  • Delete pod and create same pod again and it does not work.
    • Create github issue with repro steps and details.