Feature Release Plan 2021 08 30

w-yue edited this page Oct 5, 2021 · 63 revisions

Goals

  1. QoS for Pods - Multiple traffic classes and priorities

    • Owner: Vinay
    • Summary: Guarantee network bandwidth for high priority containers
      • Finish bandwidth use monitoring and dynamic adjustment of low priority pod bandwidth limits. DONE. - Vinay
      • Design and add support for multiple priority levels (TC based) (HQ ask). DONE - Vinay.
      • Design and add mechanism to perform granular bandwidth classification and prioritization. - Vinay (stretch)
    • Status:
      • Requirements mostly clear.
      • 100% complete (green)
  2. CentaurusEdge feature requirements

    • Owner: Phu
    • Summary: Assign pods to VPCs
    • Status:
      • Received requirements from Peng Du & team.
      • PR Merged.
      • 100% complete (green)
      • Working on bug-fixes.
  3. Mizar CNI in golang

    • Owner: Hong Chang
    • Summary: Mizar CNI is written in Python and needs to be rewritten in Go.
    • Status:
      • Coding nearly complete, testing complete. PR out for review, blocked on lack of reviewer bandwidth.
      • Added improvement for building docker containers.
      • 100% complete (green)
  4. Mizar Service Regression Fix

    • Owner: Hong Chang
    • Summary: Mizar scaled endpoint regression. Not working as of today.
    • Status:
      • Kubernetes Service and CoreDNS are running.
      • PR merged in.
      • Scaled endpoint functional test not passing.
        • Instability with UDP test.
      • 80% complete
  5. Automated tests & CI & Stability

    • Owner: Phu(lead) + everyone
    • Summary: Framework + automated E2E tests
      • Add framework to run E2E tests locally and in CI
      • Write E2E tests for existing features & new features (team)
      • Deploy K8s(1.21)+Mizar in GCE or AWS via kube-up to facilitate easy E2E dev/testing in real cluster (Vinay) (Stretch)
        • Automated deployment can help perf testing as well
        • Very long stretch for 8/30
    • Status:
      • 90% complete on CI framework
      • ??% on tests
  6. Performance Metrics and Comparison

    • Owner: Wei
    • Summary: Performance benchmark and analysis
      • Primarily focus on small-scale tests of throughput, latency, etc. in this phase.
      • Analyze results to find potential issues we need to address.
    • Status:
      • Got non-debug binaries working on Ubuntu 20.04.
      • Had first set of benchmarking numbers based on existing hardware and code base.
      • Plan to test on new NICs with latest code and bug fixes.
      • 100% complete (green)
  7. Stability and optimization goals (Stretch)

    • Owner: TBD
    • Summary: Various improvements
      • Remove dependence on “eth0” interface name in Mizar code
      • Mizar deployment yaml create / delete / create (install & cleanup & install) in K8s
      • Pod create-delete-create with Mizar networking in K8s
      • Nice to have:
        • Remove label options from Geneve frames when labels are not in use.
        • Remove scaled_EP and RTS options when not in use
        • AWS setup - Intermittent connectivity failure on the initial attempt from a remote pod to a pod (iperf receiver / ping dest) that is on the same host as the bouncer.
          • root@ip-192-168-1-144:~# kubectl exec -ti netpod3 -- iperf3 -t 15 -u -p 8888 -c 20.0.0.82
          • iperf3: error - unable to connect to server: No route to host
  8. Hardware offload (ongoing, not 8/30)

    • Owner: Wei(lead) + USTC
    • Summary: Offload bouncer/divider
    • Status:
      • USTC continues to refactor code for hardware offload.
  9. Mizar - Arktos integration (ongoing, not 8/30)

    • Owner: Vinay + Click2Cloud
    • Summary: Get Mizar working with Arktos (including network policy).
    • Status:
      • 20% complete (looking optimistic)
      • PR out for kernel update script
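
The QoS goal above (monitor high-priority bandwidth use and dynamically adjust low-priority pod limits) can be illustrated with a minimal Go sketch. This is not Mizar's implementation: the function name `lowPriorityLimit`, its parameters, and the policy itself (give low-priority traffic whatever the high-priority class is not using, but never less than a small floor) are assumptions made for illustration only.

```go
package main

import "fmt"

// lowPriorityLimit returns a hypothetical egress limit (Mbit/s) for
// low-priority pods. Assumed policy: low-priority traffic may use the
// capacity that high-priority traffic is not currently using, clamped
// to the range [floorMbps, linkMbps].
func lowPriorityLimit(linkMbps, highPriUsedMbps, floorMbps float64) float64 {
	spare := linkMbps - highPriUsedMbps
	if spare < floorMbps {
		// Never starve low-priority pods completely.
		return floorMbps
	}
	if spare > linkMbps {
		// Guard against a bogus (negative) usage sample.
		return linkMbps
	}
	return spare
}

func main() {
	const link, floor = 10000.0, 100.0
	// Each sample stands in for one monitoring interval.
	for _, used := range []float64{0, 4000, 9950, 12000} {
		fmt.Printf("high-pri %.0f Mbit/s -> low-pri limit %.0f Mbit/s\n",
			used, lowPriorityLimit(link, used, floor))
	}
}
```

In a real datapath the computed limit would still have to be pushed to whatever enforces it (for example a rate-limit map entry or a TC class); that step is omitted here.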

  1. Mizar/Zeta Integration or converged into one data plane platform
  2. Complete Zeta DFT (stateful) features
  3. Support more types of Arktos Services (Cluster IP, Node Port and Load Balancing)
  4. Change Mizar/Zeta controllers (control plane) to Go
  5. Chaining of XDP programs dynamically for complete e2e network services – service upgrade at runtime.
  6. Mizar supports CNI-Genie (multiple networks)?

Scrum meeting notes

09/30/2021

  • Vinay:

    • Multi-level QoS PR checked in.
    • Will start looking into checksum failure issue. (9/30 release blocker)
    • Validate switch support for QoS (not 9/30 release blocker)
  • Hong Chang:

    • Investigate bootstrap changing go.mod issue. (9/30 release blocker)
    • Look into iperf repro of the checksum issue.
  • Wei:

    • Fixing CI test issue in change that removes eth0 dependency.

09/27/2021

  • Vinay:

    • Multi-level QoS is working, just not the way I was expecting it to.
    • Will generate PR by EOD tomorrow.
  • Hong Chang:

    • Using existing test code to debug the checksum issue using just containers.
    • Cannot make 9/30 release.
  • Wei:

    • PR for benchmark=true merged.
    • PR for eth0 ready tomorrow.
    • Will have some early perf numbers for review by tomorrow.

09/23/2021

  • Vinay:

    • Still looking to get tc prio working correctly.
  • Phu:

    • Helping me with the unit tests and switch configuration.
  • Hong Chang:

    • Working on intermittent checksum failure issue.
      • We have a suboptimal solution. Looking for the right fix.
  • Wei:

    • Running the benchmark tests to get numbers.

09/20/2021

  • Vinay:

    • PR implementing multiple levels of priority submitted. Working on the last piece of the puzzle - tc prio.
    • Need to add tests and fix currently broken tests.
  • Hong Chang:

    • Bootstrap regression fixed and merged.
    • Working on checksum issue, not much progress debugging.
  • Wei:

    • Investigating TCP tx checksum failure in on-prem systems (works on AWS) (different issue from Hong's problem)

09/16/2021

  • Vinay:

    • Working on implementing multiple levels of priority, investigating why tc setup is not working.
  • Phu:

    • Working on network creation validation check.
    • PengDu hitting issue with adding new node (regular k8s)
  • Hong:

    • Regression in bootstrap script. PR fixing it ETA Friday.
    • Possibly missed checksum recalculation. Investigating, but the regression is higher priority.
  • Wei:

    • Investigate checksum issue with Hong.
    • File issue for pod delete re-create ping not working problem (possible cleanup bug)

09/13/2021

  • Vinay:

    • PR is in review, fixing CI and tests still pending.
    • Working on implementing multiple levels of priorities and classes.
  • Phu:

    • Validation of network creation for P.Du's feature.
    • No known issues blocking edge team as of now.
  • Hong Chang:

    • Last bit of review changes to golang CNI PR
    • Continue investigation of flakiness in service
  • Wei:

    • Perf numbers comparing us vs Cilium
    • PR for benchmark=true. ETA this week.

08/23/2021

  • Vinay:

    • PR is in review.
    • Doing code reviews for C2C and Hong's changes.
    • Monitor changes and tests by Wednesday.
    • On vacation 08/27 to 09/05.
  • Phu:

    • PR out for review.
    • Reviewing other PRs.
  • Hong Chang:

08/20/2021

  • Vinay:

    • PR is out for review.
    • Doing code reviews for C2C and Hong's changes.
    • Starting code changes for monitor and configure low-priority b/w.
  • Hong Chang:

    • Blocked by issue building daemon docker image.
    • Working on service XDP regression.
  • Phu:

    • Working on testing the kubeedge requirement changes.
  • Wei:

    • Mizar works on 20.04 in AWS.
    • Working on 21.04.

08/09/2021

  • Vinay:

    • Separate program to lookup the stats map and update it with sync_fetch_add works.
      • Tail-call works.
      • Working on passing action through xdp_md. If it works, PR by EOD tomorrow.
  • Phu:

    • Implementation of kubeedge requirements done.
    • PR by next Monday.
  • Hong Chang:

  • Wei:

    • Working on getting Mizar running with v1.22
    • Should get Mizar running by next Monday.

08/05/2021

Next release moved to 09/30.

  • Vinay:

    • Separate program to lookup the stats map and update it with sync_fetch_add works.
      • Working on tail_call of this program - hope this works.
  • Phu:

    • Design doc done. Ready for merge.
    • Working on implementation. Sometime before 9/30.
  • Hong Chang:

    • PR ready for review.
    • I'll review it tomorrow.
  • Wei:

    • Mizar with Cilium infra.
    • Working on fixing the eth0 name dependency issue.

08/02/2021

  • Hong Chang:

    • Unit tests in progress. ETA for PR Thursday.
  • Phu:

    • Design doc reviewed and nearly complete.
    • In parallel, working on implementation.
  • Wei:

    • Starting to get Mizar working on Cilium way of perf-testing.
  • Vinay:

    • Still stuck on the tran_agent XDP program: with the stats map lookup/write code added, it fails to load.
      • Need ideas on how to debug this.

07/29/2021

Hong Chang:

  • CNI work nearly done. Adding tests.
  • Refactored CNI code to make it easier to test.
  • PR ETA mid-next week.

Phu:

  • Design doc out for review.
  • Getting head-start on implementation.

Wei:

  • Working on getting Cilium perf-framework and tests running.
    • Going well so far. Might hit issues with mizar.

Vinay:

  • Working on implementing basic statistics to compute high-pri b/w usage.
    • Hitting issues loading trn-agent-xdp after adding map lookup.

07/26/2021

Hong Chang:

  • CI/CD is passing now. Adding tests.

Phu:

  • Design doc nearly done. We will share with PDu team when reviewed and ready.

Vinay:

  • Design doc and PR out. Awaiting review.
  • Working on adding tests and bandwidth control.

Wei:

  • Setting up test environment.

07/22/2021

Hong Chang:

  • CNI code-complete. Manual testing done.
  • Fixing issues with go build process, need Makefile updates to get required go packages.

Phu:

  • Design doc in progress. ETA next Monday.
  • PR for CI issue.

Wei:

  • Perf test environment setup done.
  • Starting basic testing tomorrow.

Vinay:

  • PR out for review. Design doc out for review. Presenting to HQ today.

07/19/2021

Vinay:

  • PR and design doc out for review.
  • Worked on deploying Arktos in AWS to unblock C2C effort.
  • After working around the kubeadm regression, hit another regression.
  • Aborting the work. We will ask them to use GCE.

Hong Chang:

  • CNI code is almost done.
  • Manual testing is next.

Phu:

  • Welcome back :)

Wei:

  • Out of office 07/19.

07/15/2021

Vinay:

  • Sent out PR for CLI based config of egress-bandwidth-limit.
    • Stuck at not being able to successfully bpf_map_lookup_elem. Tried various things.
    • Team: Please review PR to see if I missed something.
  • Design doc nearly complete. Hope to finish it by EOD.

Wei:

  • Multiple XDP programs on the same NIC work; this needs kernel 5.10+ and Ubuntu 21.04.
  • Dunant is getting machines ready. ETA unknown.
  • Out next Monday.

Hong Chang:

  • CNI code: Go version of moving the veth pair into the pod's network namespace.

07/12/2021

Vinay:

  • Figured out the CLI issue and testing code with key (source addr).
  • ETA tomorrow.

Hong:

  • Implementing CNI and addressing challenges / issues hit.
  • GRPC communication is working.
  • Should be nearing code-complete in a week or so.

Wei:

  • Performance measurement plan work in progress.
    • Need plan to get metrics for provisioning speed.

06/24/2021

Vinay:

  • Finishing up b/w QoS configuration, hitting issues in CLI.

Phu:

  • GitHub Actions VMs cannot allocate agent metadata BPF maps; they have low resources (2 cores/3G).
  • Will use self-hosted VM instead.

Wei:

  • Exploring what LT ideas to pick up and work on.

Hong:

  • Ramping up on CNI - how it works. IPAM.

07/01/2021

Vinay:

  • Bandwidth QoS work going on. 80% complete.
  • Working with KubeEdge to help them with Mizar setup.

Wei:

  • Performance comparison plan document
    • ETA 07/08.
  • Understanding the difference between our management plane and Alcor management plane.

Hong:

  • On vacation next Tue, Wed.
  • Complete design doc with "pseudocode" of what actions are taken on cmdAdd / cmdDel.

06/21/2021

Vinay:

  • 20.0.0.0 -> 10.0.0.0
  • Documentation cleanup
  • coredns crashing - scaled endpoint regression.
  • Simple b/w accounting and adjustment for QoS project.
  • Get rid of master branch, merge dev-next latest release to main branch.
  • Create a location for storing team stuff (NN special firmware etc)

Phu:

  • Investigating GitHub Actions issue with Azure.
    • ping test does not run successfully.
  • Create design doc for Qian's requirements.
  • Qian's mizar as a service.
  • Phu out July 1st - 17th.

Hong:

  • To look into how the Python CNI works and scope out the work to rewrite it in Go.
    • Look at flannel or another CNI for reference.
  • Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.

Wei:

  • Following up with Netronome on special firmware - ticket filed.
  • bpf_map_update from the XDP datapath uses a spinlock, which kills performance.
  • We have an updated version of the firmware.
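
The spinlock concern noted above can be illustrated outside the kernel. The following is a user-space Go analogy, not BPF code and not taken from the Mizar source: it contrasts a counter whose every update takes a lock with one updated in place by a single atomic add. The types `lockedCounter` and `atomicCounter` and the helper `run` are hypothetical names. The sketch shows only that both approaches produce the same total under concurrent writers; it does not measure performance.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lockedCounter serializes every update behind a mutex, loosely
// analogous to a map update that takes a lock on each write.
type lockedCounter struct {
	mu    sync.Mutex
	bytes uint64
}

func (c *lockedCounter) add(n uint64) {
	c.mu.Lock()
	c.bytes += n
	c.mu.Unlock()
}

// atomicCounter updates the value in place with one atomic add,
// loosely analogous to an atomic fetch-and-add on a map value.
type atomicCounter struct {
	bytes uint64
}

func (c *atomicCounter) add(n uint64) { atomic.AddUint64(&c.bytes, n) }
func (c *atomicCounter) load() uint64 { return atomic.LoadUint64(&c.bytes) }

// run starts `workers` goroutines that each call add(size) `packets`
// times, then waits for all of them to finish.
func run(workers, packets int, size uint64, add func(uint64)) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < packets; i++ {
				add(size)
			}
		}()
	}
	wg.Wait()
}

func main() {
	var lc lockedCounter
	var ac atomicCounter
	run(4, 1000, 1500, lc.add)
	run(4, 1000, 1500, ac.add)
	// All workers are done, so reading lc.bytes without the lock is safe.
	fmt.Println("locked total:", lc.bytes, "atomic total:", ac.load())
}
```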

06/17/2021

Vinay:

  • Get rid of master branch, merge dev-next latest release to main branch.
  • 20.0.0.0 -> 10.0.0.0 .
  • coredns crashing - scaled endpoint regression.
  • Documentation cleanup
  • wiki for "Dev tips and tricks"
  • Simple b/w accounting and adjustment for QoS project.

Phu:

  • Investigating GitHub Actions issue with Azure.
  • Agent tail-call offload to Netronome NIC.
    • Does not work. Cannot add offloaded XDP program ID to jump table.
  • Edge team and Alcor team questions.
  • Create design doc for Qian's requirements.
  • Qian's mizar as a service.
  • Phu out July 5th - 17th.

Cathy:

Hong:

  • Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.

Wei:

  • Following up with Netronome on special firmware - ticket filed.

06/14/2021

Vinay:

  • Get rid of master branch, merge dev-next latest release to main branch.
  • 20.0.0.0 -> 10.0.0.0 .
  • coredns crashing - scaled endpoint regression.
  • Documentation cleanup and wiki for "Dev tips and tricks"
  • Simple b/w accounting and adjustment for QoS project.

Phu:

  • Agent tail-call offload to Netronome NIC.
  • Edge team and Alcor team questions.
  • Create design doc for Qian's requirements.
  • Qian's mizar as a service.
  • Phu out July 5th - 17th.

Cathy:

Hong:

  • Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.