Feature Release Plan 2021 08 30
-
QoS for Pods - Multiple traffic classes and priorities
- Owner: Vinay
-
Summary: Guarantee network bandwidth for high priority containers
- Finish bandwidth use monitoring and dynamic adjustment of low priority pod bandwidth limits. DONE. - Vinay
- Design and add support for multiple priority levels (TC based) (HQ ask). DONE - Vinay.
- Design and add mechanism to perform granular bandwidth classification and prioritization. - Vinay (stretch)
-
Status:
- Requirements mostly clear.
- 100% complete (green)
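
A minimal sketch of the "monitor high-priority usage, then adjust the low-priority limit" idea from the summary above. This is not Mizar's implementation: the function name, the headroom percentage, and the floor value are all hypothetical, chosen only to illustrate the arithmetic.

```python
def low_priority_limit(link_capacity_bps: int,
                       high_priority_usage_bps: int,
                       headroom_pct: int = 10,
                       floor_bps: int = 100_000_000) -> int:
    """Return a bandwidth limit (bits/s) for low-priority pods.

    Assumption (not from the release plan): high-priority traffic gets its
    measured usage plus `headroom_pct` percent reserved, and low-priority
    pods share whatever is left, but never less than `floor_bps`.
    """
    # Reserve measured high-priority usage plus some headroom for bursts.
    reserved = high_priority_usage_bps * (100 + headroom_pct) // 100
    available = link_capacity_bps - reserved
    # Clamp between the floor and the link capacity.
    return max(floor_bps, min(available, link_capacity_bps))
```

A monitoring loop would call this periodically with fresh usage statistics and apply the result as the new low-priority egress limit. The floor keeps low-priority pods from being starved completely when high-priority traffic saturates the link.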
-
CentaurusEdge feature requirements
- Owner: Phu
- Summary: Assign pods to VPCs
-
Status:
- Received requirements from Peng Du & team.
- PR Merged.
- 100% complete (green)
- Working on bug-fixes.
-
Mizar CNI in golang
- Owner: Hong Chang
- Summary: Mizar CNI is written in Python and needs to be rewritten in Go
-
Status:
- Coding nearly complete, testing complete. PR out for review, blocked on lack of reviewer bandwidth.
- Added improvement for building docker containers.
- 100% complete (green)
-
Mizar Service Regression Fix
- Owner: Hong Chang
- Summary: Mizar scaled endpoint regression. Not working as of today.
-
Status:
- Kubernetes Service and CoreDNS are running.
- PR merged in.
- Scaled endpoint functional test not passing.
- Instability with UDP test.
- 80% complete
-
Automated tests & CI & Stability
- Owner: Phu(lead) + everyone
-
Summary: Framework + automated E2E tests
- Add framework to run E2E tests locally and in CI
- Write E2E tests for existing features & new features (team)
- Deploy K8s(1.21)+Mizar in GCE or AWS via kube-up to facilitate easy E2E dev/testing in real cluster (Vinay) (Stretch)
- Automated deployment can help perf testing as well
- Very long stretch for 8/30
-
Status:
- 90% complete on CI framework
- ??% on tests
-
Performance Metrics and Comparison
- Owner: Wei
-
Summary: Performance benchmark and analysis
- Primarily focus on small-scale tests on throughput, latency, etc in this phase.
- Analyze results to find potential issues we need to address.
-
Status:
- Get non-debug binaries working on Ubuntu 20.04.
- Had first set of benchmarking numbers based on existing hardware and code base.
- Plan to test on new nics with latest code and bug fixes.
- 100% complete (green)
-
Stability and optimization goals (Stretch)
- Owner: TBD
-
Summary: Various improvements
- Remove dependence on “eth0” interface name in Mizar code
- Mizar deployment yaml create / delete / create (install & cleanup & install) in K8s
- Pod create-delete-create with Mizar networking in K8s
- Nice to have:
- Remove label options from Geneve frames when labels are not in use.
- Remove scaled_EP and RTS options when not in use
- AWS setup: intermittent connectivity failure on the initial attempt from a remote pod to a pod (iperf receiver / ping dest) that is on the same host as the bouncer.
- root@ip-192-168-1-144:~# kubectl exec -ti netpod3 -- iperf3 -t 15 -u -p 8888 -c 20.0.0.82
- iperf3: error - unable to connect to server: No route to host
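
One possible way to approach the "remove dependence on eth0" item above is to discover the interface that carries the default route instead of hard-coding a name. The sketch below parses the Linux `/proc/net/route` table; the function names are hypothetical and this is not how Mizar resolves the interface today.

```python
from typing import Optional

RTF_UP = 0x0001  # kernel route flag: route is usable


def default_interface(route_table: str) -> Optional[str]:
    """Return the interface of the first usable default route, or None.

    `route_table` is the text of /proc/net/route: a header line followed by
    whitespace-separated rows (Iface, Destination, Gateway, Flags, ...).
    """
    for line in route_table.strip().splitlines()[1:]:  # skip the header
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, destination, flags = fields[0], fields[1], int(fields[3], 16)
        # A destination of 00000000 marks the default route.
        if destination == "00000000" and flags & RTF_UP:
            return iface
    return None


def read_default_interface(path: str = "/proc/net/route") -> Optional[str]:
    """Convenience wrapper that reads the live routing table on Linux."""
    with open(path) as f:
        return default_interface(f.read())
```

On a host whose primary NIC is named `ens5`, for example, `read_default_interface()` would return `"ens5"` rather than assuming `eth0`.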
-
Hardware offload (ongoing, not 8/30)
- Owner: Wei(lead) + USTC
- Summary: Offload bouncer/divider
- Status:
- USTC continue to refactor code for hardware offload.
-
Mizar - Arktos integration (ongoing, not 8/30)
- Owner: Vinay + Click2Cloud
- Summary: Get Mizar working with Arktos (including network policy).
-
Status:
- 20% complete (looking optimistic)
- PR out for kernel update script
- Mizar/Zeta integration, or convergence into one data plane platform
- Complete Zeta DFT (stateful) features
- Support more types of Arktos Services (Cluster IP, Node Port and Load Balancing)
- Change Mizar/Zeta controllers (control plane) to Go
- Chaining of XDP programs dynamically for complete e2e network services – service upgrade at runtime.
- Mizar supports CNI-Genie (multiple networks)?
-
Vinay:
- Multi-level QoS PR checked in.
- Will start looking into checksum failure issue. (9/30 release blocker)
- Validate switch support for QoS (not 9/30 release blocker)
-
Hong Chang:
- Investigate bootstrap changing go.mod issue. (9/30 release blocker)
- Look into iperf repro of the checksum issue.
-
Wei:
- Fixing CI test issue in change that removes eth0 dependency.
-
Vinay:
- Multi-level QoS is working, just not the way I expected it to.
- Will generate PR by EOD tomorrow.
-
Hong Chang:
- Using existing test code to debug the checksum issue using just containers.
- Cannot make 9/30 release.
-
Wei:
- PR for benchmark=true merged.
- PR for eth0 ready tomorrow.
- Will have some early perf numbers for review by tomorrow.
-
Vinay:
- Still looking to get tc prio working correctly.
-
Phu:
- Helping me with the unit tests and switch configuration.
-
Hong Chang:
- Working on intermittent checksum failure issue.
- We have a suboptimal solution. Looking for the right fix.
-
Wei:
- Running the benchmark tests to get numbers.
-
Vinay:
- PR implementing multiple levels of priority submitted. Working on the last piece of puzzle - tc prio.
- Need to add tests and fix currently broken tests.
-
Hong Chang:
- Bootstrap regression fixed and merged.
- Working on checksum issue, not much progress debugging.
-
Wei:
- Investigating TCP tx checksum failure in on-prem systems (works on AWS) (different issue from Hong's problem)
-
Vinay:
- Working on implementing multiple levels of priority, investigating why tc setup is not working.
-
Phu:
- Working on network creation validation check.
- PengDu hitting issue with adding new node (regular k8s)
-
Hong:
- Regression in bootstrap script. PR fixing it ETA Friday.
- Possibly missed checksum recalculation. Investigating; the regression is higher priority.
-
Wei:
- Investigate checksum issue with Hong.
- File an issue for ping not working after pod delete and re-create (possible cleanup bug)
-
Vinay:
- PR is in review, fixing CI and tests still pending.
- Working on implementing multiple priority levels and classes.
-
Phu:
- Validation of network creation for P.Du's feature.
- No known issues blocking edge team as of now.
-
Hong Chang:
- Last bit of review changes to golang CNI PR
- Continue investigation of flakiness in service
-
Wei:
- Perf numbers comparing us vs Cilium
- PR for benchmark=true. ETA this week.
-
Vinay:
- PR is in review.
- Doing code reviews for C2C and Hong's changes.
- Monitor changes and tests by Wednesday.
- On vacation 08/27 to 09/05.
-
Phu:
- PR out for review.
- Reviewing other PRs.
-
Vinay:
- PR is out for review.
- Doing code reviews for C2C and Hong's changes.
- Starting code changes for monitor and configure low-priority b/w.
-
Hong Chang:
- Blocked by issue building daemon docker image.
- Working on service XDP regression.
-
Phu:
- Working on testing the kubeedge requirement changes.
-
Wei:
- Mizar works on 20.04 in AWS.
- Working on 21.04.
-
Vinay:
- Separate program to lookup the stats map and update it with sync_fetch_add works.
- Tail-call works.
- Working on passing action through xdp_md. If it works, PR by EOD tomorrow.
-
Phu:
- Implementation of kubeedge requirements done.
- PR by next Monday.
-
Hong Chang:
- PR out for review.
- Will look into service-name XDP regression/issue in Mizar (Issue https://github.com/CentaurusInfra/mizar/issues/506 )
-
Wei:
- Working on getting Mizar running with v1.22
- Should get Mizar running by next Monday.
Next release moved to 09/30.
-
Vinay:
- Separate program to lookup the stats map and update it with sync_fetch_add works.
- Working on tail_call of this program - hope this works.
-
Phu:
- Design doc done. Ready for merge.
- Working on implementation. Sometime before 9/30.
-
Hong Chang:
- PR ready for review.
- I'll review it tomorrow.
-
Wei:
- Mizar with Cilium infra.
- Working on fixing the eth0 name dependency issue.
-
Hong Chang:
- Unit tests in progress. ETA for PR Thursday.
-
Phu:
- Design doc reviewed and nearly complete.
- In parallel, working on implementation.
-
Wei:
- Starting to get Mizar working with Cilium's approach to perf testing.
-
Vinay:
- Still stuck: the tran_agent XDP program fails to load once the stats map lookup/write code is added.
- Need ideas on how to debug this.
Hong Chang:
- CNI work nearly done. Adding tests.
- Refactored CNI code to make it easier for test.
- PR ETA mid-next week.
Phu:
- Design doc out for review.
- Getting head-start on implementation.
Wei:
- Working on getting Cilium perf-framework and tests running.
- Going well so far. Might hit issues with mizar.
Vinay:
- Working on implementing basic statistics to compute high-pri b/w usage.
- Hitting issues loading trn-agent-xdp after adding map lookup.
Hong Chang:
- CI/CD is passing now. Adding tests.
Phu:
- Design doc nearly done. We will share with PDu team when reviewed and ready.
Vinay:
- Design doc and PR out. Awaiting review.
- Working on adding tests and bandwidth control.
Wei:
- Setting up test environment.
Hong Chang:
- CNI code-complete. Manual testing done.
- Fixing issues with go build process, need Makefile updates to get required go packages.
Phu:
- Design doc in progress. ETA next Monday.
- PR for CI issue.
Wei:
- Perf test environment setup done.
- Starting basic testing tomorrow.
Vinay:
- PR out for review. Design doc out for review. Presenting to HQ today.
Vinay:
- PR and design doc out for review.
- Worked on deploying Arktos in AWS to unblock C2C effort.
- After working around the kubeadm regression, hit another regression.
- Aborting the work. We will ask them to use GCE.
Hong Chang:
- CNI code is almost done.
- Manual testing is next.
Phu:
- Welcome back :)
Wei:
- Out of office 07/19.
Vinay:
- Sent out PR for CLI based config of egress-bandwidth-limit.
- Stuck at not being able to successfully bpf_map_lookup_elem. Tried various things.
- Team: Please review PR to see if I missed something.
- Design doc nearly complete. Hope to finish it by EOD.
Wei:
- Multiple XDP programs on the same NIC work; this needs kernel 5.10+ and Ubuntu 21.04.
- Dunant is getting machines ready. ETA unknown.
- Out next Monday.
Hong Chang:
- CNI code: Go version of moving the veth pair into the pod's network namespace.
Vinay:
- Figured out the CLI issue and testing code with key (source addr).
- ETA tomorrow.
Hong:
- Implementing CNI and addressing challenges / issues hit.
- GRPC communication is working.
- Should be nearing code-complete in a week or so.
Wei:
- Performance measurement plan work in progress.
- Need plan to get metrics for provisioning speed.
Vinay:
- Finishing up b/w QoS configuration, hitting issues in CLI.
Phu:
- GitHub Actions VMs cannot allocate the agent metadata BPF maps; they have low resources (2 cores/3G).
- Will use self-hosted VM instead.
Wei:
- Exploring what LT ideas to pick up and work on.
Hong:
- Ramping up on CNI - how it works. IPAM.
Vinay:
- Bandwidth QoS work going on. 80% complete.
- Working with kubeEdge to help them on mizar setup.
Wei:
- Performance comparison plan document
- ETA 07/08.
- Understanding the difference between our management plane and Alcor management plane.
Hong:
- On vacation next Tue, Wed.
- Complete design doc with "pseudocode" of what actions are taken on cmdAdd / cmdDel.
Vinay:
- 20.0.0.0 -> 10.0.0.0
- Documentation cleanup
- coredns crashing - scaled endpoint regression.
- Simple b/w accounting and adjustment for QoS project.
- Get rid of master branch, merge dev-next latest release to main branch.
- Create a location for storing team stuff (NN special firmware etc)
Phu:
- Investigating Github actions issue with Azure.
- ping test does not run successfully.
- Create design doc for Qian's requirements.
- Qian's mizar as a service.
- Phu out July 1st - 17th.
Hong:
- To look into how py CNI works, scope out the work to write in go.
- Look at flannel or another CNI for reference.
- Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.
Wei:
- Following up with Netronome on special firmware - ticket filed.
- bpf_map_update from XDP datapath uses spinlock which kills perf.
- We have updated version of firmware.
Vinay:
- Get rid of master branch, merge dev-next latest release to main branch.
- 20.0.0.0 -> 10.0.0.0 .
- coredns crashing - scaled endpoint regression.
- Documentation cleanup
- wiki for "Dev tips and tricks"
- Simple b/w accounting and adjustment for QoS project.
Phu:
- Investigating Github actions issue with Azure.
- Agent tail-call offload to Netronome NIC.
- Does not work. Cannot add offloaded XDP program ID to jump table.
- Edge team and Alcor team questions.
- Create design doc for Qian's requirements.
- Qian's mizar as a service.
- Phu out July 5th - 17th.
Cathy:
- Arktos<->Mizar integration, revert back to working version.
- Issue https://github.com/CentaurusInfra/mizar/pull/497
- Try using gdb to debug.
- Issue https://github.com/CentaurusInfra/mizar/pull/487
- No progress yet.
Hong:
- Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.
Wei:
- Following up with Netronome on special firmware - ticket filed.
Vinay:
- Get rid of master branch, merge dev-next latest release to main branch.
- 20.0.0.0 -> 10.0.0.0 .
- coredns crashing - scaled endpoint regression.
- Documentation cleanup and wiki for "Dev tips and tricks"
- Simple b/w accounting and adjustment for QoS project.
Phu:
- Agent tail-call offload to Netronome NIC.
- Edge team and Alcor team questions.
- Create design doc for Qian's requirements.
- Qian's mizar as a service.
- Phu out July 5th - 17th.
Cathy:
- Arktos<->Mizar integration, revert back to working version.
- Issue https://github.com/CentaurusInfra/mizar/pull/497
- Try using gdb to debug.
- Issue https://github.com/CentaurusInfra/mizar/pull/487
- No progress yet.
Hong:
- Investigating Dr. Xiong's suggestion on AWS advanced VPC idea.