730 test
Yunwen Bai edited this page Jul 18, 2022
components
Test clients to simulate schedulers (only the registration, list, and watch parts plus the scheduler cache).
Each scheduler will list 25K nodes and watch them -- a fixed dimension for now; scheduler logic itself is not under test.
|
|
Resource management service (730)
|
|
Test data service to simulate the regional resource managers; each instance simulates one region with 10 RP clusters
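To make the test-client role above concrete, here is a minimal sketch of the scheduler-side cache kept in sync by list+watch. The event and cache types are hypothetical illustrations, not the actual arktos APIs:

```go
package main

import "fmt"

// EventType mirrors a Kubernetes-style watch event kind (hypothetical).
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// NodeEvent is one entry on the watch stream (hypothetical shape).
type NodeEvent struct {
	Type EventType
	Name string
}

// NodeCache is the scheduler-side node cache.
type NodeCache struct {
	nodes map[string]struct{}
}

// List seeds the cache from an initial full listing (the "list" phase).
func (c *NodeCache) List(names []string) {
	c.nodes = make(map[string]struct{}, len(names))
	for _, n := range names {
		c.nodes[n] = struct{}{}
	}
}

// Apply folds a single watch event into the cache (the "watch" phase).
func (c *NodeCache) Apply(ev NodeEvent) {
	switch ev.Type {
	case Added, Modified:
		c.nodes[ev.Name] = struct{}{}
	case Deleted:
		delete(c.nodes, ev.Name)
	}
}

// Len reports how many nodes the cache currently tracks.
func (c *NodeCache) Len() int { return len(c.nodes) }

func main() {
	c := &NodeCache{}
	c.List([]string{"node-1", "node-2"})
	c.Apply(NodeEvent{Type: Added, Name: "node-3"})
	c.Apply(NodeEvent{Type: Deleted, Name: "node-1"})
	fmt.Println(c.Len()) // 2
}
```

In the real test each simulated scheduler would seed the cache with its 25K-node list and then apply the watch stream from the resource management service.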
basic test criteria
Resource management service sustains 500K simulated nodes for 10-30 minutes.
N node changes (updates and adds): at the ~2 min mark, 5K changes; at the ~5 min mark, 25K changes; at the ~7 min mark, 1K changes
-- then repeat.
Basic baseline cases run for 10 minutes.
Basic test cases run for 30 minutes.
_(Not a must-have bar): node changes propagate from simulator to scheduler within 5 seconds._
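The churn schedule above can be sketched as a small helper. The 7-minute cycle length is an assumption read off the ~2/~5/~7 minute marks; the plan only says "repeat":

```go
package main

import "fmt"

// churnAt returns the number of node changes to inject at a given
// minute mark, assuming the ~7-minute cycle above repeats indefinitely.
// The cycle length and exact marks are assumptions from the test plan.
func churnAt(minute int) int {
	switch minute % 7 {
	case 2:
		return 5000 // ~2 min mark
	case 5:
		return 25000 // ~5 min mark
	case 0:
		if minute == 0 {
			return 0 // nothing at test start
		}
		return 1000 // ~7 min mark, end of each cycle
	}
	return 0
}

func main() {
	// Print the injection plan for the first two cycles.
	for m := 0; m <= 14; m++ {
		if n := churnAt(m); n > 0 {
			fmt.Printf("minute %2d: inject %d node changes\n", m, n)
		}
	}
}
```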
test env setup
Single Resource management service instance on one server machine
(start with a GCE n1-32 VM: 32 cores, 120GB RAM, 500GB SSD, premium network).
(Client machines and resource simulator machines can start with n1-16 VMs.)
N test data service machines, one machine simulating 1 region; for a 10-region test, there will be 10 machines.
M test client machines, one machine simulating 10 schedulers; for a 100-scheduler test, there will be 10 machines.
data config
Basic cases:
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
baseline | 500K | 2 | 250K | 10 | 25K | 1 | 25K |
baseline | 500K | 2 | 250K | 10 | 25K | 10 | 25K |
baseline | 500K | 2 | 250K | 10 | 25K | 10 | 5K |
Test | 500K | 2 | 250K | 10 | 25K | 20 | 25K |
goals:
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
field goal | 1m | 5 | 200K | 10 | 20K | 40 | 25K |
touch down | 2m | 10 | 200K | 20 | 10K | 200 | 10K |
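As a sanity check, the dimensions in the tables above should multiply out: regions × RPs per region × nodes per RP must equal the total node count. A small sketch verifying each row:

```go
package main

import "fmt"

// testCase captures one row of the data-config tables.
type testCase struct {
	name         string
	totalNodes   int
	regions      int
	rpsPerRegion int
	nodesPerRP   int
}

func main() {
	cases := []testCase{
		{"baseline", 500_000, 2, 10, 25_000},
		{"field goal", 1_000_000, 5, 10, 20_000},
		{"touch down", 2_000_000, 10, 20, 10_000},
	}
	for _, c := range cases {
		got := c.regions * c.rpsPerRegion * c.nodesPerRP
		fmt.Printf("%s: %d x %d x %d = %d (want %d, ok=%v)\n",
			c.name, c.regions, c.rpsPerRegion, c.nodesPerRP,
			got, c.totalNodes, got == c.totalNodes)
	}
}
```

All three configurations are internally consistent: 2×10×25K = 500K, 5×10×20K = 1M, and 10×20×10K = 2M.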
730 test client Operations
Each scheduler registers, lists, and watches its nodes for the designated time. TODO: add the service APIs here.
Test automation
One option:
- containerize the region manager and scheduler simulator tools
- set up an admin k8s cluster with up to 20 nodes; label 10 for region managers and 10 for schedulers
- simulators run as containers, deployed with node labels
Test data
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
field goal | 1m | 5 | 200K | 10 | 20K | 40 | 25K |
Current findings:
- With logging level 4 and all components in the same region, prolonged watches are under 1%, i.e. 99% of watch sessions complete in under 1 second.
- With logging level 9 and all components in the same region, prolonged watches rise to 26-31%.
- With logging level 4 and components spread across regions, prolonged watches range from 14% to 64%, depending on how the regions are spread.
Issues to be investigated
- env setup to support multiple regions for simulators and clients
- reduce the perf impact of debug logging -- re-adjust some level-9 (L9) log statements
- investigate root causes of the cross-region perf impact and fixes