
730 test


components

Test clients simulating schedulers (only the registration, list, and watch paths plus the scheduler cache;
each scheduler lists 25K nodes and then watches -- a fixed dimension for now, since scheduler logic is not under test)
     |
     |
Resource management service (730)
     |
     |
Test data service simulating the regional resource managers, each instance simulating one region with 10 RP clusters

basic test criteria

   The Resource management service sustains 500K simulated nodes for 10-30 minutes under continuous
   node churn (updates and adds): ~5K changes at the ~2 minute mark, ~25K changes at the ~5 minute mark,
   and ~1K changes at the ~7 minute mark, repeating this pattern for the test duration
   (see the sketch after this list).

   Basic baseline cases run for a 10-minute duration.
   Basic test cases run for a 30-minute duration.

   _(Not a must-have bar): node changes propagate from simulator to scheduler within 5 seconds._
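A minimal sketch of that churn pattern, assuming a hypothetical `applyChanges` hook into the test data service; the offsets and batch sizes come from the criteria above, everything else is illustrative:

```go
package main

import (
	"log"
	"time"
)

// churnStep is one batch of node changes (adds/updates) applied at an offset
// into each cycle; the offsets and batch sizes mirror the criteria above.
type churnStep struct {
	offset  time.Duration
	changes int
}

var cycle = []churnStep{
	{offset: 2 * time.Minute, changes: 5_000},
	{offset: 5 * time.Minute, changes: 25_000},
	{offset: 7 * time.Minute, changes: 1_000},
}

// runChurn repeats the change pattern until the test duration elapses.
// applyChanges is a hypothetical hook into the test data service that emits
// the given number of node add/update events.
func runChurn(duration time.Duration, applyChanges func(n int)) {
	start := time.Now()
	for time.Since(start) < duration {
		cycleStart := time.Now()
		for _, step := range cycle {
			if wait := step.offset - time.Since(cycleStart); wait > 0 {
				time.Sleep(wait)
			}
			if time.Since(start) >= duration {
				return
			}
			log.Printf("applying %d node changes", step.changes)
			applyChanges(step.changes)
		}
	}
}

func main() {
	// A 30-minute test case with a no-op change hook, for illustration.
	runChurn(30*time.Minute, func(n int) { /* call into the test data service here */ })
}
```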

test env setup

A single Resource management service instance runs on one server machine
(start with a GCE n1 VM with 32 cores and 120GB RAM, a 500GB SSD, and premium network tier).

(The client side and the resource simulator side can start with n1 16-core VMs.)

N test data service machines, one machine simulating 1 region, so a 10-region test needs 10 machines.
M test client machines, one machine simulating 10 schedulers, so a 100-scheduler test needs 10 machines.

data config

Basic cases:

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| baseline | 500K | 2 | 250K | 10 | 25K | 1 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 5K |
| test | 500K | 2 | 250K | 10 | 25K | 20 | 25K |

goals:

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| field goal | 1M | 5 | 200K | 10 | 20K | 40 | 25K |
| touch down | 2M | 10 | 200K | 20 | 10K | 200 | 10K |
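For reference, the dimensions used in these tables map onto a small simulator config; a sketch in Go (the struct and field names are illustrative, not the actual tool's options):

```go
package config

// TestDataConfig captures one row of the tables above; field names are
// illustrative, not necessarily the simulator's actual flags.
type TestDataConfig struct {
	Regions           int // test data service instances, one per region
	RPsPerRegion      int // RP clusters simulated per region
	NodesPerRP        int // nodes simulated per RP cluster
	Schedulers        int // scheduler simulators on the client side
	NodesPerScheduler int // nodes each scheduler lists and watches
}

// TotalNodes is derived rather than configured.
func (c TestDataConfig) TotalNodes() int {
	return c.Regions * c.RPsPerRegion * c.NodesPerRP
}

// The "field goal" row: 5 regions x 10 RPs x 20K nodes = 1M total nodes.
var fieldGoal = TestDataConfig{
	Regions:           5,
	RPsPerRegion:      10,
	NodesPerRP:        20_000,
	Schedulers:        40,
	NodesPerScheduler: 25_000,
}
```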

730 test client Operations

Each scheduler registers, lists its nodes, and then watches them for the designated test duration. TODO: add the service APIs here.
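Since the service APIs are still to be filled in, the per-scheduler flow is sketched below against placeholder types; `Register`, `List`, and `Watch` are assumed method names, not the actual API:

```go
package simulator

import (
	"context"
	"time"
)

// Placeholder types; the real service APIs are still to be documented above.
type Node struct{ ID string }

type Event struct{ Node Node }

type Client interface {
	Register(ctx context.Context) (schedulerID string, err error)
	List(ctx context.Context, schedulerID string) (nodes []Node, revision uint64, err error)
	Watch(ctx context.Context, schedulerID string, revision uint64) (<-chan Event, error)
}

// runScheduler drives one simulated scheduler for the designated duration:
// register once, list the assigned nodes to build the local cache, then watch
// and apply node changes until the duration elapses.
func runScheduler(ctx context.Context, c Client, duration time.Duration) error {
	id, err := c.Register(ctx)
	if err != nil {
		return err
	}

	nodes, rev, err := c.List(ctx, id) // e.g. 25K nodes per scheduler
	if err != nil {
		return err
	}
	cache := make(map[string]Node, len(nodes))
	for _, n := range nodes {
		cache[n.ID] = n
	}

	// Watch from the list revision and keep the cache up to date; the event
	// channel is assumed to close once the context is cancelled.
	ctx, cancel := context.WithTimeout(ctx, duration)
	defer cancel()
	events, err := c.Watch(ctx, id, rev)
	if err != nil {
		return err
	}
	for ev := range events {
		cache[ev.Node.ID] = ev.Node
	}
	return nil
}
```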

Test automation

One option:

  1. Containerize the region manager and scheduler simulator tools.
  2. Set up an admin k8s cluster with up to 20 nodes; label 10 for region managers and 10 for schedulers.
  3. Run the simulators as containers, deployed to the labeled nodes via node selectors (see the sketch below).
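A minimal client-go sketch of step 3, creating one scheduler-simulator Deployment pinned to the labeled nodes; the label key/value, image name, and kubeconfig path are placeholders:

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset against the admin cluster (kubeconfig path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/admin-kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	replicas := int32(10)
	labels := map[string]string{"app": "scheduler-simulator"}

	// One Deployment of scheduler-simulator pods, pinned via nodeSelector to
	// the admin-cluster nodes labeled for schedulers (label and image are placeholders).
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "scheduler-simulator"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					NodeSelector: map[string]string{"simulator-role": "scheduler"},
					Containers: []corev1.Container{{
						Name:  "scheduler-simulator",
						Image: "example.io/730/scheduler-simulator:latest",
					}},
				},
			},
		},
	}

	if _, err := clientset.AppsV1().Deployments("default").Create(context.TODO(), dep, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```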

Test data

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| field goal | 1M | 5 | 200K | 10 | 20K | 40 | 25K |

Current findings are (the prolonged-watch percentage is computed as in the sketch after this list):

  • With logging level 4 and all components in the same region, prolonged watches are under 1%, i.e. 99% of watch sessions complete in less than 1 second.
  • With logging level 9 and all components in the same region, prolonged watches are 26-31%.
  • With logging level 4 and components spread across regions, prolonged watches are 14-64%, depending on how the regions are spread.
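A sketch of how that percentage could be computed, assuming a "prolonged" watch is a session whose delivery latency is 1 second or longer (the threshold comes from the finding above; the measurement hook is illustrative):

```go
package metrics

import "time"

// prolongedPercent returns the share of watch sessions lasting 1 second or
// longer, given per-session latencies recorded at the scheduler simulator.
func prolongedPercent(latencies []time.Duration) float64 {
	if len(latencies) == 0 {
		return 0
	}
	prolonged := 0
	for _, d := range latencies {
		if d >= time.Second {
			prolonged++
		}
	}
	return 100 * float64(prolonged) / float64(len(latencies))
}
```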

Issues to be investigated

  • Env setup to support multiple regions for simulators and clients.
  • Reduce the perf impact of debug logging -- re-adjust some of the level-9 log statements.
  • Investigate root causes of, and fixes for, the cross-region perf impact.