
730 test


components

Test clients simulating schedulers (only the registration, list, and watch paths plus the scheduler cache;
each scheduler lists 25K nodes and then watches -- a fixed dimension for now, since scheduler logic is not under test)
     |
     |
Resource management service (730)
     |
     |
Test data service simulating the regional resource managers, each instance simulating one region with 10 RP clusters

basic test criteria

   The Resource management service sustains 500K simulated nodes for 10-30 minutes under continuous
   node churn (updates and adds): ~5K changes at the ~2 minute mark, ~25K changes at the ~5 minute mark,
   and ~1K changes at the ~7 minute mark, repeating this pattern for the test duration
   (see the sketch after this list).

   Basic baseline cases run for a 10-minute duration.
   Basic test cases run for a 30-minute duration.

   _(Not a must-have bar): node changes propagate from simulator to scheduler within 5 seconds._
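A minimal sketch of that churn pattern, assuming a hypothetical `applyChanges` hook into the test data service; the offsets and batch sizes come from the criteria above, everything else is illustrative:

```go
package main

import (
	"log"
	"time"
)

// churnStep is one batch of node changes (adds/updates) applied at an offset
// into each cycle; the offsets and batch sizes mirror the criteria above.
type churnStep struct {
	offset  time.Duration
	changes int
}

var cycle = []churnStep{
	{offset: 2 * time.Minute, changes: 5_000},
	{offset: 5 * time.Minute, changes: 25_000},
	{offset: 7 * time.Minute, changes: 1_000},
}

// runChurn repeats the change pattern until the test duration elapses.
// applyChanges is a hypothetical hook into the test data service that emits
// the given number of node add/update events.
func runChurn(duration time.Duration, applyChanges func(n int)) {
	start := time.Now()
	for time.Since(start) < duration {
		cycleStart := time.Now()
		for _, step := range cycle {
			if wait := step.offset - time.Since(cycleStart); wait > 0 {
				time.Sleep(wait)
			}
			if time.Since(start) >= duration {
				return
			}
			log.Printf("applying %d node changes", step.changes)
			applyChanges(step.changes)
		}
	}
}

func main() {
	// A 30-minute test case with a no-op change hook, for illustration.
	runChurn(30*time.Minute, func(n int) { /* call into the test data service here */ })
}
```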

test env setup

A single Resource management service instance runs on one server machine
(start with a GCE n1 VM with 32 cores and 120GB RAM, a 500GB SSD, and premium network tier).

(The client side and the resource simulator side can start with n1 16-core VMs.)

N test data service machines, one machine simulating 1 region, so a 10-region test needs 10 machines.
M test client machines, one machine simulating 10 schedulers, so a 100-scheduler test needs 10 machines.

data config

Basic cases:

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| baseline | 500K | 2 | 250K | 10 | 25K | 1 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 5K |
| test | 500K | 2 | 250K | 10 | 25K | 20 | 25K |

goals:

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| field goal | 1M | 5 | 200K | 10 | 20K | 40 | 25K |
| touch down | 2M | 10 | 200K | 20 | 10K | 200 | 10K |
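For reference, the dimensions used in these tables map onto a small simulator config; a sketch in Go (the struct and field names are illustrative, not the actual tool's options):

```go
package config

// TestDataConfig captures one row of the tables above; field names are
// illustrative, not necessarily the simulator's actual flags.
type TestDataConfig struct {
	Regions           int // test data service instances, one per region
	RPsPerRegion      int // RP clusters simulated per region
	NodesPerRP        int // nodes simulated per RP cluster
	Schedulers        int // scheduler simulators on the client side
	NodesPerScheduler int // nodes each scheduler lists and watches
}

// TotalNodes is derived rather than configured.
func (c TestDataConfig) TotalNodes() int {
	return c.Regions * c.RPsPerRegion * c.NodesPerRP
}

// The "field goal" row: 5 regions x 10 RPs x 20K nodes = 1M total nodes.
var fieldGoal = TestDataConfig{
	Regions:           5,
	RPsPerRegion:      10,
	NodesPerRP:        20_000,
	Schedulers:        40,
	NodesPerScheduler: 25_000,
}
```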

730 test client Operations

Each scheduler registers, lists its nodes, and then watches them for the designated test duration. TODO: add the service APIs here.
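Since the service APIs are still to be filled in, the per-scheduler flow is sketched below against placeholder types; `Register`, `List`, and `Watch` are assumed method names, not the actual API:

```go
package simulator

import (
	"context"
	"time"
)

// Placeholder types; the real service APIs are still to be documented above.
type Node struct{ ID string }

type Event struct{ Node Node }

type Client interface {
	Register(ctx context.Context) (schedulerID string, err error)
	List(ctx context.Context, schedulerID string) (nodes []Node, revision uint64, err error)
	Watch(ctx context.Context, schedulerID string, revision uint64) (<-chan Event, error)
}

// runScheduler drives one simulated scheduler for the designated duration:
// register once, list the assigned nodes to build the local cache, then watch
// and apply node changes until the duration elapses.
func runScheduler(ctx context.Context, c Client, duration time.Duration) error {
	id, err := c.Register(ctx)
	if err != nil {
		return err
	}

	nodes, rev, err := c.List(ctx, id) // e.g. 25K nodes per scheduler
	if err != nil {
		return err
	}
	cache := make(map[string]Node, len(nodes))
	for _, n := range nodes {
		cache[n.ID] = n
	}

	// Watch from the list revision and keep the cache up to date; the event
	// channel is assumed to close once the context is cancelled.
	ctx, cancel := context.WithTimeout(ctx, duration)
	defer cancel()
	events, err := c.Watch(ctx, id, rev)
	if err != nil {
		return err
	}
	for ev := range events {
		cache[ev.Node.ID] = ev.Node
	}
	return nil
}
```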

Test automation

One option:

  1. Containerize the region manager and scheduler simulator tools.
  2. Set up an admin k8s cluster with up to 20 nodes; label 10 for region managers and 10 for schedulers.
  3. Run the simulators as containers, deployed to the labeled nodes via node selectors (see the sketch below).
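A minimal client-go sketch of step 3, creating one scheduler-simulator Deployment pinned to the labeled nodes; the label key/value, image name, and kubeconfig path are placeholders:

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset against the admin cluster (kubeconfig path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/admin-kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	replicas := int32(10)
	labels := map[string]string{"app": "scheduler-simulator"}

	// One Deployment of scheduler-simulator pods, pinned via nodeSelector to
	// the admin-cluster nodes labeled for schedulers (label and image are placeholders).
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "scheduler-simulator"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					NodeSelector: map[string]string{"simulator-role": "scheduler"},
					Containers: []corev1.Container{{
						Name:  "scheduler-simulator",
						Image: "example.io/730/scheduler-simulator:latest",
					}},
				},
			},
		},
	}

	if _, err := clientset.AppsV1().Deployments("default").Create(context.TODO(), dep, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```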

Test data

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|------|-------------|---------|------------------|----------------|--------------|------------|--------------------------|
| field goal | 1M | 5 | 200K | 10 | 20K | 40 | 25K |

Current findings are (the prolonged-watch percentage is computed as in the sketch after this list):

  • With logging level 4 and all components in the same region, prolonged watches are under 1%, i.e. 99% of watch sessions complete in less than 1 second.
  • With logging level 9 and all components in the same region, prolonged watches are 26-31%.
  • With logging level 4 and components spread across regions, prolonged watches are 14-64%, depending on how the regions are spread.
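A sketch of how that percentage could be computed, assuming a "prolonged" watch is a session whose delivery latency is 1 second or longer (the threshold comes from the finding above; the measurement hook is illustrative):

```go
package metrics

import "time"

// prolongedPercent returns the share of watch sessions lasting 1 second or
// longer, given per-session latencies recorded at the scheduler simulator.
func prolongedPercent(latencies []time.Duration) float64 {
	if len(latencies) == 0 {
		return 0
	}
	prolonged := 0
	for _, d := range latencies {
		if d >= time.Second {
			prolonged++
		}
	}
	return 100 * float64(prolonged) / float64(len(latencies))
}
```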

Issues to be investigated

  • Env setup to support multiple regions for simulators and clients.
  • Reduce the perf impact of debug logging -- re-adjust some of the level-9 log statements.
  • Investigate root causes of, and fixes for, the cross-region perf impact.