730 test

components

Test clients that simulate schedulers (only the registration, list, and watch parts, plus the scheduler cache).
Each scheduler lists 25K nodes and then watches them. This dimension is fixed for now; scheduler logic is not under test.
     |
     |
Resource management service (730)
     |
     |
Test data services that simulate the regional resource managers; each one simulates one region with 10 RP clusters

simulated data change patterns

Two data change patterns simulate day-to-day node failures in a region and an occasional cluster outage (25K nodes):

DailyRoutine: 10 node failures per region per minute. With 5 simulators, that is 50 node failures per minute and a total of 1,500 node failures over 30 minutes.

RpDown: at a given time, one RP goes down and all nodes in that RP are marked as failed.
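
The arithmetic behind the two patterns can be sketched as follows. This is only an illustration: the class and function names (`DailyRoutine`, `rp_down`) and the data layout are hypothetical, not the simulator's actual code, and the per-region rate is assumed to be per minute.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DailyRoutine:
    """Steady background failures; defaults mirror the figures in this plan."""
    failures_per_region_per_minute: int = 10  # assumed to be a per-minute rate
    simulators: int = 5                       # one simulator per region
    duration_minutes: int = 30

    def failures_per_minute(self) -> int:
        return self.failures_per_region_per_minute * self.simulators

    def total_failures(self) -> int:
        return self.failures_per_minute() * self.duration_minutes


def rp_down(nodes_by_rp: Dict[str, List[str]], rp: str) -> List[str]:
    """Return the ids of every node in `rp`; RpDown marks all of them failed."""
    return list(nodes_by_rp[rp])


routine = DailyRoutine()
print(routine.failures_per_minute(), routine.total_failures())  # 50 1500
```

The defaults reproduce the 50 failures per minute and 1,500 failures in total stated above; changing `simulators` or `duration_minutes` shows how the load scales for other test sizes.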

basic test criteria

   The resource management service sustains 1M simulated nodes for 30 minutes, i.e. register, list, and watch for 30 minutes.
   (Not a must-have bar): node changes propagate from simulator to scheduler within 1 second.

test env setup

A single resource management service instance on one server machine
(start with a GCE n1-32 VM with 32 cores, 120 GB RAM, 500 GB SSD, and premium network)

(the client side and the resource simulator side can start with n1 16-core VMs)

N test data service machines, one machine per simulated region; a 10-region test needs 10 machines.
M test client machines, one machine per 10 simulated schedulers; a 100-scheduler test needs 10 machines.
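
As a rough sizing aid, the machine counts follow directly from these two ratios. The helper below is a hypothetical sketch; rounding up when the scheduler count is not a multiple of 10 is an assumption, since the plan only gives exact multiples.

```python
import math
from typing import Tuple


def machines_needed(regions: int, schedulers: int,
                    schedulers_per_machine: int = 10) -> Tuple[int, int]:
    """Return (N test data service machines, M test client machines)."""
    data_service_machines = regions  # one machine simulates one region
    # Assumption: a partially filled client machine still counts as a whole machine.
    client_machines = math.ceil(schedulers / schedulers_per_machine)
    return data_service_machines, client_machines


print(machines_needed(regions=10, schedulers=100))  # (10, 10)
```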

data config

Basic cases:

Test	Total Nodes	Regions	Nodes per Region	RPs per Region	Nodes per RP	Schedulers	Nodes per scheduler list
baseline	500K	2	250K	10	25K	1	25K
baseline	500K	2	250K	10	25K	10	25K
baseline	500K	2	250K	10	25K	10	5K
Test	500K	2	250K	10	25K	20	25K

goals:

Test	Total Nodes	Regions	Nodes per Region	RPs per Region	Nodes per RP	Schedulers	Nodes per scheduler list
field goal	1m	5	200K	10	20K	10-20	25K - 50K
touch down	2m	10	200K	20	10K	200	10K
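
The rows in both tables are internally consistent: total nodes equals regions times nodes per region, and nodes per region equals RPs per region times nodes per RP. A small check like the one below can guard against typos when new configurations are added. The function names are illustrative only.

```python
def parse_count(text: str) -> int:
    """Convert '25K', '500K', '1m' or '2m' into an integer node count."""
    text = text.strip().lower()
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)


def check_row(total: str, regions: int, per_region: str,
              rps: int, per_rp: str) -> bool:
    """True when total = regions x nodes/region and nodes/region = RPs x nodes/RP."""
    return (
        parse_count(total) == regions * parse_count(per_region)
        and parse_count(per_region) == rps * parse_count(per_rp)
    )


# Values copied from the tables above.
rows = {
    "baseline": ("500K", 2, "250K", 10, "25K"),
    "field goal": ("1m", 5, "200K", 10, "20K"),
    "touch down": ("2m", 10, "200K", 20, "10K"),
}
for name, row in rows.items():
    print(name, check_row(*row))
```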

730 test client Operations

Each scheduler registers, then lists and watches its nodes for the designated time. todo: add the service APIs here:
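
The actual service APIs are still to be documented. Purely to illustrate the register, list, and watch sequence and the scheduler-side cache, here is a self-contained sketch against an in-memory stand-in. Every name in it (`FakeResourceService`, `SimulatedScheduler`, `fail_node`, the event tuple shape) is hypothetical and does not reflect the real service interface.

```python
import queue


class FakeResourceService:
    """In-memory stand-in for the resource management service (NOT the real API)."""

    def __init__(self, node_ids):
        self._free = list(node_ids)
        self._assigned = {}  # client id -> set of node ids
        self._events = {}    # client id -> queue of (node id, status)

    def register(self, requested_nodes):
        """Assign up to `requested_nodes` free nodes and return a client id."""
        client_id = f"client-{len(self._assigned) + 1}"
        granted = self._free[:requested_nodes]
        self._free = self._free[requested_nodes:]
        self._assigned[client_id] = set(granted)
        self._events[client_id] = queue.Queue()
        return client_id

    def list(self, client_id):
        return sorted(self._assigned[client_id])

    def watch(self, client_id, timeout_seconds):
        """Yield events until none arrives within `timeout_seconds`."""
        events = self._events[client_id]
        while True:
            try:
                yield events.get(timeout=timeout_seconds)
            except queue.Empty:
                return

    def fail_node(self, node_id):
        """Simulator side: notify whichever client owns the node."""
        for client_id, nodes in self._assigned.items():
            if node_id in nodes:
                self._events[client_id].put((node_id, "failed"))


class SimulatedScheduler:
    """Test client: register, list into a cache, then apply watch events."""

    def __init__(self, service, requested_nodes):
        self.service = service
        self.requested_nodes = requested_nodes
        self.cache = {}

    def start(self):
        self.client_id = self.service.register(self.requested_nodes)
        self.cache = {n: "ready" for n in self.service.list(self.client_id)}

    def drain_watch(self, timeout_seconds=0.01):
        for node_id, status in self.service.watch(self.client_id, timeout_seconds):
            self.cache[node_id] = status


service = FakeResourceService([f"node-{i}" for i in range(10)])
scheduler = SimulatedScheduler(service, requested_nodes=4)
scheduler.start()
service.fail_node("node-2")
scheduler.drain_watch()
print(scheduler.cache["node-2"])  # failed
```

In a real run the watch would stay open for the designated time instead of draining a local queue; the short timeout here only keeps the sketch single-threaded.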

Test automation

GCE env: test-setup.sh, test-teardown.sh

Test data*

Test	Total Nodes	Regions	Nodes per Region	RPs per Region	Nodes per RP	Schedulers	Nodes per scheduler list
field goal	1m	5	200K	10	20K	40	25K

Current findings:

With logging level 4 and all components in the same region, prolonged watches are less than 1%, i.e. 99% of watch sessions take less than 1 second.
With logging level 9 and all components in the same region, prolonged watches are 26-31%.
With logging level 4 across regions, prolonged watches are 14-64%, depending on how the regions are spread.
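
For reference, the prolonged-watch share can be computed from recorded watch durations as in the sketch below. It assumes a watch counts as prolonged when it takes 1 second or longer; whether exactly 1 second counts is an assumption, and the function name is illustrative.

```python
from typing import Sequence


def prolonged_watch_percent(durations_seconds: Sequence[float],
                            threshold_seconds: float = 1.0) -> float:
    """Percentage of watch sessions that took `threshold_seconds` or longer."""
    if not durations_seconds:
        raise ValueError("no watch durations recorded")
    prolonged = sum(1 for d in durations_seconds if d >= threshold_seconds)
    return 100.0 * prolonged / len(durations_seconds)


# Synthetic durations for illustration only, not measured data.
print(prolonged_watch_percent([0.1, 0.2, 1.5, 0.3]))  # 25.0
```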

Issues to be investigated

env setup to support multiple regions for simulators and clients
reduce the perf impact of debug logging -- re-adjust some L9 log statements
investigate the root causes of the cross-region perf impact and possible fixes

options and alternatives:

change schedulers to 20 for the field goal test case
use a larger server machine (say 64 vCPUs) and compare the perf

730 sign off tests

Test	Total Nodes	Regions	Nodes per Region	RPs per Region	Nodes per RP	Schedulers	Nodes per scheduler list	Notes	Register Latency (ms)	List Latency (ms)	Watch P50 (ms)	Watch P90 (ms)	Watch P99 (ms)
test-1	1m	5	200K	10	20K	20	25K	disable metric, daily data change pattern	301	871	108	175	211
test-1.1	1m	5	200K	10	20K	20	25K	disable metric, RP down data change pattern	374	1012	1021	1137	1156
test-2	1m	5	200K	10	20K	20	25K	enable metric, daily data change pattern	298	1097	116	181	201
test-2.1	1m	5	200K	10	20K	20	25K	enable metric, RP down data change pattern	359	1012	1002	1074	1093
test-3	1m	5	200K	10	20K	20	50K	disable metric, daily data change pattern	369	1766	109	173	217
test-3.1	1m	5	200K	10	20K	20	50K	disable metric, RP down data change pattern	337	1679	877	1174	1200
test-4	1m	5	200K	10	20K	40	25K	disable metric, daily data change pattern	135	811	92	161	195
test-5	1m	5	200K	10	20K	20	25K	disable metric, daily data change pattern, all in one region
test-5.1	1m	5	200K	10	20K	40	25K	disable metric, daily data change pattern, all in one region
test-5.2	1m	5	200K	10	20K	20	25K	disable metric, rp down, all in one region
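
The watch P50, P90, and P99 columns are latency percentiles. The plan does not say how they are computed; the sketch below uses the nearest-rank method as one plausible choice, with a hypothetical helper name and integer percentiles only.

```python
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic latencies 1..100 ms for illustration only, not measured data.
samples = list(range(1, 101))
print(percentile(samples, 50), percentile(samples, 90), percentile(samples, 99))  # 50 90 99
```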

Regions where each component is deployed

Service:

Region	Location
us-central1-a	Council Bluffs, Iowa

Simulators:

Region	Location
us-central1-a	Council Bluffs, Iowa
us-east1-b	Moncks Corner, SC
us-west2-a	Los Angeles, CA
us-west3-c	Salt Lake City, Utah
us-west4-a	Las Vegas, Nevada

Schedulers:

Region	Location
us-west3-b	Salt Lake City, Utah
us-east4-b	Ashburn, Virginia
