
730 test

Ying Huang edited this page Aug 1, 2022 · 10 revisions

components

Test clients to simulate schedulers (only the registration, list, and watch parts, plus the schedulers' cache).
Each scheduler lists 25K nodes and watches them. This dimension is fixed for now; scheduler logic is not under test.
     |
     |
Resource management service (730)
     |
     |
Test data service to simulate the regional resource managers; each instance simulates one region with 10 RP clusters

simulated data change patterns

There will be two data change patterns, to simulate day-to-day node failures in a region and an occasional cluster outage (25K nodes).

DailyRoutine: 10 node failures per region per minute. With 5 simulators, that is 50 node failures per minute and a total of 1500 node failures over 30 minutes.

RpDown: at a given time, one RP goes down and all nodes in that RP are marked as failed nodes.
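
As a quick sanity check of the arithmetic behind the two patterns, here is a minimal sketch. The function names are hypothetical and not part of the test tooling; the numbers are the ones quoted above.

```python
# Hypothetical helpers; only the input numbers come from the test plan.

def daily_routine_failures(failures_per_region_per_minute, regions, minutes):
    """Return (failures per minute across all regions, total failures)."""
    per_minute = failures_per_region_per_minute * regions
    return per_minute, per_minute * minutes


def rp_down_failures(nodes_per_rp, rps_down=1):
    """In the RpDown pattern every node in a downed RP is marked as failed."""
    return nodes_per_rp * rps_down


if __name__ == "__main__":
    per_minute, total = daily_routine_failures(10, regions=5, minutes=30)
    print(per_minute, total)         # 50 1500
    print(rp_down_failures(25_000))  # 25000
```

The RpDown burst (one RP worth of nodes at once) is much larger than the whole 30-minute DailyRoutine volume, which is why the two patterns are tested separately.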

basic test criteria

   Resource management service sustains 1m simulated nodes for 30 minutes, i.e. register, list and watch for 30 minutes
   (not a must-have bar): node changes propagate from simulator to scheduler within 1 second.

test env setup

Single Resource management service instance on one server machine
(start with a GCE n1-32 VM with 32 cores, 120GB RAM, 500GB SSD, premium network)

(the client side and the resource simulator side can start with n1-16 core VMs)

N test data service machines, with one machine simulating 1 region; a 10-region test needs 10 machines.
M test client machines, with one machine simulating 10 schedulers; a 100-scheduler test needs 10 machines.
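
The machine counts above follow directly from the per-machine ratios. A minimal sketch of that sizing rule, assuming one region per test data service machine and 10 schedulers per test client machine as stated (the function name and dictionary keys are made up for illustration):

```python
import math


def machines_needed(regions, schedulers,
                    regions_per_machine=1, schedulers_per_machine=10):
    """Return the number of machines of each kind for one test run."""
    return {
        "service": 1,  # single Resource management service instance
        "test_data_service": math.ceil(regions / regions_per_machine),
        "test_client": math.ceil(schedulers / schedulers_per_machine),
    }


if __name__ == "__main__":
    print(machines_needed(regions=10, schedulers=100))
    # {'service': 1, 'test_data_service': 10, 'test_client': 10}
```

Rounding up covers scheduler counts that are not a multiple of 10, for example 25 schedulers would need 3 client machines.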

data config

Basic cases:

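| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|---|---|---|---|---|---|---|---|
| baseline | 500K | 2 | 250K | 10 | 25K | 1 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 25K |
| baseline | 500K | 2 | 250K | 10 | 25K | 10 | 5K |
| Test | 500K | 2 | 250K | 10 | 25K | 20 | 25K |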

goals:

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|---|---|---|---|---|---|---|---|
| field goal | 1m | 5 | 200K | 10 | 20K | 10-20 | 25K - 50K |
| touch down | 2m | 10 | 200K | 20 | 10K | 200 | 10K |
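
The node counts in the data config rows are tied together by the topology: total nodes = regions × nodes per region, and nodes per region = RPs per region × nodes per RP. A small sketch that checks a row against those two identities (helper names are hypothetical; the row values are copied from the tables above):

```python
def parse_count(text):
    """Parse counts written as '25K', '1m', or plain integers."""
    text = str(text).strip().lower()
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def is_consistent(total, regions, nodes_per_region, rps_per_region, nodes_per_rp):
    """Check both topology identities for one config row."""
    per_region = parse_count(nodes_per_region)
    return (parse_count(total) == regions * per_region
            and per_region == rps_per_region * parse_count(nodes_per_rp))


if __name__ == "__main__":
    rows = {
        "baseline": ("500K", 2, "250K", 10, "25K"),
        "field goal": ("1m", 5, "200K", 10, "20K"),
        "touch down": ("2m", 10, "200K", 20, "10K"),
    }
    for name, row in rows.items():
        print(name, is_consistent(*row))
```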

730 test client Operations

Each scheduler registers, then lists and watches its nodes for a designated time. TODO: add the service APIs here.

Test automation

GCE env: test-setup.sh, test-teardown.sh

Test data*

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
|---|---|---|---|---|---|---|---|
| field goal | 1m | 5 | 200K | 10 | 20K | 40 | 25K |

Current findings:

  • With logging level 4 and all components in the same region, prolonged watches are less than 1%, i.e. 99% of watch sessions take less than 1 second.
  • With logging level 9 and all components in the same region, prolonged watches are 26-31%.
  • With logging level 4 across regions, prolonged watches are 14% to 64%, depending on how the regions are spread.
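
To make the "prolonged watches" percentage concrete, here is a minimal sketch of how such a figure could be computed from recorded watch latencies. It assumes a watch counts as prolonged when it takes 1 second or longer, which matches the "less than 1 second" wording above; the function name and sample values are illustrative only, not measured data.

```python
def prolonged_watch_percent(latencies_ms, threshold_ms=1000.0):
    """Percentage of watch latencies at or above the threshold (assumed 1 s)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    prolonged = sum(1 for latency in latencies_ms if latency >= threshold_ms)
    return 100.0 * prolonged / len(latencies_ms)


if __name__ == "__main__":
    sample = [100, 200, 1500, 900]  # made-up latencies in milliseconds
    print(prolonged_watch_percent(sample))  # 25.0
```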

Issues to be investigated

  • env setup to support multiple regions for simulators and clients
  • reduce the perf impact of debug logging, e.g. re-adjust some L9 log statements
  • investigate the root causes of the cross-region perf impact, and fixes

options and alternatives:

  1. change the number of schedulers to 20 for the field goal test case
  2. use a larger server machine (say 64 vCPUs) and compare the perf differences

730 sign off tests

| Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list | Notes | Register Latency (ms) | List Latency (ms) | Watch P50 (ms) | Watch P90 (ms) | Watch P99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test-1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, daily data change pattern | 301 | 871 | 108 | 175 | 211 |
| test-1.1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, RP down data change pattern | 374 | 1012 | 1021 | 1137 | 1156 |
| test-2 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | enable metric, daily data change pattern | 298 | 1097 | 116 | 181 | 201 |
| test-2.1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | enable metric, RP down data change pattern | 359 | 1012 | 1002 | 1074 | 1093 |
| test-3 | 1m | 5 | 200K | 10 | 20K | 20 | 50K | disable metric, daily data change pattern | 369 | 1766 | 109 | 173 | 217 |
| test-3.1 | 1m | 5 | 200K | 10 | 20K | 20 | 50K | disable metric, RP down data change pattern | 337 | 1679 | 877 | 1174 | 1200 |
| test-4 | 1m | 5 | 200K | 10 | 20K | 40 | 25K | disable metric, daily data change pattern | 135 | 811 | 92 | 161 | 195 |
| test-5 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, daily data change pattern, all in one region | | | | | |
| test-5.1 | 1m | 5 | 200K | 10 | 20K | 40 | 25K | disable metric, daily data change pattern, all in one region | | | | | |
| test-5.2 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, RP down, all in one region | | | | | |
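
For readers reproducing the watch P50/P90/P99 columns, here is a minimal sketch of one common way to compute percentiles from raw latency samples, the nearest-rank method. The page does not say which percentile method the test tooling uses, so treat this as an assumption; the sample data is synthetic.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing to avoid floating-point error for whole-number ranks.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic samples: 1..100 ms
    print(percentile(latencies_ms, 50),
          percentile(latencies_ms, 90),
          percentile(latencies_ms, 99))  # 50 90 99
```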

*** Regions where each component is deployed

Service:

| Region | Location |
|---|---|
| us-central1-a | Council Bluffs, Iowa |

Simulators:

| Region | Location |
|---|---|
| us-central1-a | Council Bluffs, Iowa |
| us-east1-b | Moncks Corner, SC |
| us-west2-a | LA, CA |
| us-west3-c | Salt Lake City, Utah |
| us-west4-a | Las Vegas, Nevada |

Schedulers:

| Region | Location |
|---|---|
| us-west3-b | Salt Lake City, Utah |
| us-east4-b | Ashburn, Virginia |