# 730 test
## components
Test clients to simulate schedulers (only the registration, list, and watch parts, plus the schedulers' caches). Each scheduler will list 25K nodes and then watch (a fixed dimension for now); we are not testing scheduler logic yet.
|
|
Resource management service (730)
|
|
Test data service to simulate the regional resource managers; each instance simulates one region with 10 RP clusters.
Simulated data change patterns: there will be two patterns, covering day-to-day node failures in a region and the occasional cluster going down (25K nodes).
DailyRoutine: 10 node failures per region per minute; with 5 simulators that is 50 node failures per minute, for a total of 1,500 node failures over 30 minutes.
RpDown: simulate that, at a given time, one RP goes down and all nodes in that RP are marked as failed.
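A minimal sketch of the two change patterns, assuming a simple event-generator shape (the function names mirror DailyRoutine/RpDown above, but the generator interface and node-id scheme are illustrative, not the simulator's actual API):

```python
import random

NODES_PER_RP = 25_000
RPS_PER_REGION = 10
FAILURES_PER_REGION_PER_MIN = 10  # DailyRoutine rate from the plan

def daily_routine(minutes, regions):
    """Yield (minute, region, node_id) failure events: 10 per region per minute."""
    nodes_per_region = RPS_PER_REGION * NODES_PER_RP
    for minute in range(minutes):
        for region in range(regions):
            for node in random.sample(range(nodes_per_region),
                                      FAILURES_PER_REGION_PER_MIN):
                yield (minute, region, node)

def rp_down(region, rp_index):
    """RpDown: mark every node in one RP of a region as failed."""
    start = rp_index * NODES_PER_RP
    return [(region, node) for node in range(start, start + NODES_PER_RP)]

# 5 simulators (regions), 30 minutes -> 50 failures/min, 1500 total
assert len(list(daily_routine(30, 5))) == 1500
assert len(rp_down(0, 3)) == 25_000
```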
## basic test criteria
Resource management service sustains 1M simulated nodes for 30 minutes, i.e. register, list, and watch continuously for 30 minutes.
Stretch goal (not a must-have bar): node changes propagate from simulator to scheduler within 1 second.
## test env setup
Single Resource management service instance on one server machine
(start with a GCE n1-32 VM: 32 cores, 120GB RAM, 500GB SSD, premium network).
The client side and the resource-simulator side can start with n1-16 (16-core) VMs.
N test data service machines, with one machine simulating 1 region; for a 10-region test there will be 10 machines.
M test client machines, with one machine simulating 10 schedulers; for a 100-scheduler test there will be 10 machines.
## data config
### Basic cases
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
baseline | 500K | 2 | 250K | 10 | 25K | 1 | 25K |
baseline | 500K | 2 | 250K | 10 | 25K | 10 | 25K |
baseline | 500K | 2 | 250K | 10 | 25K | 10 | 5K |
Test | 500K | 2 | 250K | 10 | 25K | 20 | 25K |
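Each row should be internally consistent (Regions × Nodes per Region = Total Nodes, and RPs per Region × Nodes per RP = Nodes per Region); a small sanity check using the figures from the data-config tables:

```python
def check_row(total, regions, nodes_per_region, rps_per_region, nodes_per_rp):
    """Verify a data-config row is internally consistent."""
    assert regions * nodes_per_region == total
    assert rps_per_region * nodes_per_rp == nodes_per_region
    return True

assert check_row(500_000, 2, 250_000, 10, 25_000)     # baseline rows
assert check_row(1_000_000, 5, 200_000, 10, 20_000)   # field goal
assert check_row(2_000_000, 10, 200_000, 20, 10_000)  # touch down
```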
### goals
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
field goal | 1m | 5 | 200K | 10 | 20K | 10-20 | 25K - 50K |
touch down | 2m | 10 | 200K | 20 | 10K | 200 | 10K |
## 730 test client Operations
Each scheduler will register, then list and watch its nodes for the designated time. todo: add the service APIs here:
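Since the service APIs are still a todo above, the client calls below (`register`, `list_nodes`, `watch`) are placeholder names; this is only a sketch of the register → list → watch loop each simulated scheduler runs, with its local node cache:

```python
import time

class SchedulerClient:
    """Skeleton of one simulated scheduler: register, list, then watch,
    keeping a local node cache in sync (API names are hypothetical)."""

    def __init__(self, service, scheduler_id, node_quota):
        self.service = service
        self.scheduler_id = scheduler_id
        self.node_quota = node_quota     # e.g. 25K nodes per scheduler list
        self.cache = {}                  # node_id -> latest node state

    def run(self, duration_sec):
        self.service.register(self.scheduler_id)
        # Initial list fills the cache with this scheduler's nodes.
        for node in self.service.list_nodes(self.scheduler_id, self.node_quota):
            self.cache[node["id"]] = node
        # Watch for the designated time, applying each change to the cache.
        deadline = time.time() + duration_sec
        for event in self.service.watch(self.scheduler_id):
            self.cache[event["node"]["id"]] = event["node"]
            if time.time() >= deadline:
                break
```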
## Test automation
GCE env: `test-setup.sh`, `test-teardown.sh`
Test data*
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list |
---|---|---|---|---|---|---|---|
field goal | 1m | 5 | 200K | 10 | 20K | 40 | 25K |
Current findings:
- With logging level 4, all components in the same region: prolonged watches under 1%, i.e. 99% of watch sessions complete in less than 1 second.
- With logging level 9, all components in the same region: 26-31% prolonged watches.
- With logging level 4, components spread across regions: 14-64% prolonged watches, depending on how the regions are spread.
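A "prolonged watch" here is a watch delivery slower than the 1-second bar from the test criteria; a sketch of how the percentages above can be computed (the sample numbers are invented):

```python
def prolonged_pct(latencies_ms, threshold_ms=1000):
    """Percent of watch deliveries slower than the threshold."""
    slow = sum(1 for lat in latencies_ms if lat > threshold_ms)
    return 100.0 * slow / len(latencies_ms)

# 99 fast deliveries and 1 prolonged one -> 1% prolonged, 99% under 1 second
samples = [108] * 99 + [1500]
assert prolonged_pct(samples) == 1.0
```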
## Issues to be investigated
- env setup to support multiple regions for simulators and clients
- reduce the perf impact of debug logging -- readjust some verbosity-level-9 (L9) log statements
- investigate the root causes of the cross-region perf impact and fix them
Options and alternatives:
- change the scheduler count to 20 for the field goal test case
- use a larger server machine (say 64 vCPUs) and measure the perf difference
## 730 sign off tests
Test | Total Nodes | Regions | Nodes per Region | RPs per Region | Nodes per RP | Schedulers | Nodes per scheduler list | Notes | Register Latency (ms) | List Latency (ms) | Watch P50 (ms) | P90 (ms) | P99 (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
test-1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, daily data change pattern | 301 | 871 | 108 | 175 | 211 |
test-1.1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, RP down data change pattern | 374 | 1012 | 1021 | 1137 | 1156 |
test-2 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | enable metric, daily data change pattern | 298 | 1097 | 116 | 181 | 201 |
test-2.1 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | enable metric, RP down data change pattern | 359 | 1012 | 1002 | 1074 | 1093 |
test-3 | 1m | 5 | 200K | 10 | 20K | 20 | 50K | disable metric, daily data change pattern | 369 | 1766 | 109 | 173 | 217 |
test-3.1 | 1m | 5 | 200K | 10 | 20K | 20 | 50K | disable metric, RP down data change pattern | 337 | 1679 | 877 | 1174 | 1200 |
test-4 | 1m | 5 | 200K | 10 | 20K | 40 | 25K | disable metric, daily data change pattern | 135 | 811 | 92 | 161 | 195 |
test-5 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, daily data change pattern, all in one region | |||||
test-5.1 | 1m | 5 | 200K | 10 | 20K | 40 | 25K | disable metric, daily data change pattern, all in one region | |||||
test-5.2 | 1m | 5 | 200K | 10 | 20K | 20 | 25K | disable metric, RP down data change pattern, all in one region |||||
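The Watch P50/P90/P99 columns are latency percentiles; a nearest-rank computation (an assumption about how the numbers were derived, not the actual measurement tooling) looks like:

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies = list(range(1, 101))  # 1..100 ms, for illustration
assert percentile(latencies, 50) == 50
assert percentile(latencies, 90) == 90
assert percentile(latencies, 99) == 99
```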
*** Regions where each component is deployed
Region | Location |
---|---|
us-central1-a | Council Bluffs, Iowa |
Region | Location |
---|---|
us-central1-a | Council Bluffs, Iowa |
us-east1-b | Moncks Corner, SC |
us-west2-a | Los Angeles, CA |
us-west3-c | Salt Lake City, Utah |
us-west4-a | Las Vegas, Nevada |
Region | Location |
---|---|
us-west3-b | Salt Lake City, Utah |
us-east4-b | Ashburn, Virginia |