# Performance Analysis

This document describes the latest state of the performance analysis of Landscaper used in the context
of the Landscaper as a Service ([LaaS](https://github.com/gardener/landscaper-service)) with
[Gardener Clusters](https://github.com/gardener/gardener).

# Test 1

## Test Setup

Tests with Landscaper version v0.90.0.

One Landscaper instance of the Dev-Landscape was used with the test data from
[here](https://github.com/gardener/landscaper-examples/tree/master/scaling/many-deployitems/installation3), consisting of:

- 6 root installations
- 50 sub installations for every root installation
- one deploy item for every sub installation
- every deploy item deploys a Helm chart with a config map of about 1.3 kB of input data

Installations were created in one namespace in steps of 200.

## Test Results

The following shows the duration until a packet of 200 Installations was finished:

- First 200: 183 s
- Next 200: 326 s
- Next 200: 501 s
- Next 200: 598 s
- Next 200: 771 s
- Next 200: 976 s
- Next 200: 1170 s (1 Installation failed)
- Next 200: 1242 s (6 Installations failed)

After the creation of these 1600 Installations in one namespace, another packet of 200 Installations was created in
another namespace. The duration for this was 185 s.

Conclusion: if the number of Installations in one namespace increases, the duration of their executions also increases
heavily. If there are already 1000 Installations in a namespace, the execution of a further 200 Installations requires about
20 minutes. The reason for this is the huge number of list operations with label selectors that the Landscaper executes against
the API server of the resource cluster.

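The measured growth is what one would expect if every reconcile scans all objects already present in the namespace. The following is a toy model of that effect, not Landscaper code; the batch size matches the test, but the per-item base cost and per-object list cost are invented constants:

```python
# Toy model: each reconciled Installation triggers list operations over the
# objects already in the namespace, so the cost of batch k grows with k.

def batch_duration(batch_index, batch_size=200, base_cost=0.5, list_cost=0.002):
    """Estimated seconds to process one batch of Installations."""
    existing = batch_index * batch_size          # objects from earlier batches
    per_item = base_cost + list_cost * existing  # list cost grows with size
    return batch_size * per_item

# Eight batches of 200, as in the test above: each batch takes a fixed
# amount longer than the previous one, mirroring the measured trend.
durations = [batch_duration(k) for k in range(8)]
assert all(later > earlier for earlier, later in zip(durations, durations[1:]))
```

With the invented constants, the model produces 100 s for the first batch and a constant increase of 80 s per batch; the real measurements also grow roughly linearly, which is consistent with per-item work that is proportional to the namespace size.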
## Improvements

The following improvements were implemented to reduce the number of list operations with label selectors:

- DeployItems cache in the status of Executions: [PR](https://github.com/gardener/landscaper/pull/935)
  - Used to access the DeployItems directly instead of fetching them via list operations
- Subinstallation cache in the status of Installations: [PR](https://github.com/gardener/landscaper/pull/936)
  - Used to access the Subinstallations directly instead of fetching them via list operations
- Sibling import/export hints: [PR](https://github.com/gardener/landscaper/pull/937)
  - Prevents list operations to compute predecessor and successor Installations if no data is exchanged

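The first two improvements replace a list-and-filter pattern with direct lookups of names cached in the status. A minimal sketch of the idea, with plain Python dictionaries standing in for the k8s API and invented object shapes:

```python
# Before: find an Execution's DeployItems by listing every item in the
# namespace and filtering on a label. Cost grows with namespace size.
def deploy_items_via_list(all_items, execution_name):
    return [i for i in all_items
            if i["labels"].get("execution") == execution_name]

# After: the Execution status caches the names of its DeployItems, so each
# one can be fetched directly by key, independent of namespace size.
def deploy_items_via_cache(items_by_name, cached_names):
    return [items_by_name[name] for name in cached_names]

items = [{"name": f"di-{n}",
          "labels": {"execution": "exec-a" if n < 2 else "exec-b"}}
         for n in range(5)]
by_name = {i["name"]: i for i in items}

listed = deploy_items_via_list(items, "exec-a")
cached = deploy_items_via_cache(by_name, ["di-0", "di-1"])
assert listed == cached  # same result, but without scanning the namespace
```

The cached variant shifts the cost of the lookup from the size of the namespace to the (small, constant) number of objects actually owned by the Execution.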
The improvements were tested with the following test setup:

- One Landscaper instance with 10 namespaces.
- In every namespace about 1000 Installations with 1000 Executions and 1000 DeployItems. The DeployItems just
  install a config map. There are no sibling exports or imports, and the corresponding hint flags are set to true in the Installations.
- One helm deployer pod with 120 worker threads.
- One main controller pod with 60 worker threads for Installations and 60 worker threads for Executions.

The tests were executed with the old Landscaper version v0.90.0 and with a Landscaper containing the improvements described above.

Test results:

- Creation of 1000 Installations/1000 Executions/1000 DeployItems in one namespace
  - Duration before optimisation: 3046 s
  - Duration after optimisation: 1050 s
- Update of 1000 Installations/1000 Executions/1000 DeployItems in one namespace
  - Duration before optimisation: 3601 s
  - Duration after optimisation: 1166 s

The creation and update times for 1000/1000/1000 objects remained stable until 20,000 Installations with 20,000 Executions
and 20,000 DeployItems had been created in 20 different namespaces. No tests with more objects have been executed so far.

## Comparison with cached client

The optimised version was compared with a version using a cached k8s client. The test setup was similar to that of the
previous chapter.

- Creation of 500 Installations/500 Executions/500 DeployItems in another namespace
  - Duration with optimisation: 400 s
  - Duration with cached client: 228 s
- Update of 500 Installations/500 Executions/500 DeployItems in one namespace
  - Duration with optimisation: 389 s
  - Duration with cached client: 217 s

The memory consumption of the version with the cached client was about ten times higher than that of the optimised version.

**Memory consumption of the optimised version:**

```
NAME                                                            CPU(cores)   MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-5f646cff6-5vqjd  2m           98Mi
helm-test0001-2f9e5e91-helm-deployer-d8b7744b6-wxslx            312m         318Mi
landscaper-test0001-2f9e5e91-7f844f9f7c-9mfsq                   9m           157Mi
landscaper-test0001-2f9e5e91-main-545ccccc6d-75qpl              164m         343Mi
manifest-test0001-2f9e5e91-manifest-deployer-7c555589bd-wdwf8   2m           79Mi
```

**Memory consumption of the version with the cached client:**

```
NAME                                                             CPU(cores)   MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-697d7b6449-6b6vf  15m          240Mi
helm-test0001-2f9e5e91-helm-deployer-6ff7686c6f-zl5rc            1664m        4445Mi
landscaper-test0001-2f9e5e91-7776698fb-lx56p                     32m          627Mi
landscaper-test0001-2f9e5e91-main-6bcbd8788c-j25mh               508m         2268Mi
manifest-test0001-2f9e5e91-manifest-deployer-6d546d9c6c-46dz2    20m          845Mi
```

## Duration for small numbers without sibling hints

The following shows the durations to create, update, and delete only a small number of Installations/Executions/DeployItems in a new
and empty namespace, whereby the sibling hints of the third optimisation are not used. The cluster already contains about
20,000 Installations with 20,000 Executions and 20,000 DeployItems in 20 namespaces.

```
100/100/100: create: 173s - update: 159s - delete:  63s
200/200/200: create: 323s - update: 272s - delete: 115s
300/300/300: create: 413s - update: 401s - delete: 175s
400/400/400: create: 543s - update: 542s - delete: 285s
500/500/500: create: 678s - update: 659s - delete: 394s
```

Here are the corresponding numbers if the sibling hints are activated:

```
100/100/100: create: 109s - update: 106s - delete:  59s
200/200/200: create: 183s - update: 176s - delete: 120s
300/300/300: create: 263s - update: 242s - delete: 204s
400/400/400: create: 356s - update: 329s - delete: 365s
500/500/500: create: 429s - update: 436s - delete: 532s
```

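Relating the two runs directly (create durations copied from the tables above) quantifies the effect of the sibling hints:

```python
# Create durations in seconds, copied from the two measurement tables above.
sizes = [100, 200, 300, 400, 500]
create_without_hints = [173, 323, 413, 543, 678]
create_with_hints = [109, 183, 263, 356, 429]

speedups = [without / with_ for without, with_
            in zip(create_without_hints, create_with_hints)]
# Sibling hints speed up creation by roughly a factor of 1.5 to 1.8 at every
# size, while the delete durations above show the opposite trend for larger sizes.
assert all(1.4 < s < 1.9 for s in speedups)
```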
## Improve startup behaviour

With more k8s objects in a resource cluster, the startup of the Landscaper becomes much slower, because all watched
objects are first presented to the controllers. When restarting a Landscaper that watches a resource cluster with about
20,000 Installations, 20,000 Executions, and 20,000 DeployItems in 20 namespaces, it takes about 10 minutes
until the Landscaper starts processing newly created Installations.

After the introduction of a startup cache ([see](https://github.com/gardener/landscaper/pull/948)), the Landscaper requires only about 30 s
until the processing of newly created Installations starts.

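The ten-minute delay is plausible from the object counts alone. A back-of-the-envelope check, where the event-handling rate is an assumed figure, not a measured Landscaper value:

```python
# On startup, every watched object is delivered to its controller once
# before newly created objects get processed.
watched_objects = 20_000 * 3      # Installations + Executions + DeployItems
assumed_rate_per_s = 100          # hypothetical events handled per second
startup_delay_s = watched_objects / assumed_rate_per_s
print(f"{startup_delay_s / 60:.0f} min")  # prints "10 min" for these numbers
```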
Besides the startup problem, the periodic reconciliation of all watched items of a controller every 10 hours also prevents
the execution of modified items for several minutes. Therefore, the frequency of this operation was reduced to once every 1000
days, so that it should effectively never happen anymore, because the pods are usually restarted before that, at the latest
during the regular updates.

The tests show that the default settings already give quite good results.

For settings other than the default, the configuration of the root installation of a Landscaper instance in a LaaS
landscape has to be adapted as follows:

```yaml
landscaperConfig:
  k8sClientSettings: # changed
    resourceClient: # changed
      burst: <newValue> # changed
      qps: <newValue> # changed
  deployers:
    - helm
    - manifest
    - container
  deployersConfig: # changed
    helm: # changed
      deployer: # changed
        k8sClientSettings: # changed
          resourceClient: # changed
            burst: <newValue> # changed
            qps: <newValue> # changed
    manifest: # changed
      deployer: # changed
        k8sClientSettings: # changed
          resourceClient: # changed
            burst: <newValue> # changed
            qps: <newValue> # changed
```

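The `burst` and `qps` values follow the usual client-side token-bucket semantics: `qps` is the sustained refill rate and `burst` is the bucket capacity that absorbs short spikes. A small illustrative sketch in plain Python (not the client-go implementation):

```python
# Token-bucket sketch of client-side rate limiting: `qps` tokens are refilled
# per second up to a capacity of `burst`; each request consumes one token.
class TokenBucket:
    def __init__(self, qps, burst):
        self.qps, self.burst = qps, burst
        self.tokens = burst          # start with a full bucket
        self.last = 0.0              # timestamp of the last refill

    def allow(self, now):
        """Return True if a request at time `now` may proceed immediately."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(qps=40, burst=60)  # the default settings mentioned above
# 60 immediate requests pass (the burst); the 61st has to wait for a refill.
results = [bucket.allow(now=0.0) for _ in range(61)]
assert results.count(True) == 60 and results[-1] is False
```

This is why the burst value dominates short reconcile spikes, while sustained load is bounded by the qps value.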
## Conclusions

The communication with the API server of the resource cluster has a big influence on the Landscaper performance.
Relaxing the request rate limits of the k8s client used by the Landscaper results in a speed-up of about a factor of 6.
Parallelisation can further improve the performance by a factor of about 3.

Unfortunately, if the number of requests to the API server becomes too high, the API server might become unresponsive,
resulting in deployment errors. Due to the large number of different usage scenarios, it is currently hard to judge
which setup is optimal with respect to performance and stability.

For now, we decided to release the Landscaper without parallelisation and with the default restricted burst and qps rates
(60/40). If there are problems with an overloaded API server, these values can be reduced accordingly.

So far, the tests have been quite restricted, and other usage patterns might reveal different bottlenecks, such as high
memory consumption. Therefore, we need to investigate this on our productive landscapes for the different customer scenarios.