disable periodic item reconcile (run-int-tests) (gardener#951)
achimweigel authored Jan 19, 2024
1 parent 66a1238 commit 7fbb243
Showing 6 changed files with 117 additions and 100 deletions.
3 changes: 3 additions & 0 deletions cmd/landscaper-agent/app/options.go
@@ -8,7 +8,9 @@ import (
"errors"
goflag "flag"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

flag "github.com/spf13/pflag"
"k8s.io/apimachinery/pkg/runtime/serializer"
@@ -68,6 +70,7 @@ func (o *options) Complete() error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

hostRestConfig, err := ctrl.GetConfig()
3 changes: 3 additions & 0 deletions cmd/landscaper-controller/app/app.go
@@ -7,7 +7,9 @@ package app
import (
"context"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

"github.com/mandelsoft/vfs/pkg/osfs"
"github.com/spf13/cobra"
@@ -85,6 +87,7 @@ func (o *Options) run(ctx context.Context) error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

//TODO: investigate whether this is used with an uncached client
2 changes: 2 additions & 0 deletions cmd/target-sync-controller/app/app.go
@@ -8,6 +8,7 @@ import (
"context"
"fmt"
"os"
"time"

"github.com/spf13/cobra"
"k8s.io/client-go/tools/clientcmd"
@@ -56,6 +57,7 @@ func (o *options) run(ctx context.Context) error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

data, err := os.ReadFile(o.landscaperKubeconfigPath)
198 changes: 100 additions & 98 deletions docs/technical/performance.md
@@ -1,144 +1,146 @@
# Performance Analysis

This document describes the current state of the performance analysis of Landscaper used in the context
This document describes the latest state of the performance analysis of Landscaper used in the context
of the Landscaper as a Service ([LaaS](https://github.com/gardener/landscaper-service)) with
[Gardener Clusters](https://github.com/gardener/gardener).

# Test 1
## Initial Situation

## Test Setup
Tests with Landscaper version v0.90.0.

Usage of one Landscaper instance of the Dev-Landscape with the test data from
[here](https://github.com/gardener/landscaper-examples/tree/master/scaling/many-deployitems/installation3) consisting of:
Installations were created in one namespace in steps of 200.

- 6 root installations
- 50 sub installations for every root installation
- One deploy item for every sub installation
- Every deploy item deploys a helm chart with a config map with about 1.3 kB of input data
The following shows the duration for each batch of 200 Installations to finish:

## Test Results
- First 200: 183 s
- Next 200: 326 s
- Next 200: 501 s
- Next 200: 598 s
- Next 200: 771 s
- Next 200: 976 s
- Next 200: 1170 s (1 Installation failed)
- Next 200: 1242 s (6 Installations failed)

This chapter shows the duration for deploying the 6 root installations for different versions of the Landscaper and
our current interpretation of the results.
After the creation of these 1600 Installations in one namespace, another batch of 200 Installations was created in
another namespace. The duration for this was 185 s.

### Current Landscaper Version
Conclusion: As the number of Installations in one namespace grows, the duration of their executions increases sharply.
If there are already 1000 Installations in a namespace, the execution of a further 200 Installations requires about
20 minutes. The reason for this is the large number of list operations with label selectors that the Landscaper executes
against the API server of the resource cluster.
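
To illustrate the pattern (a sketch, not actual Landscaper code; the label key and the use of ConfigMaps as a stand-in
for the Landscaper resources are assumptions): a reconcile that finds its child objects via a namespace-wide list with
a label selector issues a call like the following on every run, and with thousands of objects per namespace these
calls dominate the API server load.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listChildrenByLabel lists all objects in a namespace that carry a given owner
// label. Every reconcile that relies on such a call scans the whole namespace
// on the API server side, which becomes expensive with many objects.
func listChildrenByLabel(ctx context.Context, c client.Client, namespace, owner string) (*corev1.ConfigMapList, error) {
	list := &corev1.ConfigMapList{}
	err := c.List(ctx, list,
		client.InNamespace(namespace),
		client.MatchingLabels{"landscaper.gardener.cloud/owner": owner}, // hypothetical label key
	)
	return list, err
}
```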

Tests with the current official Landscaper release with LaaS v0.71.0
## Improvements

- **Duration: 25:00 (minutes/seconds)**
The following improvements were implemented to reduce the number of list operations with label selectors:

Investigations showed that the main reason for the poor performance in the first tests was the request rate limits of the
Kubernetes clients. Log entries like the following indicate this:
- DeployItems cache in the status of Executions: [PR](https://github.com/gardener/landscaper/pull/935)
  - Used to directly access the DeployItems instead of fetching them via list operations (see the Go sketch after this list)
- Subinstallation cache in the status of Installations: [PR](https://github.com/gardener/landscaper/pull/936)
  - Used to directly access the Subinstallations instead of fetching them via list operations
- Sibling import/export hints: [PR](https://github.com/gardener/landscaper/pull/937)
  - Prevents list operations for computing predecessor and successor installations if no data is exchanged
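
A minimal sketch of the idea behind the first two improvements (the cache type and field names are hypothetical; the
real ones are defined in the PRs above): the parent object keeps name references to its children in its status, so they
can be fetched directly via Get instead of a namespace-wide List with a label selector.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// childRef is a hypothetical stand-in for an entry of the DeployItem or
// subinstallation cache stored in the status of an Execution or Installation.
type childRef struct {
	Name      string
	Namespace string
}

// getChildrenFromCache resolves the children with one Get per cached reference
// instead of a list operation with a label selector over the whole namespace.
func getChildrenFromCache(ctx context.Context, c client.Client, refs []childRef) ([]*corev1.ConfigMap, error) {
	children := make([]*corev1.ConfigMap, 0, len(refs))
	for _, ref := range refs {
		child := &corev1.ConfigMap{} // stands in for a Landscaper DeployItem / Installation
		if err := c.Get(ctx, types.NamespacedName{Namespace: ref.Namespace, Name: ref.Name}, child); err != nil {
			return nil, err
		}
		children = append(children, child)
	}
	return children, nil
}
```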

```
Waited for 8.78812882s due to client-side throttling, not priority and fairness, request: GET:https://api.je09c359.laasds.shoot.live.k8s-hana.ondemand.com/apis/landscaper.gardener.cloud/v1alpha1
```
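
For reference, a sketch of how such client-side limits can be relaxed with client-go and controller-runtime before the
manager and its clients are created; the value 10000 corresponds to the "very high request rate limits" used in the
tests below. In the Landscaper/LaaS setup these values are configured via the `k8sClientSettings` shown later in this
document.

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// managerWithHighRateLimits builds a controller-runtime manager whose clients
// are no longer delayed by client-side throttling.
func managerWithHighRateLimits() (ctrl.Manager, error) {
	restConfig, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	// Raise the client-side rate limits; the low defaults (qps=5, burst=10 in
	// plain client-go) are what cause the throttling messages shown above.
	restConfig.QPS = 10000
	restConfig.Burst = 10000

	return ctrl.NewManager(restConfig, ctrl.Options{})
}
```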
The improvements were tested with the following test setup:

- One Landscaper instance with 10 namespaces.
- In every namespace about 1000 Installations with 1000 Executions and 1000 DeployItems. The DeployItems just
install a configmap. There are no sibling exports or imports and these flags are set to true in the Installations.
- One helm deployer pod with 120 worker threads.
- One main controller pod with 60 worker threads for Installations and 60 worker threads for Executions.

The tests were executed with an old Landscaper version v0.90.0 and a Landscaper with the improvements described above.

Test results:

### Landscaper with improved client request rate limits
- Creation of 1000 Installations/1000 Executions/1000 Deploy Items in one namespace
- Duration before optimisation: 3046s
- Duration after optimisation: 1050s

Tests with a Landscaper with a client having very high request rate limits (burst rate and queries per second = 10000).
- Update of 1000 Installations/1000 Executions/1000 Deploy Items in one namespace
- Duration before optimisation: 3601s
- Duration after optimisation: 1166s

- 30 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-11d2919a8e2bce4a02c3928f7a49fe183d35f63d)
- **Duration: 4:16**

The creation and update times for 1000/1000/1000 objects remained stable up to 20,000 Installations with 20,000 Executions
and 20,000 DeployItems created in 20 different namespaces. No tests with more objects have been executed so far.

- 60 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-8db791bf996047f1b849207472ff9d97bac80481)
- **Duration: 4:10**
## Comparison with cached client

The optimized version was compared with a version using a cached k8s client. The test setup was similar to that of the
previous chapter.

- 120 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-66eb650b1156d7eaced0b3e63def4a8dc0f6cbff)
- **Duration: 5:02**
- Creation of 500 Installations/500 Executions/500 Deploy Items in another namespace
- Duration with optimisation: 400s
- Duration with cached client: 228s

- Update of 500 Installations/500 Executions/500 Deploy Items in one namespace
- Duration with optimisation: 389s
- Duration with cached client: 217s

- 310 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-eb5bb0f8424f25a6ae2871e0bc9f1c50d35228f8)
- **Duration: 4:41**
The memory consumption of the version with the cached client was about ten times higher than that of the optimized version:


The tests show:
- The performance is much better compared to the k8s client with rate limiting.
- The number of parallel worker threads should not be increased too far.

### Landscaper with improved client request rate limits and parallelisation
**Memory consumption of the optimised version:**

Tests with a Landscaper with a client having very high request rate limits (burst rate and queries per second (qps) = 10000)
and multiple replicas for the pods running the controllers for installations, executions and helm deploy items.
```
NAME CPU(cores) MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-5f646cff6-5vqjd 2m 98Mi
helm-test0001-2f9e5e91-helm-deployer-d8b7744b6-wxslx 312m 318Mi
landscaper-test0001-2f9e5e91-7f844f9f7c-9mfsq 9m 157Mi
landscaper-test0001-2f9e5e91-main-545ccccc6d-75qpl 164m 343Mi
manifest-test0001-2f9e5e91-manifest-deployer-7c555589bd-wdwf8 2m 79Mi
```

LaaS version: v0.72.0-dev-7f456ae4edb6a86847bb210e25ef9c3f26ed6ada
**Memory consumption of version with cached client:**

- 1 pod for inst, exec, di controllers: **Duration: 4:16**
```
NAME CPU(cores) MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-697d7b6449-6b6vf 15m 240Mi
helm-test0001-2f9e5e91-helm-deployer-6ff7686c6f-zl5rc 1664m 4445Mi
landscaper-test0001-2f9e5e91-7776698fb-lx56p 32m 627Mi
landscaper-test0001-2f9e5e91-main-6bcbd8788c-j25mh 508m 2268Mi
manifest-test0001-2f9e5e91-manifest-deployer-6d546d9c6c-46dz2 20m 845Mi
```

- 2 pods for inst, exec, di controllers: **Duration: 2:24**
## Duration for small numbers without sibling hints

- 3 pods for inst, exec, di controllers: **Duration: 1:21**
The following shows the duration to create or update only a small number of Installations/Executions/DeployItems in a new
and empty namespace, whereby the sibling hints of the third improvement above are not used. The cluster already contains
about 20,000 Installations with 20,000 Executions and 20,000 DeployItems in 20 namespaces.

- 4 pods for inst, exec, di controllers: **Duration: 1:30**
100/100/100: create: 173s - update: 159s - delete: 63s
200/200/200: create: 323s - update: 272s - delete: 115s
300/300/300: create: 413s - update: 401s - delete: 175s
400/400/400: create: 543s - update: 542s - delete: 285s
500/500/500: create: 678s - update: 659s - delete: 394s

- 5 pods for inst, exec, di controllers:
- error with message: 'Op: CreateImportsAndSubobjects - Reason: ReconcileExecution - Message:
Op: errorWithWriteID - Reason: write - Message: write operation w000022 failed
with Get "https://[::1]:443/api/v1/namespaces/cu-test/resourcequotas": dial
tcp [::1]:443: connect: connection refused'
Here are the corresponding numbers if the sibling hints are activated:

The tests show:
100/100/100: create: 109s - update: 106s - delete: 59s
200/200/200: create: 183s - update: 176s - delete: 120s
300/300/300: create: 263s - update: 242s - delete: 204s
400/400/400: create: 356s - update: 329s - delete: 365s
500/500/500: create: 429s - update: 436s - delete: 532s

- Activating the parallelization results in similar performance for the one-pod scenario, though there are more
requests to the API server for synchronization.
- Further increasing the number of pods results in better performance.
- Beyond a certain number of pods, the API server becomes overloaded and the deployment fails.

### Landscaper with restricted Burst and QPS rates
## Improve startup behaviour

These tests were executed with restricted burst and qps rates and no parallelization.
With more k8s objects in a resource cluster, the startup of the Landscaper becomes much slower because all watched
objects are first presented to the controllers. When restarting a Landscaper that watches a resource cluster with about
20,000 Installations, 20,000 Executions and 20,000 DeployItems in 20 namespaces, it takes about 10 minutes
until the Landscaper starts processing newly created Installations.

LaaS version: v0.72.0-dev-baa5654c9e727a70e568e24407277181c0aef1b3
After introducing a startup cache ([see](https://github.com/gardener/landscaper/pull/948)), the Landscaper requires only
about 30 s until the processing of newly created Installations starts.

- burst=30, qps=20: **Duration: 6:48**
- burst=60, qps=40: **Duration: 4:25** (default settings)
- burst=80, qps=60: **Duration: 4:20**
Besides the startup problem, the periodic reconciliation of all watched items of a controller every 10 hours also blocks
the processing of modified items for several minutes. Therefore, the period of this operation was increased to 1000
days, so that it effectively no longer occurs, because the pods are usually restarted much earlier, at the latest during
the regular updates.
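
A sketch of the corresponding manager configuration, mirroring the Go changes in this commit (assuming a
controller-runtime version where `SyncPeriod` is still a top-level manager option, as in the diff above):

```go
package example

import (
	"time"

	"k8s.io/utils/pointer"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManagerWithoutPeriodicResync pushes the periodic resync of all watched
// objects (default roughly 10 hours) out to 1000 days, which effectively
// disables it, since the pods are restarted long before that.
func newManagerWithoutPeriodicResync() (manager.Manager, error) {
	restConfig, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	opts := manager.Options{
		LeaderElection:     false,
		MetricsBindAddress: "0", // disable the metrics serving by default
		SyncPeriod:         pointer.Duration(time.Hour * 24 * 1000),
	}
	return ctrl.NewManager(restConfig, opts)
}
```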

The results show that the default settings already give quite good results.

For settings other than the default, the configuration of the root installation of a Landscaper instance in a LaaS
landscape has to be adapted as follows:

```yaml
landscaperConfig:
  k8sClientSettings:            # changed
    resourceClient:             # changed
      burst: <newValue>         # changed
      qps: <newValue>           # changed
  deployers:
    - helm
    - manifest
    - container
  deployersConfig:              # changed
    helm:                       # changed
      deployer:                 # changed
        k8sClientSettings:      # changed
          resourceClient:       # changed
            burst: <newValue>   # changed
            qps: <newValue>     # changed
    manifest:                   # changed
      deployer:                 # changed
        k8sClientSettings:      # changed
          resourceClient:       # changed
            burst: <newValue>   # changed
            qps: <newValue>     # changed
```

## Conclusions

The communication with the API server of the resource cluster has a big influence on the Landscaper performance.
Relaxing the request rate limits of the k8s client used by the Landscaper results in a speed-up of about a factor of 6.
Parallelization could further improve the performance by a factor of 3.

Unfortunately, if the number of requests to the API server becomes too high, the API server might become unresponsive,
resulting in deployment errors. Due to the large number of different usage scenarios, it is currently hard to judge
which setup is optimal with respect to performance and stability.

For now, we have decided to release the Landscaper without parallelization and with the default restricted burst and qps
rates (60/40). If problems with an overloaded API server occur, the values can be reduced accordingly.

So far, the tests have been quite restricted, and other usage patterns might reveal different bottlenecks, such as high
memory consumption. Therefore, we need to investigate this on our productive landscapes for the different customer scenarios.
8 changes: 6 additions & 2 deletions docs/usage/Optimization.md
@@ -3,12 +3,16 @@
This chapter contains some hints to improve the performance of Landscaper instances.

- Do not create too many Installations, Executions, DeployItems, Targets etc. in one namespace watched by your
Landscaper instance. A reasonable upper bound is about 500 objects for every object type. If you have more
Landscaper instance. A reasonable upper bound is about 200 objects for every object type. If you have more
objects, spread them over more than one namespace.

- If you know that an installation does not import/export data from/to sibling installations or has no
siblings at all, you can specify this in the `spec` of an installation as follows. If nothing is set, the default
value `false` is assumed. This hint prevents the need for complex dependency computation and speads up processing.
value `false` is assumed. This hint prevents the need for complex dependency computation and speeds up processing.
Only use this feature if you are sure about the data exchange of your Installations, because if it is enabled while
siblings are exchanging data, this might produce erratic results. If you can enable this feature for all of your
Installations in a namespace, a reasonable upper limit for the number of objects in this namespace is 500 for every
object type.

```yaml
apiVersion: landscaper.gardener.cloud/v1alpha1
3 changes: 3 additions & 0 deletions pkg/deployer/lib/cmd/default.go
@@ -9,7 +9,9 @@ import (
"errors"
goflag "flag"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

flag "github.com/spf13/pflag"
"golang.org/x/sync/errgroup"
@@ -82,6 +84,7 @@ func (o *DefaultOptions) Complete() error {
opts := manager.Options{
LeaderElection: false,
MetricsBindAddress: "0", // disable the metrics serving by default
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

hostRestConfig, err := ctrl.GetConfig()
