disable periodic item reconcile (run-int-tests) (gardener#951)
achimweigel authored Jan 19, 2024
1 parent 66a1238 commit 7fbb243
Showing 6 changed files with 117 additions and 100 deletions.
3 changes: 3 additions & 0 deletions cmd/landscaper-agent/app/options.go
@@ -8,7 +8,9 @@ import (
"errors"
goflag "flag"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

flag "github.com/spf13/pflag"
"k8s.io/apimachinery/pkg/runtime/serializer"
@@ -68,6 +70,7 @@ func (o *options) Complete() error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

hostRestConfig, err := ctrl.GetConfig()
3 changes: 3 additions & 0 deletions cmd/landscaper-controller/app/app.go
@@ -7,7 +7,9 @@ package app
import (
"context"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

"github.com/mandelsoft/vfs/pkg/osfs"
"github.com/spf13/cobra"
@@ -85,6 +87,7 @@ func (o *Options) run(ctx context.Context) error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

//TODO: investigate whether this is used with an uncached client
2 changes: 2 additions & 0 deletions cmd/target-sync-controller/app/app.go
@@ -8,6 +8,7 @@ import (
"context"
"fmt"
"os"
"time"

"github.com/spf13/cobra"
"k8s.io/client-go/tools/clientcmd"
@@ -56,6 +57,7 @@ func (o *options) run(ctx context.Context) error {
LeaderElection: false,
Port: 9443,
MetricsBindAddress: "0",
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

data, err := os.ReadFile(o.landscaperKubeconfigPath)
198 changes: 100 additions & 98 deletions docs/technical/performance.md
@@ -1,144 +1,146 @@
# Performance Analysis

This document describes the current state of the performance analysis of Landscaper used in the context
This document describes the latest state of the performance analysis of Landscaper used in the context
of the Landscaper as a Service ([LaaS](https://github.com/gardener/landscaper-service)) with
[Gardener Clusters](https://github.com/gardener/gardener).

# Test 1
## Initial Situation

## Test Setup
Tests with Landscaper version v0.90.0.

Usage of one Landscaper instance of the Dev-Landscape with the test data from
[here](https://github.com/gardener/landscaper-examples/tree/master/scaling/many-deployitems/installation3) consisting of:
Installations were created in one namespace in steps of 200.

- 6 root installations
- 50 sub installations for every root installation
- One deploy item for every sub installation
- Every deploy item deploys a helm chart with a config map with about 1.3 kB of input data
The following shows the duration for each batch of 200 Installations to finish:

## Test Results
- First 200: 183 s
- Next 200: 326 s
- Next 200: 501 s
- Next 200: 598 s
- Next 200: 771 s
- Next 200: 976 s
- Next 200: 1170 s (1 Installation failed)
- Next 200: 1242 s (6 Installations failed)

This chapter shows the duration for deploying the 6 root installations for different versions of the Landscaper and
our current interpretation of the results.
After the creation of these 1600 Installations in one namespace, another batch of 200 Installations was created in
another namespace. The duration for this was 185 s.

### Current Landscaper Version
Conclusion: As the number of Installations in one namespace grows, the duration of their executions increases sharply.
If there are already 1000 Installations in a namespace, the execution of a further 200 Installations requires about
20 minutes. The reason for this is the large number of list operations with label selectors that the Landscaper executes
against the API server of the resource cluster.
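
To illustrate the pattern (a sketch, not actual Landscaper code; the label key and the use of ConfigMaps as a stand-in
for the Landscaper resources are assumptions): a reconcile that finds its child objects via a namespace-wide list with
a label selector issues a call like the following on every run, and with thousands of objects per namespace these
calls dominate the API server load.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listChildrenByLabel lists all objects in a namespace that carry a given owner
// label. Every reconcile that relies on such a call scans the whole namespace
// on the API server side, which becomes expensive with many objects.
func listChildrenByLabel(ctx context.Context, c client.Client, namespace, owner string) (*corev1.ConfigMapList, error) {
	list := &corev1.ConfigMapList{}
	err := c.List(ctx, list,
		client.InNamespace(namespace),
		client.MatchingLabels{"landscaper.gardener.cloud/owner": owner}, // hypothetical label key
	)
	return list, err
}
```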

Tests with the current official Landscaper release with LaaS v0.71.0
## Improvements

- **Duration: 25:00 (minutes/seconds)**
The following improvements were implemented to reduce the number of list operations with label selectors:

Investigations showed that the main reason for the poor performance in the first tests was the request rate limits of the
Kubernetes clients. Log entries like the following indicate this:
- DeployItems cache in the status of Executions: [PR](https://github.com/gardener/landscaper/pull/935)
  - Used to directly access the DeployItems instead of fetching them via list operations (see the Go sketch after this list)
- Subinstallation cache in the status of Installations: [PR](https://github.com/gardener/landscaper/pull/936)
  - Used to directly access the Subinstallations instead of fetching them via list operations
- Sibling import/export hints: [PR](https://github.com/gardener/landscaper/pull/937)
  - Prevents list operations for computing predecessor and successor installations if no data is exchanged
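
A minimal sketch of the idea behind the first two improvements (the cache type and field names are hypothetical; the
real ones are defined in the PRs above): the parent object keeps name references to its children in its status, so they
can be fetched directly via Get instead of a namespace-wide List with a label selector.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// childRef is a hypothetical stand-in for an entry of the DeployItem or
// subinstallation cache stored in the status of an Execution or Installation.
type childRef struct {
	Name      string
	Namespace string
}

// getChildrenFromCache resolves the children with one Get per cached reference
// instead of a list operation with a label selector over the whole namespace.
func getChildrenFromCache(ctx context.Context, c client.Client, refs []childRef) ([]*corev1.ConfigMap, error) {
	children := make([]*corev1.ConfigMap, 0, len(refs))
	for _, ref := range refs {
		child := &corev1.ConfigMap{} // stands in for a Landscaper DeployItem / Installation
		if err := c.Get(ctx, types.NamespacedName{Namespace: ref.Namespace, Name: ref.Name}, child); err != nil {
			return nil, err
		}
		children = append(children, child)
	}
	return children, nil
}
```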

```
Waited for 8.78812882s due to client-side throttling, not priority and fairness, request: GET:https://api.je09c359.laasds.shoot.live.k8s-hana.ondemand.com/apis/landscaper.gardener.cloud/v1alpha1
```
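
For reference, a sketch of how such client-side limits can be relaxed with client-go and controller-runtime before the
manager and its clients are created; the value 10000 corresponds to the "very high request rate limits" used in the
tests below. In the Landscaper/LaaS setup these values are configured via the `k8sClientSettings` shown later in this
document.

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// managerWithHighRateLimits builds a controller-runtime manager whose clients
// are no longer delayed by client-side throttling.
func managerWithHighRateLimits() (ctrl.Manager, error) {
	restConfig, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	// Raise the client-side rate limits; the low defaults (qps=5, burst=10 in
	// plain client-go) are what cause the throttling messages shown above.
	restConfig.QPS = 10000
	restConfig.Burst = 10000

	return ctrl.NewManager(restConfig, ctrl.Options{})
}
```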
The improvements were tested with the following test setup:

- One Landscaper instance with 10 namespaces.
- In every namespace about 1000 Installations with 1000 Executions and 1000 DeployItems. The DeployItems just
install a configmap. There are no sibling exports or imports and these flags are set to true in the Installations.
- One helm deployer pod with 120 worker threads.
- One main controller pod with 60 worker threads for Installations and 60 worker threads for Executions.

The tests were executed with an old Landscaper version v0.90.0 and a Landscaper with the improvements described above.

Test results:

### Landscaper with improved client request rate limits
- Creation of 1000 Installations/1000 Executions/1000 Deploy Items in one namespace
- Duration before optimisation: 3046s
- Duration after optimisation: 1050s

Tests with a Landscaper with a client having very high request rate limits (burst rate and queries per second = 10000).
- Update of 1000 Installations/1000 Executions/1000 Deploy Items in one namespace
- Duration before optimisation: 3601s
- Duration after optimisation: 1166s

- 30 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-11d2919a8e2bce4a02c3928f7a49fe183d35f63d)
- **Duration: 4:16**

The creation and update times for 1000/1000/1000 objects remained stable up to 20,000 Installations with 20,000 Executions
and 20,000 DeployItems created in 20 different namespaces. No tests with more objects have been executed so far.

- 60 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-8db791bf996047f1b849207472ff9d97bac80481)
- **Duration: 4:10**
## Comparison with cached client

The optimized version was compared with a version using a cached k8s client. The test setup was similar to that of the
previous chapter.

- 120 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-66eb650b1156d7eaced0b3e63def4a8dc0f6cbff)
- **Duration: 5:02**
- Creation of 500 Installations/500 Executions/500 Deploy Items in another namespace
- Duration with optimisation: 400s
- Duration with cached client: 228s

- Update of 500 Installations/500 Executions/500 Deploy Items in one namespace
- Duration with optimisation: 389s
- Duration with cached client: 217s

- 310 worker threads for installations, executions, deploy items (LaaS version: v0.72.0-dev-eb5bb0f8424f25a6ae2871e0bc9f1c50d35228f8)
- **Duration: 4:41**
The memory consumption of the version with the cached client was about ten times higher than that of the optimized version:


The tests show:
- The performance is much better compared to the k8s client with rate limiting.
- The number of parallel worker threads should not be increased too far.

### Landscaper with improved client request rate limits and parallelisation
**Memory consumption of the optimised version:**

Tests with a Landscaper with a client having very high request rate limits (burst rate and queries per second (qps) = 10000)
and multiple replicas for the pods running the controllers for installations, executions and helm deploy items.
```
NAME CPU(cores) MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-5f646cff6-5vqjd 2m 98Mi
helm-test0001-2f9e5e91-helm-deployer-d8b7744b6-wxslx 312m 318Mi
landscaper-test0001-2f9e5e91-7f844f9f7c-9mfsq 9m 157Mi
landscaper-test0001-2f9e5e91-main-545ccccc6d-75qpl 164m 343Mi
manifest-test0001-2f9e5e91-manifest-deployer-7c555589bd-wdwf8 2m 79Mi
```

LaaS version: v0.72.0-dev-7f456ae4edb6a86847bb210e25ef9c3f26ed6ada
**Memory consumption of version with cached client:**

- 1 pod for inst, exec, di controllers: **Duration: 4:16**
```
NAME CPU(cores) MEMORY(bytes)
container-test0001-2f9e5e91-container-deployer-697d7b6449-6b6vf 15m 240Mi
helm-test0001-2f9e5e91-helm-deployer-6ff7686c6f-zl5rc 1664m 4445Mi
landscaper-test0001-2f9e5e91-7776698fb-lx56p 32m 627Mi
landscaper-test0001-2f9e5e91-main-6bcbd8788c-j25mh 508m 2268Mi
manifest-test0001-2f9e5e91-manifest-deployer-6d546d9c6c-46dz2 20m 845Mi
```

- 2 pods for inst, exec, di controllers: **Duration: 2:24**
## Duration for small numbers without sibling hints

- 3 pods for inst, exec, di controllers: **Duration: 1:21**
The following shows the duration to create or update only a small number of Installations/Executions/DeployItems in a new
and empty namespace, whereby the sibling hints of the third improvement above are not used. The cluster already contains
about 20,000 Installations with 20,000 Executions and 20,000 DeployItems in 20 namespaces.

- 4 pods for inst, exec, di controllers: **Duration: 1:30**
100/100/100: create: 173s - update: 159s - delete: 63s
200/200/200: create: 323s - update: 272s - delete: 115s
300/300/300: create: 413s - update: 401s - delete: 175s
400/400/400: create: 543s - update: 542s - delete: 285s
500/500/500: create: 678s - update: 659s - delete: 394s

- 5 pods for inst, exec, di controllers:
- error with message: 'Op: CreateImportsAndSubobjects - Reason: ReconcileExecution - Message:
Op: errorWithWriteID - Reason: write - Message: write operation w000022 failed
with Get "https://[::1]:443/api/v1/namespaces/cu-test/resourcequotas": dial
tcp [::1]:443: connect: connection refused'
Here are the corresponding numbers if the sibling hints are activated:

The tests show:
100/100/100: create: 109s - update: 106s - delete: 59s
200/200/200: create: 183s - update: 176s - delete: 120s
300/300/300: create: 263s - update: 242s - delete: 204s
400/400/400: create: 356s - update: 329s - delete: 365s
500/500/500: create: 429s - update: 436s - delete: 532s

- Activating the parallelization results in similar performance for the one-pod scenario, though there are more
requests to the API server for synchronization.
- Further increasing the number of pods results in better performance.
- Beyond a certain number of pods, the API server becomes overloaded and the deployment fails.

### Landscaper with restricted Burst and QPS rates
## Improve startup behaviour

These tests were executed with restricted burst and qps rates and no parallelization.
With more k8s objects in a resource cluster, the startup of the Landscaper becomes much slower because all watched
objects are first presented to the controllers. When restarting a Landscaper that watches a resource cluster with about
20,000 Installations, 20,000 Executions and 20,000 DeployItems in 20 namespaces, it takes about 10 minutes
until the Landscaper starts processing newly created Installations.

LaaS version: v0.72.0-dev-baa5654c9e727a70e568e24407277181c0aef1b3
After introducing a startup cache ([see](https://github.com/gardener/landscaper/pull/948)), the Landscaper requires only
about 30 s until the processing of newly created Installations starts.

- burst=30, qps=20: **Duration: 6:48**
- burst=60, qps=40: **Duration: 4:25** (default settings)
- burst=80, qps=60: **Duration: 4:20**
Besides the startup problem, the periodic reconciliation of all watched items of a controller every 10 hours also blocks
the processing of modified items for several minutes. Therefore, the period of this operation was increased to 1000
days, so that it effectively no longer occurs, because the pods are usually restarted much earlier, at the latest during
the regular updates.
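
A sketch of the corresponding manager configuration, mirroring the Go changes in this commit (assuming a
controller-runtime version where `SyncPeriod` is still a top-level manager option, as in the diff above):

```go
package example

import (
	"time"

	"k8s.io/utils/pointer"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManagerWithoutPeriodicResync pushes the periodic resync of all watched
// objects (default roughly 10 hours) out to 1000 days, which effectively
// disables it, since the pods are restarted long before that.
func newManagerWithoutPeriodicResync() (manager.Manager, error) {
	restConfig, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	opts := manager.Options{
		LeaderElection:     false,
		MetricsBindAddress: "0", // disable the metrics serving by default
		SyncPeriod:         pointer.Duration(time.Hour * 24 * 1000),
	}
	return ctrl.NewManager(restConfig, opts)
}
```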

The results show that the default settings already give quite good results.

For settings other than the default, the configuration of the root installation of a Landscaper instance in a LaaS
landscape has to be adapted as follows:

```yaml
landscaperConfig:
  k8sClientSettings:            # changed
    resourceClient:             # changed
      burst: <newValue>         # changed
      qps: <newValue>           # changed
  deployers:
    - helm
    - manifest
    - container
  deployersConfig:              # changed
    helm:                       # changed
      deployer:                 # changed
        k8sClientSettings:      # changed
          resourceClient:       # changed
            burst: <newValue>   # changed
            qps: <newValue>     # changed
    manifest:                   # changed
      deployer:                 # changed
        k8sClientSettings:      # changed
          resourceClient:       # changed
            burst: <newValue>   # changed
            qps: <newValue>     # changed
```

## Conclusions

The communication with the API server of the resource cluster has a big influence on the Landscaper performance.
Relaxing the request rate limits of the k8s client used by the Landscaper results in a speed-up of about a factor of 6.
Parallelization could further improve the performance by a factor of 3.

Unfortunately, if the number of requests to the API server becomes too high, the API server might become unresponsive,
resulting in deployment errors. Due to the large number of different usage scenarios, it is currently hard to judge
which setup is optimal with respect to performance and stability.

For now, we have decided to release the Landscaper without parallelization and with the default restricted burst and qps
rates (60/40). If problems with an overloaded API server occur, the values can be reduced accordingly.

So far, the tests have been quite restricted, and other usage patterns might reveal different bottlenecks, such as high
memory consumption. Therefore, we need to investigate this on our productive landscapes for the different customer scenarios.
8 changes: 6 additions & 2 deletions docs/usage/Optimization.md
@@ -3,12 +3,16 @@
This chapter contains some hints to improve the performance of Landscaper instances.

- Do not create too many Installations, Executions, DeployItems, Targets etc. in one namespace watched by your
Landscaper instance. A reasonable upper bound is about 500 objects for every object type. If you have more
Landscaper instance. A reasonable upper bound is about 200 objects for every object type. If you have more
objects, spread them over more than one namespace.

- If you know that an installation does not import/export data from/to sibling installations or has no
siblings at all, you can specify this in the `spec` of an installation as follows. If nothing is set, the default
value `false` is assumed. This hint prevents the need for complex dependency computation and speads up processing.
value `false` is assumed. This hint prevents the need for complex dependency computation and speeds up processing.
Only use this feature if you are sure about the data exchange of your Installations, because if it is enabled while
siblings are exchanging data, this might produce erratic results. If you can enable this feature for all of your
Installations in a namespace, a reasonable upper limit for the number of objects in this namespace is 500 for every
object type.

```yaml
apiVersion: landscaper.gardener.cloud/v1alpha1
3 changes: 3 additions & 0 deletions pkg/deployer/lib/cmd/default.go
@@ -9,7 +9,9 @@ import (
"errors"
goflag "flag"
"fmt"
"k8s.io/utils/pointer"
"os"
"time"

flag "github.com/spf13/pflag"
"golang.org/x/sync/errgroup"
@@ -82,6 +84,7 @@ func (o *DefaultOptions) Complete() error {
opts := manager.Options{
LeaderElection: false,
MetricsBindAddress: "0", // disable the metrics serving by default
SyncPeriod: pointer.Duration(time.Hour * 24 * 1000),
}

hostRestConfig, err := ctrl.GetConfig()
