[oCIS] Scaling oCIS in kubernetes causes requests to fail Timebox 8PD #8589
Hm, I don't remember exactly what I mentioned, but IIRC the biggest issues with shutdown were related to running ocis in single-binary mode, because reva just does an … When running as separate services there is already the possibility of a more graceful shutdown for the reva services. By default reva does this only when shut down via SIGQUIT. When a setting …
Please also see https://github.com/owncloud/enterprise/issues/6441:
For a specific Kubernetes environment with Cilium: if we could just configure hostnames / DNS names and not use the micro registry, we could probably leverage Cilium for load balancing: https://docs.cilium.io/en/stable/network/servicemesh/envoy-load-balancing/ (but it is in beta). Please also be aware of the "retry" concept: https://github.com/grpc/grpc-go/blob/master/examples/features/retry/README.md
@butonic @kobergj @dragonchaser I think we should start working on 2)
What is the current state of this? We found a few bugs that explain why the search service was not scaling. AFAICT we need to reevaluate this with a load test.
There are two options:

1. use native gRPC mechanisms to retry
2. generate go micro clients for the CS3 API

I'd vote for the latter, because go micro already retries requests that time out and we want to move some services into ocis anyway. A first step could be to generate go micro clients for the gateway so our graph service can use them to make CS3 calls against the gateway. Another step would be to bring ocdav to ocis ... and then replace all grpc clients with go micro generated clients. This is a ton of work. 😞

Note that using the native gRPC client and teaching it to retry also requires configuring which calls should be retried. Maybe we can just tell the grpc client in the reva pool to retry all requests? Then we would still have two ways of making requests ...

I'm not sure we can use the native gRPC retry mechanism, because we are using a single IP address that has been resolved with a micro selector. AFAICT the grpc client cannot use DNS to find the next IP. Two worlds are colliding here ... 💥
Furthermore, I still want to be able to use an in-memory transport, which we could use when embracing go micro further.
We have to discuss this in #9321.
Priority increased because multiple customers are affected.
@dj4oC Can you please provide more info from the other customers too?
The customer (@grischdian & @blicknix) reports that after kubectl patch oCIS does not work because new requests still try to reach old pods. kubectl deploy, on the other hand, does work, because the registration is done from scratch (new pods all over). Unfortunately we cannot export logs due to security constraints. Deployment is done with OpenShift and ArgoCD.
Um, did you mean apply? What …
@dj4oC @grischdian @blicknix the built-in nats in the ocis helm chart cannot be scaled. You have to keep the replica count at 1. If you need a redundant deployment, use a dedicated nats cluster. Running multiple nats instances from the ocis chart causes a split-brain situation where service lookups might return stale data. This is related to kubernetes scale up / down, but we tackled scale up and should pick up new pods properly. This issue is tracking scale-down problems, which we can address by retrying calls. Unfortunately, that is a longer path because we need to touch a lot of code.
We only have one nats pod in the environment, as it is only a dev environment. So no split brain.
I think I have found a way to allow using the native grpc-go thick-client round-robin load balancing using the … This works without ripping out the go micro service registry, but we need to test these changes with helm charts that use headless services and configure the grpc protocol to be … 🤔 hm, and we may have to register the service with its domain name ... not the external IP ... urgh ... needs more work.
@wkloucek @d7oc what were the problems when ocis was using the kubernetes service registry? AFAIK etcd was under heavy load. @dragonchaser mentioned that it is possible to set up an etcd per namespace to shard the load. When every school uses ~40 pods and every pod registers a watcher on the kubernetes API (provided by etcd) and reregisters itself every 30 sec, that does create some load. I don't know whether the go micro kubernetes registry subscribes to ALL pod events or if it is even possible to only receive events for a single namespace. I can imagine that propagating every pod change to every watcher might cause load problems. So if you can shed some light on why the kubernetes registry was 'bad' I'd be delighted.
Having no SLA for the control plane would be an argument for me not to use the "Kubernetes" service registry. If the control plane had a 5 minute downtime, this would create a roughly equal oCIS downtime. Especially if you have no control over WHEN control plane maintenance is performed, this might be a blocker, since it might conflict with your SLAs for the oCIS workload. Having the service registry component, like NATS for the "nats-js-kv" service registry, running on the Kubernetes workers provides a good separation between workload and control plane.
Looking at #9535 might explain some things:
yeah, because only one pod might ever receive load: the registry only holds one registered service instance
yeah, because only one service instance is known at all. So next() will always use the same one until the TTL expires
ok, to double check we reproduced the broken nats-js-kv registry behaviour:
continuously PROPFIND einstein's home like this (replace your storage id):
watch gateway logs to see which pod receives the traffic:
increase the number of replicas for the gateway deployment to 3 and observe the above log output. In 5.0.6 all requests remain on the same pod.
with ocis-rolling@master, upscaling picks up new pods and distributes the load to the gateway pods properly. When scaling down we see intermittent 401 and some 500 responses to the PROPFIND for ~10 sec, then all requests return to 207. Note that in those 10 seconds there is not a single 207. Presumably the auth services cannot connect to the gateway either, explaining the short 4 ms 401 responses. We will verify and dig into the scale-down tomorrow ...
For reference: the server-side connection management with …
so now I see the storageusers pods being OOM killed as described in #9656 (comment). I currently think we are running into kubernetes/kubernetes#43916 (comment)

edit: @wkloucek pointed out that we have to disable mime multipart uploads because it allocates too much memory:

```yaml
driver: s3ng
driverConfig:
  s3ng:
    metadataBackend: messagepack
    endpoint: ...
    region: ...
    bucket: ...
    putObject:
      # -- Disable multipart uploads when copying objects to S3
      disableMultipart: true
```

now tests are more stable:
hmmm, but I still got a kill:
setting …
forcing a guaranteed QoS class by setting the memory limit equal to the request also does not stop kubernetes from OOMKilling things
it might not be the storage users pod ... I need to better understand the events:
Also, running the tests with more than 150 VUs fails ... I need to check if enough users are available ...
nats-js-kv-registry still seems broken. we tried disabling the cache but still see old ips show up ... 😞
hm, a micro Selector always uses a cache with a default TTL of 1 minute:

```go
// NewSelector creates a new default selector.
func NewSelector(opts ...Option) Selector {
	sopts := Options{
		Strategy: Random,
	}
	for _, opt := range opts {
		opt(&sopts)
	}
	if sopts.Registry == nil {
		sopts.Registry = registry.DefaultRegistry
	}
	s := &registrySelector{
		so: sopts,
	}
	s.rc = s.newCache()
	return s
}
```

and we use that selector at least in our proxy/pkg/router/router.go:

```go
reg := registry.GetRegistry()
sel := selector.NewSelector(selector.Registry(reg))
```
we fixed more issues
these fixes bring us to a more reasonable loadtest. Sharing and tagging seem broken, though. Tagging is a known issue AFAIK, but sharing used to work.
The next steps for this are:
before we can close this issue we need to evaluate how the 1h load tests behave. Before the login problems are fixed this is blocked.
According to https://github.com/owncloud-koko/deployment-documentation/tree/main/development/loadtest/de-environment each loadtest school has 5000 users configured. Also, Keycloak should be scaled like the one on PROD (CPU / RAM). The realm settings regarding brokering etc. should differ though, because we don't really have another IDM that we can broker.
we need to backport all natsjskv registry fixes from #8589 (comment) to stable5. |
I backported the nats-js-kv registry fixes in #10019
During loadtests we seem to be losing requests. We have identified several possible causes:
1. when a new pod is added it does not seem to receive traffic
This might be caused by clients not picking up the new service. One reason would be that the same grpc connection is reused. We need to make sure that every service uses a selector.Next() call to get a fresh client from the registry.
2. when a pod is shut down, because kubernetes moves it to a different node or it is descheduled, it still receives traffic
This might be caused by latency: the client got a grpc client with selector.Next(), but the pod was killed before the request reached it. We should retry requests, but the grpc built-in retry mechanism would need to know all possible services. That is not how the reva pool works.
We could configure the grpc connection to retry requests:
but they would just retry the same IP. To actually send requests to different servers, aka client-side load balancing, we would have to add something like:
The load balancing works based on name resolving.
We could add all this to the reva pool ... or we use a go micro grpc client that already implements a pool, integrates with the service registry and can do retry, backoff and whatnot. But this requires generating micro clients for the cs3 api using
github.com/go-micro/generator/cmd/protoc-gen-micro
3. pod readiness and health endpoints do not reflect the actual state of the pod
Currently, the /healthz and /readyz endpoints are independent from the actual service implementation. But some services need some time to be ready, or need to flush all requests on shutdown. This also needs to be investigated. For readiness we could use a channel to communicate between the actual handler and the debug handler.
And AFAIR @rhafer mentioned we need to take care of shutdown functions ... everywhere.
4. the services are needlessly split into separate pods
Instead of starting a pod for every service we should aggregate all processes that are involved in translating a request until it reaches a storage provider:
The services should use localhost or even unix sockets to talk to each other. Go can use the resources in a pod very efficiently and handle requests concurrently. We currently only create a ton of overhead that stresses the kubernetes APIs, and it can be reduced.