[oCIS] Scaling oCIS in kubernetes causes requests to fail Timebox 8PD #8589
Hm, I don't remember exactly what I mentioned, but IIRC the biggest issues with shutdown were related to running ocis in single-binary mode, because reva just does an … When running as separate services there is already the possibility of a more graceful shutdown for the reva services. By default reva does this only when shut down via SIGQUIT. When a setting …
Please also see https://github.com/owncloud/enterprise/issues/6441:
For a specific Kubernetes environment with Cilium: if we could just configure hostnames / DNS names and not use the micro registry, we could probably leverage Cilium for load balancing: https://docs.cilium.io/en/stable/network/servicemesh/envoy-load-balancing/ (but it is in beta). Please also be aware of the "retry" concept: https://github.com/grpc/grpc-go/blob/master/examples/features/retry/README.md
@butonic @kobergj @dragonchaser I think we should start working on 2)
What is the current state of this? We found a few bugs that explain why the search service was not scaling. AFAICT we need to reevaluate this with a load test.
There are two options:

1. use native gRPC mechanisms to retry
2. generate go micro clients for the CS3 API

I'd vote for the latter, because go micro already retries requests that time out and we want to move some services into ocis anyway. A first step could be to generate go micro clients for the gateway so our graph service can use them to make CS3 calls against the gateway. Another step would be to bring ocdav to ocis ... and then replace all grpc clients with go micro generated clients. This is a ton of work. 😞

Note that using the native gRPC client and teaching it to retry also requires configuring which calls should be retried. Maybe we can just tell the grpc client in the reva pool to retry all requests? Then we would still have two ways of making requests ...

I'm not sure we can use the native gRPC retry mechanism, because we are using a single IP address that has been resolved with a micro selector. AFAICT the grpc client cannot use DNS to find the next IP. Two worlds are colliding here ... 💥
Furthermore, I still want to be able to use an in-memory transport, which we could use when embracing go micro further.
We have to discuss this in #9321.
Priority increased because multiple customers are affected.
@dj4oC Can you please provide more info from the other customers too?
The customer (@grischdian & @blicknix) reports that after kubectl patch oCIS does not work because new requests still try to reach old pods. kubectl deploy, on the other hand, does work, because the registration is done from scratch (new pods all over). Unfortunately we cannot export logs due to security constraints. Deployment is done with OpenShift and ArgoCD.
Um, did you mean apply? What …
@dj4oC @grischdian @blicknix the built-in nats in the ocis helm chart cannot be scaled. You have to keep the replica count at 1. If you need a redundant deployment, use a dedicated nats cluster. Running multiple nats instances from the ocis chart causes a split-brain situation where service lookups might return stale data. This is related to kubernetes scale up / down, but we tackled scale up and should pick up new pods properly. This issue is tracking scale-down problems, which we can address by retrying calls. Unfortunately, that is a longer path because we need to touch a lot of code.
We only have one nats pod in the environment, as it is only a dev environment. So no split brain.
I think I have found a way to allow using the native grpc-go thick-client round-robin load balancing using the … This works without ripping out the go micro service registry, but we need to test these changes with helm charts that use headless services and configure the grpc protocol to be … 🤔 hm, and we may have to register the service with its domain name ... not the external IP ... urgh ... needs more work.
@wkloucek @d7oc what were the problems when ocis was using the kubernetes service registry? AFAIK etcd was under heavy load. @dragonchaser mentioned that it is possible to set up an etcd per namespace to shard the load. When every school uses ~40 pods and every pod registers a watcher on the kubernetes API (provided by etcd) and reregisters itself every 30 sec, that does create some load. I don't know whether the go micro kubernetes registry subscribes to ALL pod events or if it is even possible to only receive events for a single namespace. I can imagine that propagating every pod change to every watcher might cause load problems. So if you can shed some light on why the kubernetes registry was 'bad' I'd be delighted.
Having no SLA for the control plane would be an argument for me not to use the "Kubernetes" service registry. If the control plane had a 5 minute downtime, this would create a roughly equal oCIS downtime. Especially if you have no control over WHEN control plane maintenance is performed, this might be a blocker, since it might conflict with your SLAs for the oCIS workload. Having the service registry component, like NATS for the "nats-js-kv" service registry, running on the Kubernetes workers provides a good separation between workload and control plane.
Looking at #9535 might explain some things:
yeah, because only one pod might ever receive load: the registry only holds one registered service instance
yeah, because only one service instance is known at all. So next() will always use the same one until the TTL expires
ok, to double check we reproduced the broken nats-js-kv registry behaviour:
continuously PROPFIND einstein's home like this (replace your storage id):
watch gateway logs to see which pod receives the traffic:
increase the number of replicas for the gateway deployment to 3 and observe the above log output. In 5.0.6 all requests remain on the same pod.
with ocis-rolling@master, upscaling picks up new pods and distributes the load to the gateway pods properly. When scaling down we see intermittent 401 and some 500 responses to the PROPFIND for ~10 sec, then all requests return to 207. Note that in those 10 seconds there is not a single 207. Presumably the auth services cannot connect to the gateway either, explaining the short 4 ms 401 responses. We will verify and dig into the scale-down tomorrow ...
For reference: the server-side connection management with …
so now I see the storageusers pods being OOM killed as described in #9656 (comment). I currently think we are running into kubernetes/kubernetes#43916 (comment)

edit: @wkloucek pointed out that we have to disable mime multipart uploads because it allocates too much memory:

```yaml
driver: s3ng
driverConfig:
  s3ng:
    metadataBackend: messagepack
    endpoint: ...
    region: ...
    bucket: ...
    putObject:
      # -- Disable multipart uploads when copying objects to S3
      disableMultipart: true
```

now tests are more stable:
hmmm, but I still got a kill:
setting …
forcing a guaranteed QoS class by setting the memory limit equal to the request also does not stop kubernetes from OOMKilling things
it might not be the storage users pod ... I need to better understand the events:
Also, running the tests with more than 150 VUs fails ... I need to check if enough users are available ...
nats-js-kv-registry still seems broken. we tried disabling the cache but still see old ips show up ... 😞
hm, a micro Selector always uses a cache with a default TTL of 1 minute:

```go
// NewSelector creates a new default selector.
func NewSelector(opts ...Option) Selector {
	sopts := Options{
		Strategy: Random,
	}
	for _, opt := range opts {
		opt(&sopts)
	}
	if sopts.Registry == nil {
		sopts.Registry = registry.DefaultRegistry
	}
	s := &registrySelector{
		so: sopts,
	}
	s.rc = s.newCache()
	return s
}
```

and we use that selector at least in our proxy/pkg/router/router.go:

```go
reg := registry.GetRegistry()
sel := selector.NewSelector(selector.Registry(reg))
```
we fixed more issues
these fixes bring us to a more reasonable loadtest. Sharing and tagging seem broken, though. Tagging is a known issue AFAIK, but sharing used to work.
The next steps for this are:
before we can close this issue we need to evaluate how the 1h load tests behave. Before the login problems are fixed this is blocked.
According to https://github.com/owncloud-koko/deployment-documentation/tree/main/development/loadtest/de-environment each loadtest school has 5000 users configured. Also, Keycloak should be scaled like the one on PROD (CPU / RAM). The realm settings regarding brokering etc. should differ though, because we don't really have another IDM that we can broker.
we need to backport all natsjskv registry fixes from #8589 (comment) to stable5. |
I backported the nats-js-kv registry fixes in #10019
During loadtests we seem to be losing requests. We have identified several possible causes:
1. when a new pod is added it does not seem to receive traffic
This might be caused by clients not picking up the new service. One reason would be that the same grpc connection is reused. We need to make sure that every service uses a selector.Next() call to get a fresh client from the registry.
2. when a pod is shut down, because kubernetes moves it to a different node or it is descheduled, it still receives traffic
This might be caused by latency: the client got a grpc client with selector.Next(), but the pod was killed before the request reached it. We should retry requests, but the grpc built-in retry mechanism would need to know all possible services. That is not how the reva pool works.
We could configure the grpc connection to retry requests:
but they would just retry the same IP. To actually send requests to different servers, aka client-side load balancing, we would have to add something like:
The load balancing works based on name resolving.
We could add all this to the reva pool ... or we use a go micro grpc client that already implements a pool, integrates with the service registry and can do retry, backoff and whatnot. But this requires generating micro clients for the cs3 api using
github.com/go-micro/generator/cmd/protoc-gen-micro
3. pod readiness and health endpoints do not reflect the actual state of the pod
Currently, the /healthz and /readyz endpoints are independent from the actual service implementation. But some services need some time to be ready, or need to flush all requests on shutdown. This also needs to be investigated. For readiness we could use a channel to communicate between the actual handler and the debug handler.
And AFAIR @rhafer mentioned we need to take care of shutdown functions ... everywhere.
4. the services are needlessly split into separate pods
Instead of starting a pod for every service we should aggregate all processes that are involved in translating a request until it reaches a storage provider:
The services should use localhost or even unix sockets to talk to each other. Go can use the resources in a pod very efficiently and handle requests concurrently. We currently only create a ton of overhead that stresses the kubernetes APIs, and it can be reduced.