Chore: Figure out appropriate requests and limits for Claudie services #935

Closed
katapultcloud opened this issue Jul 4, 2023 · 6 comments · Fixed by #1055
Labels: chore, groomed


Description

Requests and limits should be adjusted, as the services seem to request far more than they actually need, causing overprovisioning of hardware. I've reviewed the resource consumption of each service (some not included) in the GKE observability console, and this is what I came up with.

Memory

| Service | Util | Recommended | Current |
|---|---|---|---|
| ansibler | 0.8% | 100Mi | 768Mi |
| builder | 1.41% | 50Mi | 200Mi |
| dynamodb | 12% | 200Mi | 512Mi |
| kube-eleven | 1% | 100Mi | 500Mi |
| kuber | 17% | 100Mi | 200Mi |
| mongodb | 68% | stays | 300Mi |
| terraformer | 0.6% | 200Mi | 1200Mi |

CPU

| Service | Util | Recommended | Current |
|---|---|---|---|
| ansibler | 0.1% | 100m | 700m |
| builder | | stays | 80m |
| dynamodb | | stays | 100m |
| kube-eleven | 0.02% | 100m | 500m |
| kuber | 0.02% | 50m | 300m |
| mongodb | 6% | stays | 100m |
| terraformer | 0.03% | 100m | 700m |

However, the statistics in the GKE console are not great, and I'd like to monitor the services for 1-2 weeks before setting these in stone.

Exit criteria

  • Install kube metrics and Prometheus (a query sketch follows this list)
  • Observe for 1-2 weeks
  • Set requests and limits accordingly, taking spikes into account
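For illustration only: once the monitoring stack is in place, peak usage over the observation window could be pulled with PromQL queries such as the ones below. This is a minimal Go sketch, not part of Claudie; the Prometheus URL and the claudie namespace are assumptions and would need to match the actual deployment.

```go
// peaks.go - pull peak memory and CPU usage of Claudie services from Prometheus.
// Assumptions: Prometheus is reachable at PROM_URL and the services run in the
// "claudie" namespace; both must be adjusted for the real cluster.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

// query runs an instant PromQL query against the Prometheus HTTP API.
func query(promURL, promQL string) (json.RawMessage, error) {
	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(promQL))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out struct {
		Status string          `json:"status"`
		Data   json.RawMessage `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Data, nil
}

func main() {
	prom := os.Getenv("PROM_URL") // e.g. http://prometheus.monitoring:9090

	queries := []string{
		// Peak working-set memory per container over the last two weeks.
		`max_over_time(container_memory_working_set_bytes{namespace="claudie", container!=""}[2w])`,
		// Peak 5m-average CPU usage per container over the last two weeks (subquery).
		`max_over_time(rate(container_cpu_usage_seconds_total{namespace="claudie", container!=""}[5m])[2w:5m])`,
	}

	for _, q := range queries {
		data, err := query(prom, q)
		if err != nil {
			fmt.Fprintln(os.Stderr, "query failed:", err)
			continue
		}
		fmt.Printf("%s\n%s\n\n", q, data)
	}
}
```

The request would then sit near typical usage and the memory limit above the observed spikes, as the exit criteria suggest.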
katapultcloud added the chore label Jul 4, 2023
MarioUhrik commented Jul 4, 2023

Have these recommendations taken spikes into account?

We've had several rounds of requests/limits tuning already, and there's a reason why they are roughly where you've found them.

katapultcloud (Author) commented:

@MarioUhrik I don't trust the GKE observability tooling to provide accurate stats including spikes; I think it averages them out quite aggressively. At the bottom I mentioned what the correct steps should be. The tables show what I would consider based on the GKE tooling, but we should not proceed without further investigation using kube metrics and Prometheus.

MarioUhrik commented:

Sounds good, thanks

katapultcloud (Author) commented:

  • monitor e2e cluster for 1-2 weeks
  • monitor mgmt cluster for 1-2 weeks

katapultcloud added the groomed label Jul 7, 2023
JKBGIT1 self-assigned this Jul 12, 2023

JKBGIT1 commented Aug 22, 2023

Here is some data gathered from the monitoring stack while the pipeline ran on the e2e cluster.

Biggest memory spikes over the last 24h

| Day | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| 2.8.2023 | ~600MiB | ~60MiB | ~10MiB | ~120MiB | ~100MiB | ~100MiB | ~240MiB | ~1.45GiB |
| 7.8.2023 | ~605MiB | ~73MiB | ~10MiB | ~128MiB | ~100MiB | ~100MiB | ~230MiB | ~1GiB |
| 14.8.2023 | ~600MiB | ~75MiB | ~12MiB | ~120MiB | ~100MiB | ~190MiB | ~250MiB | ~1.12GiB |
| 18.8.2023 | ~650MiB | ~60MiB | ~17MiB | ~110MiB | ~140MiB | ~120MiB | ~200MiB | ~1.13GiB |

Biggest CPU spikes over the last 24h

| Day | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| 2.8.2023 | ~1400m | - | ~1.6m | ~35m | ~160m | ~350m | ~14m | ~1300m |
| 7.8.2023 | ~1100m | - | ~1.2m | ~14m | ~95m | ~210m | ~10m | ~1140m |
| 14.8.2023 | ~1560m | ~27m | ~4m | ~180m | ~246m | ~476m | ~54m | ~1060m |
| 18.8.2023 | ~1610m | ~19m | ~1.75m | ~140m | ~180m | ~470m | ~95m | ~1520m |

Based on the spikes I have proposed some requests and limits changes, but I am not sure whether they are relevant.

CPU

|  | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| curr request | 700m | 100m | 80m | 100m | 500m | 300m | 100m | 700m |
| curr limit | 1024m | 100m | 160m | 200m | 700m | 500m | 150m | 1024m |
| new request | 1100m | 50m | 5m | - | 250m | - | - | 1024m |
| new limit | 1500m | 75m | 10m | - | 350m | - | - | 1500m |

Memory

|  | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| curr request | 500Mi | 300Mi | 200Mi | 512Mi | 150Mi | 200Mi | 300Mi | 1024Mi |
| curr limit | 900Mi | 300Mi | 400Mi | 1Gi | 300Mi | 400Mi | 500Mi | 1200Mi |
| new request | 600Mi | 80Mi | 15Mi | 120Mi | 120Mi | 150Mi | 250Mi | - |
| new limit | 750Mi | 120Mi | 25Mi | 150Mi | 180Mi | 250Mi | 450Mi | 1500Mi |

see tables in excel


JKBGIT1 commented Sep 29, 2023

We have discussed the new requests and limits with @katapultcloud and @cloudziu on a call. You can see them in the tables below. BTW, we have decided to remove the limits on CPU and keep only the requests.


CPU

|  | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| new request | 200m | - | 10m | 100m | 100m | 100m | 100m | 200m |
| new limit | - | - | - | - | - | - | - | - |

Memory

|  | ansibler | autoscaler | builder | dynamodb | kube-eleven | kuber | mongodb | terraformer |
|---|---|---|---|---|---|---|---|---|
| new request | 600Mi | 100Mi | 15Mi | 120Mi | 120Mi | 100Mi | 200Mi | 1200Mi |
| new limit | 800Mi | 120Mi | 25Mi | 200Mi | 160Mi | 200Mi | 300Mi | 1500Mi |
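For illustration, this is how the agreed ansibler column above could be expressed with the Kubernetes Go API. This is only a sketch, not Claudie's actual manifest (which may set these values elsewhere, e.g. in the Helm chart); note that only memory gets a limit, per the decision to drop CPU limits.

```go
// Illustration: the agreed ansibler values as a Kubernetes ResourceRequirements.
// CPU gets a request only (no limit), per the decision above.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func ansiblerResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("600Mi"),
		},
		Limits: corev1.ResourceList{
			// No CPU limit: the container can burst during spikes without CFS throttling.
			corev1.ResourceMemory: resource.MustParse("800Mi"),
		},
	}
}

func main() {
	fmt.Printf("%+v\n", ansiblerResources())
}
```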
