SKS-2157: Cache placement group to reduce Tower API requests #161

Merged: 3 commits, Dec 7, 2023

@haijianyang haijianyang commented Dec 5, 2023

Issue

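In an environment where the workload clusters created by a single SKS instance contain well over a hundred nodes in total, the number of calls to the Tower placement group API becomes very large (counting calls from CAPE and from other sources).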
image

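CAPE associates each node with a placement group, and placement groups are shared among node groups.

Currently, every time a node is reconciled, CAPE fetches the placement group from the Tower API twice: once to confirm that the placement group exists, and once to confirm that the node's virtual machine is still in the placement group.

CAPE resyncs every 10-15 minutes by default. At that point every node is reconciled once, which means two placement group fetches from Tower per node, so the same placement group information is fetched from Tower repeatedly.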
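To make the scale concrete, here is a rough back-of-the-envelope sketch. The function names and the figures passed to them are illustrative assumptions, not measurements from this PR; the only number taken from the description is the two lookups per node.

```go
package main

// callsPerResync estimates how many placement group lookups one resync
// round triggers when every node reconciles and nothing is cached.
// lookupsPerNode is 2 in the flow described above (existence check plus
// VM membership check).
func callsPerResync(nodes, lookupsPerNode int) int {
	return nodes * lookupsPerNode
}

// cachedCallsPerResync is the best case with a cache: one fetch per
// distinct placement group, assuming entries outlive the round and
// nothing invalidates them in the meantime.
func cachedCallsPerResync(placementGroups int) int {
	return placementGroups
}
```

For example, with an assumed 150 nodes, `callsPerResync(150, 2)` gives 300 lookups per round, while the cached best case depends only on how many distinct placement groups exist.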

Change

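Multiple nodes share a placement group, so placement groups are read often and written rarely. Caching them reduces the number of Tower API requests.

When a placement group is modified, its cache entry must be deleted promptly. The entry must also be deleted immediately when the placement group itself is deleted, so that a placement group created right afterwards with the same name does not pick up a stale cache entry.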

Test

E2E passed eight times in a row.

Before the optimization:
image

After the optimization:
image

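The sudden spikes in requests occur when the Tower service is unavailable (for example, during an upgrade or a restart). Once Tower is unavailable, calls to the Tower API return 503 immediately, so each call completes very quickly and CAPE keeps retrying with backoff (the cache may be ineffective at this point).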
image

Related discussion: slack. The sudden-spike problem does not need to be handled in this PR.

@haijianyang haijianyang requested a review from jessehu December 5, 2023 07:24

codecov bot commented Dec 5, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (04ef554) 59.35% compared to head (47e612d) 59.58%.

| Files | Patch % | Lines |
|---|---|---|
| pkg/service/vm.go | 0.00% | 9 Missing ⚠️ |
| controllers/tower_cache.go | 92.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #161      +/-   ##
==========================================
+ Coverage   59.35%   59.58%   +0.23%     
==========================================
  Files          19       19              
  Lines        3523     3556      +33     
==========================================
+ Hits         2091     2119      +28     
- Misses       1283     1288       +5     
  Partials      149      149              

controllers/tower_cache.go (outdated)
@@ -38,7 +38,7 @@ const (
 	placementGroupSilenceTime = time.Minute * 5
 )
 
-var vmTaskErrorCache = cache.New(5*time.Minute, 10*time.Minute)
+var memoryCache = cache.New(5*time.Minute, 10*time.Minute)
Collaborator

memoryCache -> inMemoryCache

Collaborator

Please also add a comment explaining what this cache is used for.

@haijianyang
Contributor Author

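Updated the verification results and the latest discussion. The remaining issue can wait for the SDK to handle it, or be handled separately.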

@haijianyang haijianyang requested a review from jessehu December 7, 2023 07:01
@jessehu
Collaborator

jessehu commented Dec 7, 2023

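> The sudden spikes in requests occur when the Tower service is unavailable (for example, during an upgrade or a restart). Once Tower is unavailable, calls to the Tower API return 503 immediately, so each call completes very quickly and CAPE keeps retrying with backoff (the cache may be ineffective at this point).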

For this issue, let's file a ticket and handle it in 1.3.

@haijianyang haijianyang merged commit 319a4fb into smartxworks:master Dec 7, 2023
3 checks passed