SKS-2157: Cache placement group to reduce Tower API requests #161
Codecov Report
Coverage diff, master vs. #161:
+ Coverage 59.35% 59.58% +0.23%
Files 19 19
Lines 3523 3556 +33
+ Hits 2091 2119 +28
- Misses 1283 1288 +5
Partials 149 149
controllers/vm_limiter.go
@@ -38,7 +38,7 @@ const (
 	placementGroupSilenceTime = time.Minute * 5
 )

-var vmTaskErrorCache = cache.New(5*time.Minute, 10*time.Minute)
+var memoryCache = cache.New(5*time.Minute, 10*time.Minute)
memoryCache -> inMemoryCache
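Also add a comment explaining what this cache is used for.
Updated the verification results and the latest discussion. The remaining issues can wait for the SDK to handle them, or be handled separately.
For this issue, file a ticket and handle it in 1.3.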
Issue
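In an environment where the workload clusters created by a single SKS together contain well over a hundred nodes, the number of calls to the Tower placement group API becomes large (counting calls from CAPE and from other sources).
CAPE associates each node with a placement group, and placement groups are shared among node groups.
Currently, every node Reconcile fetches the placement group information through the Tower API twice: once to confirm that the placement group exists, and once to confirm that the node's virtual machine is still in the placement group.
CAPE resyncs every 10-15m by default. At that point every node is reconciled once, and each Reconcile fetches the placement group information from Tower twice, so the same placement group information is fetched from Tower over and over.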
Change
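Multiple nodes share a placement group, so placement groups are read often and written rarely. Caching them reduces the number of Tower API requests.
If a placement group is modified, its cache entry must be deleted promptly. The cache entry should also be deleted immediately when the placement group is deleted, so that a placement group created right afterwards with the same name does not pick up a stale cache entry.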
Test
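Passed E2E eight times in a row.
Before optimization:
After optimization:
The sudden spike in requests is caused by the Tower service being unavailable (for example, during an update or restart). Once Tower is unavailable, calls to the Tower API return 503 immediately, so each call completes very quickly and CAPE keeps backing off and retrying (the cache may be ineffective during this period).
Related discussion: slack. The sudden spike in requests does not need to be handled in this PR.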