-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA: refactor utils related to NodeInfos #7479
Conversation
/assign @MaciekPytel |
f013ec0
to
5a9ff97
Compare
5a9ff97
to
3a7c44f
Compare
/assign @BigDarkClown |
3a7c44f
to
e0024b0
Compare
e0024b0
to
a7eea5e
Compare
@@ -34,7 +34,7 @@ import ( | |||
"k8s.io/autoscaler/cluster-autoscaler/core/scaledown/planner" | |||
scaledownstatus "k8s.io/autoscaler/cluster-autoscaler/core/scaledown/status" | |||
"k8s.io/autoscaler/cluster-autoscaler/core/scaleup" | |||
orchestrator "k8s.io/autoscaler/cluster-autoscaler/core/scaleup/orchestrator" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes
a7eea5e
to
89a5259
Compare
} | ||
nodeInfo, err := simulator.BuildNodeInfoForNode(sanitizedNode, podsForNodes[node.Name], daemonsets, p.forceDaemonSets) | ||
templateNodeInfo, caErr := simulator.TemplateNodeInfoFromExampleNodeInfo(nodeInfo, id, daemonsets, p.forceDaemonSets, taintConfig) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be if caErr != nil {
here?
(also do we need to define a new caErr
variable here instead of just re-using err
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh my bad, good catch!
As to why we need 2 variables:
GetNodeInfo
returns theerror
interface, soerr
has typeerror
.TemplateNodeInfoFromExampleNodeInfo
returns theerrors.AutoscalerError
interface.- We need to return the
errors.AutoscalerError
interface fromProcess()
. - We could technically assign the
TemplateNodeInfoFromExampleNodeInfo
error of typeerrors.AutoscalerError
toerr
, sinceerrors.AutoscalerError
is a superset oferror
. But then we'd have to wrap it inerrors.AutoscalerError
right back to return it, which is IMO confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK, all makes sense!
for _, slice := range n.LocalResourceSlices { | ||
newSlices = append(newSlices, slice.DeepCopy()) | ||
} | ||
return NewNodeInfo(n.Node().DeepCopy(), newSlices, newPods...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the NewNodeInfo
constructor only sets a node object if the passed in node
is not nil:
if node != nil {
result.schedNodeInfo.SetNode(node)
}
... invoking n.Node().DeepCopy(),
inline like this might be (theoretically) subject to a nil pointer exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, you can ignore this comment
// Node returns overall information about this node.
func (n *NodeInfo) Node() *v1.Node {
if n == nil {
return nil
}
return n.node
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should be okay because DeepCopy()
gracefully handles nil receivers. Added a comment for clarity and more test cases to cover this scenario.
id := nodeGroup.Id() | ||
baseNodeInfo, err := nodeGroup.TemplateNodeInfo() | ||
if err != nil { | ||
return nil, errors.ToAutoscalerError(errors.CloudProviderError, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this error response too generic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, relying on the CloudProvider error being descriptive enough doesn't sound like a good idea. Added a prefix.
This also required making AddPrefix()
wrap the error instead of just modifying the message and implementing Unwrap()
so that we can use errors.Is()
to check if SanitizedTemplateNodeInfoFromNodeGroup
failed because of cloudprovider.ErrNotImplemented
. Added a separate commit for that to this PR, and changed the check in MixedTemplateNodeInfoProvider
from ==
to errors.Is()
.
// TemplateNodeInfoFromNodeGroupTemplate returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only | ||
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods). | ||
func TemplateNodeInfoFromNodeGroupTemplate(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) { | ||
id := nodeGroup.Id() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to assign this to a var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, this must be a leftover from old code - removed.
return TemplateNodeInfoFromExampleNodeInfo(baseNodeInfo, id, daemonsets, true, taintConfig) | ||
} | ||
|
||
// TemplateNodeInfoFromExampleNodeInfo returns a template NodeInfo object based on a real example NodeInfo from the cluster. The template is sanitized, and only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I love the term "example" here. Would TemplateNodeInfoFromNode
or TemplateNodeInfoFromRealNode
or TemplateNodeInfoFromRealNodeInfo
work? Then we'd document like this
// TemplateNodeInfoFromNode returns a template NodeInfo object based on a NodeInfo from a real node on the cluster. The template is sanitized, and only
// We need to sanitize the node before determining the DS pods, since taints are checked there, and
// we might need to filter some out during sanitization.
sanitizedNode := sanitizeNodeInfo(realNode, newNodeNameBase, randSuffix, &taintConfig)
// No need to sanitize the expected pods again - they either come from sanitizedNode and were sanitized above,
etc.
I think my observation is that the word "example" suggests something non-real, mock object, something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point about "example" sugesting something not real, I haven't really thought about it this way. I went with Bartek's naming suggestions which are very similar to yours. WDYT?
func sanitizePod(pod *apiv1.Pod, nodeName, nameSuffix string) *apiv1.Pod { | ||
sanitizedPod := pod.DeepCopy() | ||
sanitizedPod.UID = uuid.NewUUID() | ||
sanitizedPod.Name = fmt.Sprintf("%s-%s", pod.Name, nameSuffix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is nice that we are introducing pod name sanitization, but I am not the fan of just adding a suffix here. As we are not constrained here, what would you think about doing something like just adding a random suffix always? This would also simplify the interface a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explained why I think consistent suffixes within a NodeInfo are important in the response to your comment on sanitizeNodeInfo
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I am convinced :)
return sanitizeNodeInfo(template, template.Node().Name, suffix, nil) | ||
} | ||
|
||
func sanitizeNodeInfo(nodeInfo *framework.NodeInfo, newNodeNameBase string, namesSuffix string, taintConfig *taints.TaintConfig) *framework.NodeInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this interface, mainly the newNodeNameBase and namesSuffix. It seems confusing and overworded, I need to fully understand the internal implementation to be able to get what it does to node and pod names. I think a better option would be to just go with newNodeName
and change the logic for pod names to not depend on the namesSuffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By interface do you mean the sanitizeNodeInfo
signature? I know it's not ideal, but the function is very short and local (I intentionally kept it unexported and left these details out from the public functions).
IMO having the suffix passed separately so that we can apply the same one to all of the names we're changing is very helpful for debugging. If you're looking at a template NodeInfo (e.g. in the CA debugging snapshot, or in a live debugger), you can easily see which pods were sanitized together with the Node (they have the same suffix as the Node), and distinguish them from "real" pods which got added later. This gets even more helpful when we start sanitizing DRA objects with a bunch of names that all need a suffix. I actually added this after losing an hour debugging a unit test and getting confused by the suffixes.
The local sanitizeNodeInfo
being slightly more confusing seems to me like a fair price to pay for the added debuggability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree these are very good arguments, this will improve stuff from the debugging perspective. +1 from me.
|
||
// FreshNodeInfoFromTemplateNodeInfo duplicates the provided template NodeInfo, returning a fresh NodeInfo that can be injected into the cluster snapshot. | ||
// The NodeInfo is sanitized (names, UIDs are changed, etc.), so that it can be injected along other copies created from the same template. | ||
func FreshNodeInfoFromTemplateNodeInfo(template *framework.NodeInfo, suffix string) *framework.NodeInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of naming of all the functions now. Fresh/Template/Example does not mean much, they just add to confusion. As the user of these I would still need to go to implementation and checkout what it does exactly, the interface is not great because of that.
I think if we restrict ourselves to few key-words ("Template", "Sanitized", "DeepCopy"), we would make it easier to understand what is going on.
My proposal:
# Template methods, best with the same interface [(nodeGroup)/(nodeInfo, ngName)[, DSs, forceDs, taintConfig
TemplateNodeInfoFromNodeGroupTemplate -> SanitizedTemplateNodeInfoFromNodeGroup
TemplateNodeInfoFromExampleNodeInfo -> SanitizedTemplateNodeInfoFromNodeInfo
# DeepCopy methods
FreshNodeInfoFromTemplateNodeInfo -> nodeInfo.SanitizedDeepCopy(nodeName)
nodeInfo.DeepCopy stays the same
What do you think? IMO it would be more self descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, "sanitize" is basically a recognized term in CA while "fresh" and "example" aren't. Went with your naming suggestions!
I'd prefer to keep all of the sanitization logic in a single place though (especially because otherwise we'd have to export sanitizeNodeInfo
which I want to avoid), so I renamed FreshNodeInfoFromTemplateNodeInfo
to NodeInfoSanitizedDeepCopy
instead of making it a NodeInfo method. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that while having nodeInfo.SanitizedDeepCopy
would be nice, it is better to keep it in the same place instead of splitting the sanitization into multiple places. LGTM.
…implement Unwrap() This allows using errors.Is() to check if an AutoscalerError wraps a sentinel error (e.g. cloudprovider.ErrNotImplemented) when a prefix is added to it.
89a5259
to
eb5bfd1
Compare
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate, and scheduler_utils.DeepCopyTemplateNode all had very similar logic for sanitizing and copying NodeInfos. They're all consolidated to one file in simulator, sharing common logic. DeepCopyNodeInfo is changed to be a framework.NodeInfo method. MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to correlate Nodes to scheduled pods, instead of using a live Pod lister. This means that the snapshot now has to be properly initialized in a bunch of tests.
eb5bfd1
to
eb26816
Compare
Changes LGTM, I've added hold but feel free to remove once you are comfortable to do so. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: BigDarkClown, towca The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@jackfrancis Just waiting for your LGTM now then |
/lgtm |
/hold cancel |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. There were multiple very similar utils related to copying and sanitizing NodeInfos scattered around the CA codebase. Instead of adding similar DRA handling to all of them separately, they're consolidated into a single location that will be later adapted to handle DRA.
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
The first commit in the PR is just a squash of #7466, and it shouldn't be a part of this review. The PR will be rebased on top of master after #7466 is merged.
This is intended to be a no-op refactor. It was extracted from #7350 after #7447, and #7466.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: