-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-9233, RSDK-8904 - only call updateWeakDependents once per reconfiguration level #4585
Conversation
Mostly looks good. I'd like to get more detail on the expected behavior written down somewhere. So if we have a robot config where:
For the following scenarios:
What's the order of the following events:
|
added tests and updated comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, some initial comments.
Co-authored-by: Benjamin Rewis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM assuming tests pass; thanks!
// (that also has an explicit dependency on one of the components). | ||
// The following scenario is expected: | ||
// 1) The two bases will configure first in parallel (which will each bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We run completeConfig at the end of each "level" of reconfigure? At the end of each resource?
I always assumed "complete" inferred the last step before localrobot.Reconfigure
returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking more closely, I'm not sure what @cheukt means by "The completeConfig loop." completeConfig
is, indeed, one of the final steps of Reconfigure/reconfigure
for a local robot. completeConfig
also runs every 5s if any resources are still in NodeStateConfiguring
or NodeStateUnhealthy
.
With @cheukt's changes, updateWeakDependents
no longer runs as part of getDependencies
. It now runs in the following situations:
- Every 5s when any changes in remote resources or resources are still in
NodeStateConfiguring
/NodeStateUnhealthy
- Right after initial construction
- Upon orphaned resource removal
- After
completeConfig
inreconfigure
- Potentially at the end of each level in
completeConfig
a. This one is new and only happens when "any resources in the level needs reconfiguration and depends on weak dependents, and the resource graph has updated since the last time weak dependents were updated"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned offline that I was worried one of these 5 updateWeakDependents
calls might be unnecessary. Now, I can't convince myself that we can remove any of them 😆 . In any case, let's not worry about the other updateWeakDependents
callsites in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly makes sense to me, thanks! Maybe I asked this question in the wrong test. I think the case I've asked you to describe is when base1
strongly depends on weak1
. And in this test was just have weak1
strongly and weakly depending on base1
.
Potentially at the end of each level in completeConfig
That reads to me as:
- First we reconfigure base1+base2
- Then we reconfigure weak1
Which sounds fine because we want to update weak1 because things it depended on changed.
But then the sub-bullet says:
a. This one is new and only happens when "any resources in the level needs reconfiguration and depends on weak dependents, and the resource graph has updated since the last time weak dependents were updated"
Editorializing to make sure we're on the same page:
any resources in the level needs reconfiguration (e.g:
base1
) and depends on weak dependents (e.g:weak1
), and the resource graph has updated since the last time weak dependents (e.g:base1
) were updated
That to me reads as we're reconfiguring the weak dependents because something else depends on them. Normally we want to reconfigure "leaf nodes" first, then work our way up (or down?) to the things that depended on those leaf nodes. But my interpretation of this algorithm is that we're intentionally updating things in the opposite order that we normally do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially at the end of each level in completeConfig
is inaccurate based on the code - I'm pretty sure updateWeakDependents
is being run in the beginning of the loop, and that is the intention because of the case Dan is describing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure updateWeakDependents is being run in the beginning of the loop, and that is the intention because of the case Dan is describing
You're referencing this call to updateWeakDependents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is linking correctly, but I think we have the same idea. https://github.com/viamrobotics/rdk/pull/4585/files#diff-77fd7ff51c1b7804c67c4df0a9a672ea0e3842b4a54bbd8ef81af82ac95c4f22R657
// The following scenario is expected: | ||
// 1) The two bases will configure first in parallel (which will each bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "dependencies on weak dependents" mean "explicit/"strong" dependencies on weak dependents"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I interpreted this as "resources that have an API/model pair with weak dependencies." I believe @cheukt is using the words "weak dependents" to mean resources upon which other resources weakly depend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a suggested change on your earlier in-line comment to potentially change the language here and elsewhere for %s:dependencies on weak dependents/has weak dependencies/g
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my comment is meant to convey what @dgottlieb is saying. I'm using "weak dependents" to mean "resources with weak dependencies" but can see why it's confusing. Can't really think of a better name though, but I also feel that "resources with dependencies that have weak dependencies" is also just as confusing. Open to other phrasing!
// 1) The two bases will configure first in parallel (which will each bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think on
-> and
? ("checking the logical clock on")
Which conditions fail? I think the "last round" is still at 0 and the "clock" is at 2 now? So this would fail because we haven't initialized weak1
yet? ("There are no components with weak dependencies?")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think on -> and?
Agree, it should be and
.
Which conditions fail?
Failure (no update of weak dependents from @cheukt 's new updateWeakDependents
call in completeConfig
) occurs when the "last round" is GTE the clock value or when no resources have weak dependencies.
I think the "last round" is still at 0 and the "clock" is at 2 now?
Yep.
So this would fail because we haven't initialized weak1 yet? ("There are no components with weak dependencies?")
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this would fail because we haven't initialized weak1 yet? ("There are no components with weak dependencies?")
No, this failed because there are no dependencies on weak dependents in level 2, e.g. nothing depends on weak1
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. | ||
// 3) weak1 will configure. | ||
// 4) At the end of reconfiguration, weak1 will reconfigure once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logical clock will be at 3 now? And last round is still 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logical clock will be at 3 now?
4 now.
And last round is still 0?
Correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last round is 4 - adding a check and additional documentation to clarify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm definitely misunderstanding last round; will need some clarification on its value offline.
|
||
// Reconfigure base2 | ||
// The following scenario is expected: | ||
// 1) The base2 will reconfigure (which will bump the logical clock). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would bet it's 7 at this point. I'll stop answering these ones, but we definitely need assertions on the logical clock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the assertions
test.That(t, weak1.reconfigCount, test.ShouldEqual, 2) | ||
|
||
// Reconfigure base2 | ||
// The following scenario is expected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the "last weak dependency round" here? And when does it get bumped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added language around each scenario about the expected last weak dependency round
test.That(t, weakCfg3.Ensure(false, logger), test.ShouldBeNil) | ||
robot.Reconfigure(context.Background(), &weakCfg3) | ||
|
||
res, err = robot.ResourceByName(weak1Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above -- ideally changed to weakRes
.
test.That(t, err, test.ShouldBeNil) | ||
test.That(t, weak1.reconfigCount, test.ShouldEqual, 3) | ||
|
||
// check that getting dependencies for either resource does not increase reconfigCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend changing "getting dependencies" to "calling getDependencies
". And "does not increase reconfigCount" to "does not change weak1.reconfigCount
".
lRobot := robot.(*localRobot) | ||
node, ok := lRobot.manager.resources.Node(weak1Name) | ||
test.That(t, ok, test.ShouldBeTrue) | ||
lRobot.getDependencies(weak1Name, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "getting dependencies for either resource does not".
I was assuming that'd be "base1" and "base2". But instead we're checking "weak1" and "base1". I can't tell if those cases have specific logic we're trying to exercise? Or if "base2" is considered redundant? But I'm not sure why it would be. Given it's different from base1 in that it is not an explicit dependency for weak1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the infinite loop issue happened because getDependencies ended up calling reconfig on weak1 which chained infinitely. this check is basically a check that getDependencies do not have side effects on the resource graph, and the tests here are already set up. Can move to a different test if that makes more sense
robot/impl/resource_manager.go
Outdated
@@ -629,6 +629,41 @@ func (manager *resourceManager) completeConfig( | |||
levels := manager.resources.ReverseTopologicalSortInLevels() | |||
timeout := rutils.GetResourceConfigurationTimeout(manager.logger) | |||
for _, resourceNames := range levels { | |||
// if any resources in the level needs reconfiguration and depends on weak dependents, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if any resources in the level needs reconfiguration and depends on weak dependents, | |
// if any resources in the level need reconfiguration and have weak dependencies, |
The phrase "depend on weak dependents" is pretty confusing to me. Is my rephrasing there accurate with what you're meaning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not accurate - what I mean is if any resource in the level need reconfiguration and depends on a resource that have weak dependencies. So if the resource depends on something that would be affected by updateWeakDependents.
I ended up using the same phrasing as the function name, but definitely can be confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying; I see now you're using GetAllParentsOf
on resources in the level and then checking whether those parents (resources that depend on the original resource) has weak dependencies. Now I'm more confused on how we arrived on that being part of the requirement for updateWeakDependents
; let's talk about in the meeting later.
// (that also has an explicit dependency on one of the components). | ||
// The following scenario is expected: | ||
// 1) The two bases will configure first in parallel (which will each bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned offline that I was worried one of these 5 updateWeakDependents
calls might be unnecessary. Now, I can't convince myself that we can remove any of them 😆 . In any case, let's not worry about the other updateWeakDependents
callsites in this PR.
|
||
// Reconfigure base1 | ||
// The following scenario is expected: | ||
// 1) The base1 will reconfigure (which will bump the logical clock). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To 5.
// 1) The base1 will reconfigure (which will bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, too, @cheukt ; shouldn't the check pass? As @dgottlieb pointed out, it definitely looks like weak1
gets a second Reconfigure
... Step (4) here therefore sounds incorrect.
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. | ||
// 3) weak1 will not reconfigure, as base1 was not newly built or error during reconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why (2) fails (if it does,) and I do not think the "newly built" or "reconfig error" pieces here have to do with it. Agree with @dgottlieb 's rewording, too: something like weak1 will not reconfigure, as base1 was neither newly built nor errored during reconfiguration.
// We need the following `robot.Reconfigure` to call `Reconfigure` on this `base1` | ||
// component. We change the `Attributes` field from the previous (nil) value to | ||
// accomplish that. | ||
Attributes: rutils.AttributeMap{"version": 1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would work, too, but I have no strong opinion on what you currently have vs. @dgottlieb 's suggestion.
|
||
// Reconfigure base2 | ||
// The following scenario is expected: | ||
// 1) The base2 will reconfigure (which will bump the logical clock). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would bet it's 7 at this point. I'll stop answering these ones, but we definitely need assertions on the logical clock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I addressed everything, let me know if there are further qs
weak1, err := resource.AsType[*someTypeWithWeakAndStrongDeps](res) | ||
test.That(t, err, test.ShouldBeNil) | ||
// Assert that the weak dependency was tracked. | ||
test.That(t, weak1.resources, test.ShouldHaveLength, 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is a bug that the weak dependency system creates and I'm not looking to fix here. weak1 will be called a few times during reconfiguration. if reconfigure is called during the "regular" reconfiguration loop (the one with level), the reconfigure will only have the strong dependency passed in. during updateWeakDependents however, we will call it with both bases . Because we do make sure to call updateWeakDependents before we exit out of reconfiguration, the list of dependencies should be 2.
Edit: the paragraph above is wrong. We do grab both weak and strong dependencies and pass them in at the same time - we dedupe using a map of [resource.Name]{resource}
// 1) The base1 will reconfigure (which will bump the logical clock). | ||
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding additional words to explain, but basically these checks fail because there is nothing that depends on weak dependents (ex: weak1) in the next level
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. | ||
// 3) weak1 will not reconfigure, as base1 was not newly built or error during reconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
// checking the logical clock on also whether there are dependencies on weak dependents. | ||
// The check will fail. | ||
// 3) weak1 will not reconfigure, as base1 was not newly built or error during reconfiguration. | ||
// 4) At the end of reconfiguration, weak1 will reconfigure once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updateWeakDependents will reconfigure weak1. adding language to clarify
|
||
// Reconfigure base2 | ||
// The following scenario is expected: | ||
// 1) The base2 will reconfigure (which will bump the logical clock). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the assertions
lRobot := robot.(*localRobot) | ||
node, ok := lRobot.manager.resources.Node(weak1Name) | ||
test.That(t, ok, test.ShouldBeTrue) | ||
lRobot.getDependencies(weak1Name, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the infinite loop issue happened because getDependencies ended up calling reconfig on weak1 which chained infinitely. this check is basically a check that getDependencies do not have side effects on the resource graph, and the tests here are already set up. Can move to a different test if that makes more sense
// 2) The completeConfig loop will check to see if there is a need to updateWeakDependents by | ||
// checking the logical clock and also whether there are dependencies on weak dependents. | ||
// The check will fail. | ||
// 3) weak1 will not reconfigure, as base1 was not newly built or error during reconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did intend for the base1 bit there, since I wanted to give a reason as to why weak1 did not reconfigure. Can remove if it's confusing
test.That(t, weak1.reconfigCount, test.ShouldEqual, 2) | ||
|
||
// Reconfigure base2 | ||
// The following scenario is expected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added language around each scenario about the expected last weak dependency round
// We need the following `robot.Reconfigure` to call `Reconfigure` on this `base1` | ||
// component. We change the `Attributes` field from the previous (nil) value to | ||
// accomplish that. | ||
Attributes: rutils.AttributeMap{"version": 1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
test.That(t, err, test.ShouldBeNil) | ||
weak1, err = resource.AsType[*someTypeWithWeakAndStrongDeps](res) | ||
test.That(t, err, test.ShouldBeNil) | ||
test.That(t, weak1.reconfigCount, test.ShouldEqual, 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we run updateWeakDependents at the end of reconfiguration, which is why this has been bumped. let me know if the comments communicate the story better now.
Since this PR is pretty noisy already, I assume you two are alright with continuing to discuss the weak dependency system in person later this week? |
test.That(t, err, test.ShouldBeNil) | ||
test.That(t, weak1.reconfigCount, test.ShouldEqual, 2) | ||
|
||
test.That(t, lRobot.lastWeakDependentsRound.Load(), test.ShouldEqual, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So reconfiguring base1
calls SwapResource
and bumps the clock. But "weak reconfiguring" weak1
does not result in a SwapResource
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we don't call SwapResource when "weak reconfiguring."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we don't call SwapResource when "weak reconfiguring."
Why do we not call SwapResource
when weak reconfiguring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we don't want the logical clock to increase when weak reconfiguring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
responded, @dgottlieb you said you had some changes ready so did not make further doc changes yet
revamped documentation once more, hopefully things are a bit clearer now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the state of this PR now? The current comments LGTM and I answered my remaining question (I think.)
test.That(t, err, test.ShouldBeNil) | ||
test.That(t, weak1.reconfigCount, test.ShouldEqual, 2) | ||
|
||
test.That(t, lRobot.lastWeakDependentsRound.Load(), test.ShouldEqual, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we don't want the logical clock to increase when weak reconfiguring.
Bug:
r.lastWeakDependentsRound.Load() <= node.UpdatedAt()
, meaning this will call updateWeakDependents in an infinite loopFix:
Since we can't guarantee that any resource depending on weak dependents will get "the most updated" weak dependent anyway, perhaps we should just guarantee that those weak dependents will be as updated as the last reconfiguration level