-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037
Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On paper I think this change looks good and it's certainly something we could quickly merge to solve the problem this addresses.
I thought this was going to be short review, but revisiting the VCL event logic I'm wondering whether the semantics should be clarified again. Before this change it looks like we are marking the VCL as cold and then propagating the cold events further down and I think that in general that's a sane behavior.
On the other hand it looks like we do the same for warm events: we make the VCL warm and then propagate the warm signal. I feel like it should be the other way around: once everything is warm, "commit" the warmup transaction at the VCL level. Of course if the VMODs and VBEs can't warm up without a warm VCL, then it is inappropriate.
Of course this is not something we can just do without making sure that we don't break the loose contract we currently have with VMODs. Maybe vcl_send_event()
should be in charge of the dispatch.
We could start by encoding the desired behavior with types:
static const struct vcl_event {
const char *name;
int forward; /* before < 0, never == 0, after > 0 */
unsigned method;
unsigned have_msg;
// ...
} vcl_events[] = {{
// ...
}};
static const struct vcl_event *
vcl_event(enum vcl_event_e ev)
{
// assert
return (&vcl_events[ev]);
}
And with something like this we could nail down the forward logic and semantics in one central place.
I looked into a similar question before adding the PR, and in general I agree that the current logic looks inconsistent: We mark the VCL warm before it is, and we mark it cold before it is. But running I think that any proposal to complicate this logic should be motivated by relevant use cases. For this reason and my expectation of canned worms, I have not thought about your proposal in more detail yet. |
I think the ordering should be reversed, but it's obviously something which is badly underdocumented. |
FTR, OKed by bugwash |
vcldir_retire() only sends a VCL_EVENT_COLD when the vcl is warm. In other words, it (rightly) asserts that a COLD event (if any) has alrady been posted when a director is deleted on a COLD vcl. Yet, when the director is deleted upon a COLD vmod event, this assertion was wrong, because the COLD events for directors were only posted after vmod events. Given that vmods do things like deleting directors, it appears (more) correct to post VDI COLD events before VMOD COLD events.
3c84e9e
to
c44bd67
Compare
To avoid VRT_AddDirector() tripping on the "Dynamic Backends can only be added to warm VCLs" assertion, we need to ensure that we hold a VCL reference until all resolver threads are stopped, such that the VCL temperature remains VCL_COOLING and VRT_AddDirector() returns NULL instead of panicking. Yet because of the wrong COLD event order in VC#4037, we run into a catch-22: - we can not return the reference until all directors received a cold event, which only happens after the VMOD cold event, - but we can neither keep the reference until they are all done, because that trips another assertion in mgt that a cold event needs to succeed. Because we do not want to keep this legacy in the main branch, we add the workarounds specifically for 7.4. Ref varnishcache/varnish-cache#4037 Closes #108
So it turns out, that - still - we can deadlock raching between VCL events and director deletions (see varnishcache/varnish-cache#4140). This commit was originally written to fix #108, but it now enables us to take a different route again: Do not pthread_join() lookup threads, but rather detach them and make sure _they_ return the reference before terminating. The remainder of the commit message is an edit in light of the new insights: Our lookup threads may continue for a bit after a VCL_COLD event has been received: dom_event() sets dom->status to DYNAMIC_ST_DONE while the lookup thread may be blocking in the resolver or updating a domain. So, lookup threads might try to create backends even after the VCL has gone cold, in which case VRT_AddDirector() would panic with "Dynamic Backends can only be added to warm VCLs" - _unless_ a reference is present on the VCL to make the VCL state change go through VCL_TEMP_COOLING, in which case VRT_AddDirector() just returns NULL. vmod_dynamic took a VCL reference since 99912bc back in May 2019, but before the merge of varnishcache/varnish-cache#4037 it did not work correctly, because the VCL reference was returned before all threads had terminated. The naive idea to solve this would be to take/release a VCL reference in each lookup thread (or when starting/stopping it), but references can only be taken from the CLI context (ASSERT_CLI() in VRT_VCL_Prevent_Discard). Also, to support use of the .resolve() type method from vcl_init{}, we needed to depart from the original concept of running resolver threads only for warm vcls. We get out of this situation by using a reference on top of the VCL reference, which we can take even before we have the VCL reference. The VCL reference is returned once all internal "referef"s are returned. Ref #108 Probably fixes #117 also
@delthas reported an interesting issue with vmod_dynamic: When the vmod receives a cold event, it might have some directors left to delete, and when it deletes them, they do not receive a cold event, and a follow up assertion triggers because a warm director is deleted.
I believe the root cause is actually a wrong ordering in Varnish-Cache:
vcldir_retire()
only sends aVCL_EVENT_COLD
when the vcl is warm. In other words, it (rightly, IMHO) asserts that a COLD event (if any) has alrady been posted when a director is deleted on a COLD vcl.Yet, when the director is deleted upon a COLD vmod event, this assertion was wrong, because the COLD events for directors were only posted after vmod events.
Given that vmods do things like deleting directors, it appears (more) correct to post VDI COLD events before VMOD COLD events.