-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scheduling for config hash verficiation #203
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?
So the reason there was a config hash mismatch is that the behavioral Faucet config file was updated by Forch while Forch received a CONFIG_CHANGE event. So once there is a config hash mismatch, Faucet will reload the config file and send a new CONFIG_CHANGE event. If the hashes match, then Forch will clear the I think the key assumption is that Faucet will always reload the config file if it is different and emit a CONFIG_CHANGE event. I mean, there will be always another CONFIG_CHANGE event after a single hash mismatch till the hashes match... |
"there will be always another CONFIG_CHANGE event after a single hash
mismatch till the hashes match" -- how do you know that? Faucet itself
doesn't know about the mismatch, it just knows that the config changed, so
it won't categorically send another config change after a hash failure. The
sequence of events you cite is the reason for *this* mismatch, but isn't
the reason for the error check in the first place. It's a distributed
system, and so there should be some mechanism to ensure eventual
consistency in the face of weird behavior. What happens if there's some
calculation error for some unknown reason on the Faucet side, and Forch
doesn't update the config again b/c there was no additional change? You'll
get a CONFIG_CHANGE event, the hashes will mismatch, and then... nothing.
…On Tue, Oct 27, 2020 at 5:41 PM Yufeng Duan ***@***.***> wrote:
Still not sure about the intended algorithm. What happens if there's a
single hash mismatch event and then nothing else?
So the reason there was a config hash mismatch is that the behavioral
Faucet config file was updated by Forch while Forch received a
CONFIG_CHANGE event. So once there is a config hash mismatch, Faucet will
reload the config file and send a new CONFIG_CHANGE event. If the hashes
match, then Forch will clear the config_hash_clash_start_time. Otherwise,
it means Forch updated the config again, and config_hash_clash_start_time
won't be cleared until a CONFIG_CHANGE event is received that makes the
hashes match.
I think the key assumption is that Faucet will always reload the config
file if it is different and emit a CONFIG_CHANGE event. I mean, there will
be always another CONFIG_CHANGE event after a single hash mismatch till the
hashes match...
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#203 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIEPD4K5ZUPP44QEM3QSO3SM5SCXANCNFSM4S7W6S5Q>
.
|
Added timer which starts when Forch detects the every first hash mismatch. Also comparing the hash again when timer times out.. The reason is that while running Eng's test, it happened that Forch wrote behavioral config in between Faucet reloading. For example what happened was:
T2.
T3.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added timer which starts when Forch detects the every first hash mismatch.
Also comparing the hash again when timer times out.. The reason is that while running Eng's test, it happened that Forch wrote behavioral config in between Faucet reloading. For example what happened was:
T1.
- Faucet reloaded config version 1
- Forch received an L2_EXPIRE event
- Forch wrote out behavior config version 2 due to the expiry event
- Forch received a CONFIG_CHANGE event with hash version 1
T2.
- Forch recived learning event, wrote out behavior config version 1
T3.
- Faucet reloaded config version 1, did nothing because config is unchanged from its perspective
I think the right answer here is that the timeout should be reset when forch writes a new config. It's a valid thing for forch to continually write out new configs, in which case the timer likely shouldn't expire. It's the other extreme case (form only one event).
forch/forchestrator.py
Outdated
self._config_hash_clashed = False | ||
self._config_hash_clash_timeout_sec = ( | ||
self._config.event_client.config_hash_clash_timeout_sec or | ||
int(os.getenv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why env variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed..
forch/forchestrator.py
Outdated
assert config_hash == config_info['hashes'], 'config hash info does not match' | ||
|
||
if config_hash == config_info['hashes']: | ||
self._attempt_cancel_config_hash_clash_timer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in these cases I'd leave out the "attempt" -- it's not semantically meaningful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not using a seprate timer now.
forch/forchestrator.py
Outdated
def _attempt_start_config_hash_clash_timer(self): | ||
if self._config_hash_clash_timer and self._config_hash_clash_timer.is_alive(): | ||
return | ||
self._config_hash_clash_timer = threading.Timer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should reuse one of the existing heartbeat schedulers for this -- creating lots of little threads ends up being very messy in the long run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using _faucet_state_scheduler now..
forch/forchestrator.py
Outdated
self._faucet_collector.process_dataplane_config_change(timestamp, faucet_dps) | ||
|
||
def _attempt_start_config_hash_clash_timer(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be "start a timer" -- but rather just a mark of when the last config hash clash was
forch/forchestrator.py
Outdated
'Config hash does not match after %s seconds', self._config_hash_clash_timeout_sec) | ||
self._config_hash_clashed = True | ||
|
||
def _get_config_hash_clashed(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private getter is redundant -- just use the variable directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
forch/faucet_event_client.py
Outdated
@@ -212,9 +212,11 @@ def _dispatch_faucet_event(self, target, target_event): | |||
def _should_log_event(self, event): | |||
return event and os.getenv('FAUCET_EVENT_DEBUG') | |||
|
|||
def next_event(self, blocking=False): | |||
def next_event(self, get_config_hash_clashed, blocking=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not right -- the concept of a config hash shouldn't bleed down into the event client itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
I have run the test again and it seems that simply resetting the timer (setting clash_start_time to current time) could not solve the issue... The problem is that the timer is only cleared when there is a CONFIG_CHANGE event. In the test's case, there was never a new CONFIG_CHANGE event after the previous mismatch, as the config file got changed by Forch to the previous version in between the two Faucet reloading. I guess we need to compare the hashes when the timer times out, and clear the timer (setting clash_start_time to None) if the hashes match.. |
I'll look at the PR now, but the timer should be cleared *when forch writes
a new config. * So -- as long as forch *knows* it's updating the config
then there's no need for the error, it's only when forch *thinks* it's
settled and there's still an error do we need to flag the condition.
…On Wed, Oct 28, 2020 at 12:49 PM Yufeng Duan ***@***.***> wrote:
Added timer which starts when Forch detects the every first hash mismatch.
Also comparing the hash again when timer times out.. The reason is that
while running Eng's test, it happened that Forch wrote behavioral config in
between Faucet reloading. For example what happened was:
T1.
1. Faucet reloaded config version 1
2. Forch received an L2_EXPIRE event
3. Forch wrote out behavior config version 2 due to the expiry event
4. Forch received a CONFIG_CHANGE event with hash version 1
T2.
1. Forch recived learning event, wrote out behavior config version 1
T3.
1. Faucet reloaded config version 1, did nothing because config is
unchanged from its perspective
I think the right answer here is that the timeout should be reset when
forch writes a new config. It's a valid thing for forch to continually
write out new configs, in which case the timer likely shouldn't expire.
It's the other extreme case (form only one event).
I have run the test again and it seems that simply resetting the timer
(setting clash_start_time to current time) could not solve the issue... The
problem is that the timer is only cleared when there is a CONFIG_CHANGE
event. In the test's case, there was not a new CONFIG_CHANGE event, as the
file got changed by Forch to the previous version. I guess we need to
compare the hashes when the timer times out, and clear the timer (setting
clash_start_time to None) if the hashes match..
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#203 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIEPD6AIB6VV3IXM5DBAILSNBYTBANCNFSM4S7W6S5Q>
.
|
Ah, I see what you're saying now about how faucet doesn't send a new
update. Would something like this work, then, where when the time after
an intentional config change expires, the system compares the "last
received hash" with the "expected hash" -- and it's an error at that
point? Maybe that's getting back to what you were saying about needing to
recalculate at that point (really you just need to recalculate when it is
written out, but that can be done at the time of write or the time of
check). So basically the logic becomes:
1. Whenever forch writes a config, start/reset the timer.
2. After XXX seconds, check to see if the last received config hash
(from faucet) matches the last written config hash (from step #1). If
there's a mismatch, then error.
Does that work for the cases you've been tracking?
On Wed, Oct 28, 2020 at 1:21 PM Trevor Pering <[email protected]>
wrote:
… I'll look at the PR now, but the timer should be cleared *when forch
writes a new config. * So -- as long as forch *knows* it's updating
the config then there's no need for the error, it's only when forch
*thinks* it's settled and there's still an error do we need to flag the
condition.
On Wed, Oct 28, 2020 at 12:49 PM Yufeng Duan ***@***.***>
wrote:
> Added timer which starts when Forch detects the every first hash mismatch.
> Also comparing the hash again when timer times out.. The reason is that
> while running Eng's test, it happened that Forch wrote behavioral config in
> between Faucet reloading. For example what happened was:
> T1.
>
> 1. Faucet reloaded config version 1
> 2. Forch received an L2_EXPIRE event
> 3. Forch wrote out behavior config version 2 due to the expiry event
> 4. Forch received a CONFIG_CHANGE event with hash version 1
>
> T2.
>
> 1. Forch recived learning event, wrote out behavior config version 1
>
> T3.
>
> 1. Faucet reloaded config version 1, did nothing because config is
> unchanged from its perspective
>
> I think the right answer here is that the timeout should be reset when
> forch writes a new config. It's a valid thing for forch to continually
> write out new configs, in which case the timer likely shouldn't expire.
> It's the other extreme case (form only one event).
>
> I have run the test again and it seems that simply resetting the timer
> (setting clash_start_time to current time) could not solve the issue... The
> problem is that the timer is only cleared when there is a CONFIG_CHANGE
> event. In the test's case, there was not a new CONFIG_CHANGE event, as the
> file got changed by Forch to the previous version. I guess we need to
> compare the hashes when the timer times out, and clear the timer (setting
> clash_start_time to None) if the hashes match..
>
> —
> You are receiving this because your review was requested.
> Reply to this email directly, view it on GitHub
> <#203 (comment)>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AAIEPD6AIB6VV3IXM5DBAILSNBYTBANCNFSM4S7W6S5Q>
> .
>
|
Yes, I think that should work. So we do not need to start the timer when we detect a mismatch anymore, right? The new logic seems to be: "After Forch writes a config, within XXX seconds, we expect Faucet to reload it and apply the changes, and the new hashes should match. Or, if Faucet sees the config as the same with the previous one, the hash of the file should match the last hash sent out by Faucet". I will go ahead updating the PR.. |
Correct... The triggering event is the config write, not the bad hash. Also
covers the case where forch writes a config, but there is never a response.
…On Wed, Oct 28, 2020, 2:01 PM Yufeng Duan ***@***.***> wrote:
something like this work, then, where when the time after an intentional
config chang
Yes, I think that should work. So we do not need to start the timer when
we detect a mismatch anymore, right? The new logic seems to be: "After
Forch writes a config, within XXX seconds, we expect Faucet to reload it
and apply the changes, and the new hashes should match. Or, if Faucet sees
the config as the same with the previous one, the hash of the file should
match the last hash sent out by Faucet".
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#203 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIEPD2YZTTR263R5MSB5DTSNCBDTANCNFSM4S7W6S5Q>
.
|
updated the PR. PTAL.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update title (no longer max retry times)
Thanks for working on this -- I know it was a lot of back and forth, but in the end I think we got to the right solution that's more robust than before (e.g., the case of "no config event at all" was not checked previously!)
* Scheduling config hash verification each time Forch writes behavioral config
Making Forch retry config hash match checking and not raise an exception until a limit is reached.