Add scheduling for config hash verficiation #203

pomodorox · 2020-10-26T17:56:59Z

Makes Forch retry the config hash match check and raise an exception only once a limit is reached.

forch/forchestrator.py

proto/forch_configuration.proto

grafnu

Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?


pomodorox · 2020-10-28T00:41:03Z

> Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?

The config hash mismatch happened because Forch updated the behavioral Faucet config file at around the same time it received a CONFIG_CHANGE event. Once there is a config hash mismatch, Faucet will reload the config file and send a new CONFIG_CHANGE event. If the hashes then match, Forch clears config_hash_clash_start_time. Otherwise, Forch has updated the config again, and config_hash_clash_start_time won't be cleared until a CONFIG_CHANGE event arrives that makes the hashes match.

I think the key assumption is that Faucet will always reload the config file if it is different and emit a CONFIG_CHANGE event. I mean, there will be always another CONFIG_CHANGE event after a single hash mismatch till the hashes match...

grafnu · 2020-10-28T01:03:38Z

"there will be always another CONFIG_CHANGE event after a single hash mismatch till the hashes match" -- how do you know that? Faucet itself doesn't know about the mismatch, it just knows that the config changed, so it won't categorically send another config change after a hash failure. The sequence of events you cite is the reason for *this* mismatch, but isn't the reason for the error check in the first place. It's a distributed system, and so there should be some mechanism to ensure eventual consistency in the face of weird behavior. What happens if there's some calculation error for some unknown reason on the Faucet side, and Forch doesn't update the config again b/c there was no additional change? You'll get a CONFIG_CHANGE event, the hashes will mismatch, and then... nothing.


pomodorox · 2020-10-28T09:37:33Z

Added a timer that starts when Forch detects the very first hash mismatch.

The hash is also compared again when the timer times out. The reason is that while running Eng's test, Forch wrote the behavioral config in between Faucet reloads. For example, what happened was:

T1.

1. Faucet reloaded config version 1
2. Forch received an L2_EXPIRE event
3. Forch wrote out behavior config version 2 due to the expiry event
4. Forch received a CONFIG_CHANGE event with hash version 1

T2.

1. Forch received a learning event and wrote out behavior config version 1

T3.

1. Faucet reloaded config version 1 and did nothing because the config was unchanged from its perspective
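The T1 to T3 sequence can be replayed in a small model to see why a timer that starts on a mismatch and clears only on a matching CONFIG_CHANGE event stays armed. This is only a sketch under stated assumptions: `MismatchTimerModel`, its methods, and the use of SHA-256 are hypothetical stand-ins, not Forch's or Faucet's actual code or hashing scheme.

```python
import hashlib


def config_hash(text):
    """Stand-in hash; the real Faucet hash scheme is not assumed here."""
    return hashlib.sha256(text.encode()).hexdigest()


class MismatchTimerModel:
    """Hypothetical model: the timer starts on a mismatch and is
    cleared only by a CONFIG_CHANGE event whose hash matches."""

    def __init__(self):
        self.written = None            # hash of the last config Forch wrote
        self.clash_start_time = None   # None means the timer is not armed

    def write_config(self, text):
        self.written = config_hash(text)

    def on_config_change(self, reported_hash, now):
        if reported_hash == self.written:
            self.clash_start_time = None
        elif self.clash_start_time is None:
            self.clash_start_time = now


V1, V2 = "config version 1", "config version 2"
forch = MismatchTimerModel()
forch.write_config(V1)

# T1: Faucet reloads v1, but Forch writes v2 before CONFIG_CHANGE(v1) arrives.
forch.write_config(V2)
forch.on_config_change(config_hash(V1), now=1.0)

# T2: a learning event makes Forch write v1 again.
forch.write_config(V1)

# T3: Faucet reloads, sees v1 unchanged, and emits no event at all.
timer_still_armed = forch.clash_start_time is not None
hashes_agree = forch.written == config_hash(V1)
print(timer_still_armed, hashes_agree)
```

In this model the file on disk and Faucet's loaded config agree at the end, yet nothing ever clears the timer, which is the situation the extra comparison at timeout is meant to handle.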

grafnu

I think the right answer here is that the timeout should be reset when forch writes a new config. It's valid for forch to continually write out new configs, in which case the timer likely shouldn't expire. It's the other extreme case (from only one event).

grafnu · 2020-10-28T14:44:52Z

forch/forchestrator.py

+        self._config_hash_clashed = False
+        self._config_hash_clash_timeout_sec = (
+            self._config.event_client.config_hash_clash_timeout_sec or
+            int(os.getenv(


why env variable?

grafnu · 2020-10-28T14:45:42Z

forch/forchestrator.py

-        assert config_hash == config_info['hashes'], 'config hash info does not match'
+
+        if config_hash == config_info['hashes']:
+            self._attempt_cancel_config_hash_clash_timer()


I think in these cases I'd leave out the "attempt" -- it's not semantically meaningful

Not using a separate timer now.

grafnu · 2020-10-28T14:47:05Z

forch/forchestrator.py

+    def _attempt_start_config_hash_clash_timer(self):
+        if self._config_hash_clash_timer and self._config_hash_clash_timer.is_alive():
+            return
+        self._config_hash_clash_timer = threading.Timer(


you should reuse one of the existing heartbeat schedulers for this -- creating lots of little threads ends up being very messy in the long run.

Using _faucet_state_scheduler now.

grafnu · 2020-10-28T14:47:33Z

forch/forchestrator.py

        self._faucet_collector.process_dataplane_config_change(timestamp, faucet_dps)

+    def _attempt_start_config_hash_clash_timer(self):


this shouldn't be "start a timer" -- but rather just a mark of when the last config hash clash was
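A minimal sketch of that suggestion, assuming hypothetical names throughout (`HashClashMarker`, `update`, `timed_out`): instead of spawning a `threading.Timer`, the code records when the clash began, and a periodic job that already exists, such as a heartbeat, asks whether the timeout has elapsed.

```python
import time


class HashClashMarker:
    """Hypothetical helper: remembers when a hash clash started
    rather than running its own timer thread."""

    def __init__(self, timeout_sec, clock=time.monotonic):
        self.timeout_sec = timeout_sec
        self._clock = clock              # injectable so tests can fake time
        self.clash_start_time = None

    def update(self, hashes_match):
        """Call on every hash comparison."""
        if hashes_match:
            self.clash_start_time = None
        elif self.clash_start_time is None:
            # Only the first mismatch in a run sets the mark.
            self.clash_start_time = self._clock()

    def timed_out(self):
        """Call from an existing periodic job."""
        if self.clash_start_time is None:
            return False
        return self._clock() - self.clash_start_time >= self.timeout_sec
```

Because the state is just a timestamp, no extra thread has to be started, cancelled, or checked for liveness.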

grafnu · 2020-10-28T14:48:12Z

forch/forchestrator.py

+                'Config hash does not match after %s seconds', self._config_hash_clash_timeout_sec)
+            self._config_hash_clashed = True
+
+    def _get_config_hash_clashed(self):


private getter is redundant -- just use the variable directly

grafnu · 2020-10-28T14:49:16Z

forch/faucet_event_client.py

@@ -212,9 +212,11 @@ def _dispatch_faucet_event(self, target, target_event):
    def _should_log_event(self, event):
        return event and os.getenv('FAUCET_EVENT_DEBUG')

-    def next_event(self, blocking=False):
+    def next_event(self, get_config_hash_clashed, blocking=False):


this is not right -- the concept of a config hash shouldn't bleed down into the event client itself

pomodorox · 2020-10-28T19:48:49Z

I ran the test again, and simply resetting the timer (setting clash_start_time to the current time) does not solve the issue. The problem is that the timer is only cleared when there is a CONFIG_CHANGE event. In the test's case, there was never a new CONFIG_CHANGE event after the previous mismatch, because Forch changed the config file back to the previous version in between the two Faucet reloads. I guess we need to compare the hashes when the timer times out, and clear the timer (setting clash_start_time to None) if the hashes match.

grafnu · 2020-10-28T20:22:09Z

I'll look at the PR now, but the timer should be cleared *when forch writes a new config*. So -- as long as forch *knows* it's updating the config then there's no need for the error; it's only when forch *thinks* it's settled and there's still an error that we need to flag the condition.


grafnu · 2020-10-28T20:30:33Z

Ah, I see what you're saying now about how faucet doesn't send a new update. Would something like this work, then, where when the time after an intentional config change expires, the system compares the "last received hash" with the "expected hash" -- and it's an error at that point? Maybe that's getting back to what you were saying about needing to recalculate at that point (really you just need to recalculate when it is written out, but that can be done at the time of write or the time of check). So basically the logic becomes:

1. Whenever forch writes a config, start/reset the timer.
2. After XXX seconds, check to see if the last received config hash (from faucet) matches the last written config hash (from step #1). If there's a mismatch, then error.

Does that work for the cases you've been tracking?
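The two steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: `ConfigHashVerifier`, its method names, and the string status values are all hypothetical, and real code would more likely log and raise on a mismatch.

```python
import time


class ConfigHashVerifier:
    """Hypothetical sketch: a config write arms the check, and the
    check compares the last received hash with the last written one."""

    def __init__(self, timeout_sec, clock=time.monotonic):
        self._timeout_sec = timeout_sec
        self._clock = clock              # injectable so tests can fake time
        self._last_written_hash = None
        self._last_received_hash = None
        self._write_time = None          # None means no check is pending

    def on_config_written(self, config_hash):
        # Step 1: every write starts or resets the timer.
        self._last_written_hash = config_hash
        self._write_time = self._clock()

    def on_config_change_event(self, config_hash):
        # Just remember what Faucet last reported; no timer logic here.
        self._last_received_hash = config_hash

    def check(self):
        """Step 2: call periodically, e.g. from an existing scheduler."""
        if self._write_time is None:
            return "idle"
        if self._clock() - self._write_time < self._timeout_sec:
            return "pending"
        self._write_time = None
        if self._last_received_hash == self._last_written_hash:
            return "ok"
        return "mismatch"
```

With this shape, a run of back-to-back writes keeps pushing the deadline out, the T1 to T3 sequence settles as `"ok"` because the last received and last written hashes are both version 1, and a write that never gets any response from Faucet ends in `"mismatch"`.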


pomodorox · 2020-10-28T21:01:30Z

> something like this work, then, where when the time after an intentional config chang

Yes, I think that should work. So we no longer need to start the timer when we detect a mismatch, right? The new logic seems to be: "After Forch writes a config, we expect Faucet to reload it and apply the changes within XXX seconds, and the new hashes should match. Or, if Faucet sees the config as identical to the previous one, the hash of the file should match the last hash sent out by Faucet".

I will go ahead and update the PR.

grafnu · 2020-10-28T21:04:16Z

Correct... The triggering event is the config write, not the bad hash. Also covers the case where forch writes a config, but there is never a response.


pomodorox · 2020-10-28T23:09:24Z

> Correct... The triggering event is the config write, not the bad hash. Also covers the case where forch writes a config, but there is never a response.

Updated the PR. PTAL.

grafnu

Please update title (no longer max retry times)

Thanks for working on this -- I know it was a lot of back and forth, but in the end I think we got to the right solution that's more robust than before (e.g., the case of "no config event at all" was not checked previously!)

* Scheduling config hash verification each time Forch writes behavioral config

pomodorox added 3 commits October 26, 2020 10:43

Set config hash max retry times

d886e49

fix assert

c4a9f20

add logging

2c9d8a9

pomodorox requested a review from grafnu October 26, 2020 17:57

grafnu reviewed Oct 26, 2020

pomodorox added 3 commits October 26, 2020 17:27

changes

10504d9

build proto

8a84ca4

add logging

905015e

pomodorox requested a review from grafnu October 27, 2020 01:04

pomodorox added 2 commits October 26, 2020 18:05

fix testing

d32b7fc

build proto

459b33f

grafnu reviewed Oct 27, 2020

pomodorox added 5 commits October 26, 2020 23:45

change config hash retry to cooling time

71f0697

build proto

829421d

changes

4df5dad

lint

b0b8940

update logging

43c1b55

pomodorox force-pushed the confighash branch from e928741 to 43c1b55 Compare October 27, 2020 08:02

pomodorox requested a review from grafnu October 27, 2020 08:13

grafnu reviewed Oct 27, 2020

pomodorox added 3 commits October 27, 2020 10:12

update naming

670eb94

lint

8a0ec03

build proto

846a554

pomodorox requested a review from grafnu October 27, 2020 17:16

fix logging

d01f5fb

grafnu reviewed Oct 27, 2020

pomodorox added 2 commits October 27, 2020 23:03

add timer

6e3d45e

fix

267a9f8

pomodorox force-pushed the confighash branch from 3cba4d9 to 267a9f8 Compare October 28, 2020 06:37

pomodorox added 4 commits October 28, 2020 00:29

raise exception for config hash clash in main loop

231b829

increase timeout

bdc75e1

raise exception in event client

04aff69

compare hash when timeout

d35f482

pomodorox added 2 commits October 28, 2020 02:38

lint

7d557d4

changes

09140c1

pomodorox requested a review from grafnu October 28, 2020 09:40

grafnu reviewed Oct 28, 2020

pomodorox added 2 commits October 28, 2020 11:25

use existing heartbeat scheduler

87c23ec

lint

27c2570

pomodorox requested a review from grafnu October 28, 2020 19:49

pomodorox added 3 commits October 28, 2020 15:42

change timer trigger to config writing

70f69d1

build proto

e227df6

clear config writting time

eab9235

grafnu approved these changes Oct 29, 2020

pomodorox changed the title ~~Set max retry times for config hash match check~~ Add scheduler for config hash verficiation Oct 29, 2020

pomodorox changed the title ~~Add scheduler for config hash verficiation~~ Add scheduling for config hash verficiation Oct 29, 2020

Merge branch 'master' into confighash

99633cf

pomodorox merged commit 7e9a5f3 into faucetsdn:master Oct 29, 2020

grafnu pushed a commit to grafnu/forch that referenced this pull request Oct 29, 2020

Add scheduling for config hash verficiation (faucetsdn#203)

55eae2a

* Scheduling config hash verification each time Forch writes behavioral config

pomodorox deleted the confighash branch November 19, 2020 03:55
