Add scheduling for config hash verficiation #203

Merged
merged 31 commits into from
Oct 29, 2020

Conversation

pomodorox
Contributor

Make Forch retry the config hash match check and raise an exception only after a limit is reached.

@pomodorox pomodorox requested a review from grafnu October 26, 2020 17:57
@pomodorox pomodorox requested a review from grafnu October 27, 2020 01:04
Collaborator

@grafnu grafnu left a comment

Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?

@pomodorox pomodorox requested a review from grafnu October 27, 2020 17:16
Collaborator

@grafnu grafnu left a comment

Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?

@pomodorox
Contributor Author

Still not sure about the intended algorithm. What happens if there's a single hash mismatch event and then nothing else?

So the reason there was a config hash mismatch is that the behavioral Faucet config file was updated by Forch while Forch received a CONFIG_CHANGE event. So once there is a config hash mismatch, Faucet will reload the config file and send a new CONFIG_CHANGE event. If the hashes match, then Forch will clear the config_hash_clash_start_time. Otherwise, it means Forch updated the config again, and config_hash_clash_start_time won't be cleared until a CONFIG_CHANGE event is received that makes the hashes match.

I think the key assumption is that Faucet will always reload the config file if it is different and emit a CONFIG_CHANGE event. In other words, after a single hash mismatch there will always be another CONFIG_CHANGE event, until the hashes match.
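A minimal sketch of the event-driven check described above, under stated assumptions: `HashClashTracker`, `on_config_change`, and the injectable `clock` are hypothetical names for illustration, and only `config_hash_clash_start_time` comes from this thread. It is not the code in the PR.

```python
import time


class HashClashTracker:
    """Hypothetical helper; only config_hash_clash_start_time is named in this thread."""

    def __init__(self, timeout_sec, clock=time.time):
        self.timeout_sec = timeout_sec
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.config_hash_clash_start_time = None

    def on_config_change(self, file_hash, event_hash):
        """Runs only when a CONFIG_CHANGE event arrives."""
        if file_hash == event_hash:
            self.config_hash_clash_start_time = None  # hashes agree, clear the clash
            return
        now = self.clock()
        if self.config_hash_clash_start_time is None:
            self.config_hash_clash_start_time = now  # first mismatch opens the window
        elif now - self.config_hash_clash_start_time > self.timeout_sec:
            raise RuntimeError('config hash info does not match')
```

Because the comparison runs only inside `on_config_change`, a single mismatch followed by no further event is neither cleared nor reported, which is the case the review question asks about.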

@grafnu
Collaborator

grafnu commented Oct 28, 2020 via email

@pomodorox
Contributor Author

Added a timer which starts when Forch detects the very first hash mismatch.

Also comparing the hashes again when the timer times out. The reason is that while running Eng's test, Forch wrote the behavioral config in between Faucet reloads. For example, what happened was:
T1.

  1. Faucet reloaded config version 1
  2. Forch received an L2_EXPIRE event
  3. Forch wrote out behavior config version 2 due to the expiry event
  4. Forch received a CONFIG_CHANGE event with hash version 1

T2.

  1. Forch received a learning event and wrote out behavior config version 1

T3.

  1. Faucet reloaded config version 1, did nothing because config is unchanged from its perspective
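The timeline above can be replayed with a small simulation. This is only a sketch: `FakeFaucet`, `config_hash`, and the use of SHA-256 are assumptions made for illustration and say nothing about how Faucet actually computes or reports its hashes.

```python
import hashlib


def config_hash(text):
    """Stand-in hash of a config file's contents (the real hash scheme is not shown here)."""
    return hashlib.sha256(text.encode()).hexdigest()


class FakeFaucet:
    """Hypothetical model: a reload emits a hash only when the file content changed."""

    def __init__(self):
        self.loaded_hash = None

    def reload(self, file_text):
        new_hash = config_hash(file_text)
        if new_hash == self.loaded_hash:
            return None  # unchanged from Faucet's perspective, no CONFIG_CHANGE
        self.loaded_hash = new_hash
        return new_hash  # hash carried by the CONFIG_CHANGE event


faucet = FakeFaucet()
events = []

behavioral_file = 'version 1'
events.append(faucet.reload(behavioral_file))   # T1.1: Faucet reloads version 1
behavioral_file = 'version 2'                   # T1.3: Forch writes version 2
mismatch_at_t1 = events[-1] != config_hash(behavioral_file)  # T1.4: event hash is version 1

behavioral_file = 'version 1'                   # T2.1: Forch writes version 1 again
events.append(faucet.reload(behavioral_file))   # T3.1: nothing emitted

print(mismatch_at_t1, events[-1])
```

After T3 the file hash equals the last hash Faucet reported, yet no new event arrives to clear the earlier mismatch.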

@pomodorox pomodorox requested a review from grafnu October 28, 2020 09:40
Collaborator

@grafnu grafnu left a comment

Added a timer which starts when Forch detects the very first hash mismatch.

Also comparing the hashes again when the timer times out. The reason is that while running Eng's test, Forch wrote the behavioral config in between Faucet reloads. For example, what happened was:
T1.

  1. Faucet reloaded config version 1
  2. Forch received an L2_EXPIRE event
  3. Forch wrote out behavior config version 2 due to the expiry event
  4. Forch received a CONFIG_CHANGE event with hash version 1

T2.

  1. Forch received a learning event and wrote out behavior config version 1

T3.

  1. Faucet reloaded config version 1, did nothing because config is unchanged from its perspective

I think the right answer here is that the timeout should be reset when forch writes a new config. It's a valid thing for forch to continually write out new configs, in which case the timer likely shouldn't expire. It's the other extreme case (from only one event).

self._config_hash_clashed = False
self._config_hash_clash_timeout_sec = (
self._config.event_client.config_hash_clash_timeout_sec or
int(os.getenv(
Collaborator

why env variable?

Contributor Author

Removed..

assert config_hash == config_info['hashes'], 'config hash info does not match'

if config_hash == config_info['hashes']:
self._attempt_cancel_config_hash_clash_timer()
Collaborator

I think in these cases I'd leave out the "attempt" -- it's not semantically meaningful

Contributor Author

Not using a separate timer now.

def _attempt_start_config_hash_clash_timer(self):
if self._config_hash_clash_timer and self._config_hash_clash_timer.is_alive():
return
self._config_hash_clash_timer = threading.Timer(
Collaborator

you should reuse one of the existing heartbeat schedulers for this -- creating lots of little threads ends up being very messy in the long run.

Contributor Author

Using _faucet_state_scheduler now..
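To illustrate the point about avoiding one `threading.Timer` per check, here is a sketch of a single shared periodic scheduler. `PeriodicScheduler` and its methods are hypothetical and are not the interface of Forch's existing heartbeat schedulers or of `_faucet_state_scheduler`.

```python
import threading


class PeriodicScheduler:
    """Hypothetical stand-in for a shared heartbeat scheduler (not Forch's real class)."""

    def __init__(self, interval_sec):
        self._interval_sec = interval_sec
        self._callbacks = []
        self._stop = threading.Event()
        self._thread = None

    def add_callback(self, callback):
        """Register work to run on every tick instead of spawning a new timer thread."""
        self._callbacks.append(callback)

    def tick(self):
        for callback in self._callbacks:
            callback()

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval_sec):
            self.tick()

    def stop(self):
        self._stop.set()
        if self._thread:
            self._thread.join()
```

With this shape, a hash check becomes one more callback on a thread that already exists, so the number of threads stays fixed no matter how often a check is requested.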

self._faucet_collector.process_dataplane_config_change(timestamp, faucet_dps)

def _attempt_start_config_hash_clash_timer(self):
Collaborator

this shouldn't be "start a timer" -- but rather just a mark of when the last config hash clash was

'Config hash does not match after %s seconds', self._config_hash_clash_timeout_sec)
self._config_hash_clashed = True

def _get_config_hash_clashed(self):
Collaborator

private getter is redundant -- just use the variable directly

Contributor Author

removed

@@ -212,9 +212,11 @@ def _dispatch_faucet_event(self, target, target_event):
     def _should_log_event(self, event):
         return event and os.getenv('FAUCET_EVENT_DEBUG')

-    def next_event(self, blocking=False):
+    def next_event(self, get_config_hash_clashed, blocking=False):
Collaborator

this is not right -- the concept of a config hash shouldn't bleed down into the event client itself

Contributor Author

removed
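One way to keep config hash state out of the event client, sketched under assumptions: `EventClient` and `Orchestrator` here are simplified hypothetical classes, and only the `next_event(blocking=False)` signature and the `_config_hash_clashed` flag are taken from the diff fragments in this thread.

```python
from collections import deque


class EventClient:
    """Hypothetical minimal event source; it knows nothing about config hashes."""

    def __init__(self, events):
        self._events = deque(events)

    def next_event(self, blocking=False):
        # blocking is accepted only to mirror the original signature; it is ignored here.
        return self._events.popleft() if self._events else None


class Orchestrator:
    """Hypothetical owner of the clash flag; the client is never told about it."""

    def __init__(self, event_client):
        self._event_client = event_client
        self._config_hash_clashed = False
        self.handled = []

    def mark_config_hash_clashed(self):
        self._config_hash_clashed = True

    def run_once(self):
        """Handle one event; return False when the loop should stop."""
        if self._config_hash_clashed:
            return False
        event = self._event_client.next_event()
        if event is None:
            return False
        self.handled.append(event)
        return True
```

The caller decides whether to keep polling, so `next_event` keeps its original signature.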

@pomodorox
Contributor Author

pomodorox commented Oct 28, 2020

Added a timer which starts when Forch detects the very first hash mismatch.
Also comparing the hashes again when the timer times out. The reason is that while running Eng's test, Forch wrote the behavioral config in between Faucet reloads. For example, what happened was:
T1.

  1. Faucet reloaded config version 1
  2. Forch received an L2_EXPIRE event
  3. Forch wrote out behavior config version 2 due to the expiry event
  4. Forch received a CONFIG_CHANGE event with hash version 1

T2.

  1. Forch received a learning event and wrote out behavior config version 1

T3.

  1. Faucet reloaded config version 1, did nothing because config is unchanged from its perspective

I think the right answer here is that the timeout should be reset when forch writes a new config. It's a valid thing for forch to continually write out new configs, in which case the timer likely shouldn't expire. It's the other extreme case (from only one event).

I have run the test again, and it seems that simply resetting the timer (setting clash_start_time to the current time) does not solve the issue. The problem is that the timer is only cleared when there is a CONFIG_CHANGE event. In the test's case, there was never a new CONFIG_CHANGE event after the previous mismatch, because Forch changed the config file back to the previous version between the two Faucet reloads. I guess we need to compare the hashes when the timer times out, and clear the timer (setting clash_start_time to None) if the hashes match.

@pomodorox pomodorox requested a review from grafnu October 28, 2020 19:49
@grafnu
Collaborator

grafnu commented Oct 28, 2020 via email

@grafnu
Collaborator

grafnu commented Oct 28, 2020 via email

@pomodorox
Contributor Author

pomodorox commented Oct 28, 2020

something like this work, then, where when the time after an intentional config chang

Yes, I think that should work. So we no longer need to start the timer when we detect a mismatch, right? The new logic seems to be: "After Forch writes a config, we expect Faucet to reload it and apply the changes within XXX seconds, and the new hashes should match. Or, if Faucet sees the config as the same as the previous one, the hash of the file should match the last hash sent out by Faucet".

I will go ahead updating the PR..
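The write-triggered logic summarized above could look roughly like the sketch below. `ConfigHashVerifier`, its method names, and the injectable `clock` are assumptions for illustration; the merged change may be structured differently. Only the error message text is borrowed from a diff fragment in this thread.

```python
import time


class ConfigHashVerifier:
    """Hypothetical sketch of write-triggered verification; names are illustrative."""

    def __init__(self, timeout_sec, clock=time.time):
        self._timeout_sec = timeout_sec
        self._clock = clock
        self._verify_at = None      # deadline set by the most recent config write
        self._written_hash = None   # hash of the file Forch last wrote
        self._faucet_hash = None    # hash from the last CONFIG_CHANGE event

    def on_config_written(self, file_hash):
        """Each write (re)schedules a check, so continual writes keep moving the deadline."""
        self._written_hash = file_hash
        self._verify_at = self._clock() + self._timeout_sec

    def on_config_change(self, event_hash):
        self._faucet_hash = event_hash

    def on_heartbeat(self):
        """Called periodically by an existing scheduler; checks only once the deadline passes."""
        if self._verify_at is None or self._clock() < self._verify_at:
            return
        self._verify_at = None
        if self._written_hash != self._faucet_hash:
            raise RuntimeError(
                'Config hash does not match after %s seconds' % self._timeout_sec)
```

Comparing against the last hash Faucet reported handles both branches of the quoted logic: a fresh reload updates that hash, and a write that restores the previous version already matches it. A write that Faucet never acknowledges fails the check at the deadline.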

@grafnu
Collaborator

grafnu commented Oct 28, 2020 via email

@pomodorox
Contributor Author

Correct... The triggering event is the config write, not the bad hash. Also
covers the case where forch writes a config, but there is never a response.

updated the PR. PTAL..

Collaborator

@grafnu grafnu left a comment

Please update title (no longer max retry times)

Thanks for working on this -- I know it was a lot of back and forth, but in the end I think we got to the right solution that's more robust than before (e.g., the case of "no config event at all" was not checked previously!)

@pomodorox pomodorox changed the title Set max retry times for config hash match check Add scheduler for config hash verficiation Oct 29, 2020
@pomodorox pomodorox changed the title Add scheduler for config hash verficiation Add scheduling for config hash verficiation Oct 29, 2020
@pomodorox pomodorox merged commit 7e9a5f3 into faucetsdn:master Oct 29, 2020
grafnu pushed a commit to grafnu/forch that referenced this pull request Oct 29, 2020
* Scheduling config hash verification each time Forch writes behavioral config
@pomodorox pomodorox deleted the confighash branch November 19, 2020 03:55