Replies: 8 comments 6 replies
-
Reviewing
What if we handle these cases in three different consistency checks:
Some enhancements to
These are wild ideas discard them all if you see best. |
Beta Was this translation helpful? Give feedback.
-
In terms of what I think is doable from the issues listed, #437, #395, and #488 seem immediately doable. These three are fairly simple changes, all things considered. I even written a potential implementation for it already: def execute_consistency(self):
"""Execute consistency routine."""
# Stages:
# 1. For all enabled and not deployed circuits, try to deploy them
# 2. For all deployed circuits, make sure they have a failover path ready.
# 3. For all deployed circuits, check their traces are good, and if not, redeploy them.
circuits_to_trace = []
circuits_to_deploy = []
with contextlib.ExitStack() as stack:
for circuit in self.get_evcs_by_svc_level(enable_filter=False):
# If a circuit is active, then it has an implementation
if circuit.lock.acquire(blocking=False):
if circuit.is_recent_updated() or circuit.has_recent_removed_flow():
circuit.lock.release()
continue
stack.push(circuit.lock)
circuit.try_setup_failover_path()
if circuit.is_active():
circuits_to_trace.append(circuit)
elif circuit.is_enabled():
circuits_to_deploy.append(circuit)
# Passively try to deploy circuits that are enabled, but deactivated
for circuit in circuits_to_deploy:
circuit.deploy()
# Check the trace of active circuits
# Maybe we should check enabled but deactivated circuits as well? They could have
circuits_checked = EVCDeploy.check_list_traces(circuits_to_trace)
for circuit in circuits_to_trace:
is_checked = circuits_checked.get(circuit.id)
if is_checked:
log.info(f"{circuit} enabled, active, and trace good!")
circuit.execution_rounds = 0
else:
circuit.execution_rounds += 1
log.info(f"{circuit} enabled, active, but trace failed {circuit.execution_rounds} times.")
if circuit.execution_rounds > settings.WAIT_FOR_OLD_PATH:
log.info(f"{circuit} enabled, active, but trace failed too many times - redeploying")
circuit.deploy() However, I don't think doing them as soon as possible is the best decision. I think instead the effort should first be focused on resolving race conditions within |
Beta Was this translation helpful? Give feedback.
-
In order to implement #517 and #444, we need a way to keep track of past paths. There are few things we can do to help with this:
|
Beta Was this translation helpful? Give feedback.
-
This is mainly a comment on the changes to paths that I have discussed, and my plans on how we may implement it. There are a few goals with these changes:
To achieve these changes, I propose a few things:
These changes will affect how Going forward I'm going to be working on two things, outlining new/updated DB models, and prototyping moving logic to the DB. From there we can get a clearer picture of what the path ahead looks like. |
Beta Was this translation helpful? Give feedback.
-
Thanks @Ktmi for the detailed documentation and new issues. @italovalcy @viniarck @Alopalao Do you see any conflicts with the suggestions above? |
Beta Was this translation helpful? Give feedback.
-
Great ideas and lots of good stuff here. Two main points to be addressed, I'll keep it short (in the next two threaded comments): |
Beta Was this translation helpful? Give feedback.
-
The proposed change to fully decouple paths from EVCs is a massive data model change, although it's a good idea and has potential for other path reusabilibility optimizations in the future, and it's great to also have path status about its flow installation|removal state. For 2024.2 version, I'll recommend that you try to find and propose a more incremental and specific solution(s) that's still close to the current data model while trying to minimize significant changes to ensure the existing code stability and maintenance, while still ensuring that converge still be on par with current performance, and laser focus on ensuring that |
Beta Was this translation helpful? Give feedback.
-
Still needs to be specified how the active consistency check will ensure no false positives and if it'll be able to run It's also important to discuss if the consistency will only rely on |
Beta Was this translation helpful? Give feedback.
-
There are a few processes we would like to rework in regards to consistency check:
EVCs modified without holding lock, may lead to race conditions #512 - Protect EVCs during the consistency check process. It's possible that an EVC may be modified during consistency check, as the EVC lock isn't maintained throughout the process.
Replace consistency check with routine for checking traces #437 - Perform traces on all deployed (active) EVCs. Currently traces are only performed on inactive EVCs to check if they are deployed, in order to set them to active. However with the current implementation of
mef_eline
, all deployed EVCs are set as active, and maintain the active state between restarts.consistency: redeploy EVCs that are enabled and inactive and now have paths to be deployed #395 - Periodically try to deploy EVCs that are inactive. There could be cases where a path is available for an EVC, but the EVC wasn't deployed due to some unexpected factor.
Consistency check: remove orphan flows from an EVC #444 - Clearing orphan flows. EVCs may not have properly deleted flows from a previous path.
Optimize consistency check routine to avoid trying to activate EVCs unnecessarily #247 - Prevent EVCs from trying to redeploy too often. This is mainly a constraint on trying to deploy inactive EVCs.
Annoying log messages about fail to establish a failover path every 60s #488 - Reduce annoying log messages about problems that operators can not address.
Of these issues, the main problematic one I'm seeing is #444. Identifying orphan flows is a non trivial issue, being that we must identify flows pertaining to the EVC, that we might not know if they exist, or where they may exist.
Edit: New issue just dropped
old_path
, resulting inold_path
not being properly cleared. #517 - Multiple link down events can cause a race condition in regards to handlingold_path
. This can cause the old paths to not be cleared, and leave there flows hanging.Beta Was this translation helpful? Give feedback.
All reactions