-
That's very interesting. Can we centralize this so that only the consistency check is responsible for recovering and fixing it? Then we can also provide a new endpoint that schedules a new execution of the consistency check, so users can force or expedite a new run. This is also partly related to requirement #114; the original requirement will be covered on … Let's also plan e2e tests for this new endpoint.
-
@Alopalao great proposal and issue discussion. From my perspective, consider it approved overall, +1 from me. You can move on with the implementation of the parts where the agreement is clear. For any other parts that overlap with the consistency check, let's also include @Ktmi and make sure we're all on the same page. Thanks, team.
-
The scope of this epic is to handle errors detected when installing and deleting flows from EVC paths. Once this epic is finished, issue #476 will be closed.

Cause of incorrect installation and deletion of flows

TLDR: Flow requests are sent per switch used in an EVC. The number of requests rises fast when dealing with hundreds of EVCs in a short time, causing `Service Unavailable`.

Why this error can happen
When managing hundreds of EVCs, it is possible to get the error `Service Unavailable` from `flow_manager`. The API request port of `flow_manager` is the one that breaks first because of the number of requests sent for EVCs. For example:

For an EVC from `00:01` (200) to `00:02` (200), non-NNI:

`flow_manager` allows installation by switch (or on all switches), so `mef_eline` sends one request per switch.

From the previous example, the number of requests and flows per path would be:

- `current_path` -> 2 requests to install 4 flows (2 on 00:01 and 2 on 00:02)
- `failover_path` -> 3 requests to install 4 flows (1 on 00:01, 2 on 00:03 and 1 on 00:02)

In total, 5 requests are sent to create this EVC. This causes the error `Service Unavailable` when the number of EVCs is in the hundreds.

Problem with the current handling of this error
TLDR: The current implementation tries to delete paths that errored by sending more requests, which are very likely to get `Service Unavailable` as well. The path is deleted in `mef_eline` but the flows are still installed.

More details in current approach
If a `Service Unavailable` error occurs, the next request will likely get the same error. For example, if the installation of the EVC fails on the second switch of the `current_path`:

- `mef_eline` detects the error and tries to send more requests to delete everything related to the EVC.
- But more requests do not help with `Service Unavailable`, so the EVC errors on all flows.

This leaves flows installed. The path, which was the only reference to these flows, has been deleted.
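To make the request arithmetic from the example concrete, here is a minimal sketch that counts requests under the per-switch behaviour described above. The dictionary layout and function names are hypothetical, and the scaling to many EVCs assumes every EVC has the same shape as the example, which is a simplification.

```python
# Flows per switch for each path, taken from the example in this post.
# Keys are switch IDs; values are the number of flows installed on that switch.
EVC_PATHS = {
    "current_path": {"00:01": 2, "00:02": 2},
    "failover_path": {"00:01": 1, "00:03": 2, "00:02": 1},
}


def requests_per_path(paths):
    """One request per switch in each path (the per-switch behaviour)."""
    return {name: len(flows_by_switch) for name, flows_by_switch in paths.items()}


def flows_per_path(paths):
    """Total number of flows installed for each path."""
    return {name: sum(flows_by_switch.values()) for name, flows_by_switch in paths.items()}


def total_requests(paths, evc_count=1):
    """Requests needed to create `evc_count` EVCs shaped like `paths` (assumption)."""
    return evc_count * sum(requests_per_path(paths).values())


print(requests_per_path(EVC_PATHS))               # {'current_path': 2, 'failover_path': 3}
print(flows_per_path(EVC_PATHS))                  # {'current_path': 4, 'failover_path': 4}
print(total_requests(EVC_PATHS))                  # 5
print(total_requests(EVC_PATHS, evc_count=100))   # 500
```

Any cleanup attempted after a failure adds further requests on top of these numbers, which is why the cleanup is likely to hit the same error.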
Solution for errored paths and approach to minimize the occurrence of `Service Unavailable`

Keep the failed path, add `status_reason` and try again later, related issue #511

The error `Service Unavailable` cannot be avoided. With enough requests, it will eventually happen. This area deals with this and any other error while installing or deleting flows.

Sending more requests when a path fails is not ideal. Instead, we could implement a new field for the errored EVC.
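As a rough illustration of such a field, the sketch below records a failure on the EVC instead of sending more requests. The exact structure of the field did not survive in this thread, so the key names `action` and `error`, the plain-dict EVC, and both helper functions are assumptions made only for this example; the allowed actions `Redeploy` and `Delete` come from the proposal.

```python
import logging

log = logging.getLogger(__name__)

# Actions named in the proposal; the tuple itself is illustrative.
ALLOWED_ACTIONS = ("Redeploy", "Delete")


def build_status_reason(action, exc):
    """Build a hypothetical status_reason value from a caught exception."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return {"action": action, "error": str(exc)}


def record_path_failure(evc, action, exc):
    """Log the error and store it on the EVC; no extra requests are sent."""
    log.error("EVC %s failed: %s", evc.get("id"), exc)
    evc["status_reason"] = build_status_reason(action, exc)
    return evc


# Usage: the EVC is a plain dict here purely for illustration.
evc = {"id": "evc1", "current_path": ["link1"]}
record_path_failure(evc, "Redeploy", RuntimeError("Service Unavailable"))
print(evc["status_reason"])
```

The failed path stays on the EVC, so whatever runs later still has a reference to the flows that may need fixing.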
The `status_reason` field (previously called `error_status`) will be set as:

Where action specifies what needs to be done so the EVC is fixed; it can be `Redeploy|Delete`. Error is the message from the exception caught.

New basic mechanic for path handling
In the case of path installation:

`current_path`
2.1. Path deletion errored, log error, set error in `status_error`, leave previous path unchanged, sync
2.2. Return
`status_reason`
6.1. Path deletion errored, log error, set error in `status_error`, set new path, sync
6.2. Return
In the case of path deletion:

`current_path` or `failover_path` is being deleted
1.1.1. Path does not exist
1.1.2. Return
2.1. Requests errored, log error
2.2. Raise exception `EVCPathNotDeleted` (New exception)

Additional changes
- `EVCPathNotDeleted` to be raised when a deletion error happens.
- `EVCPathNotDeleted` when deleting EVC. If an error happens, the EVC will not be deleted. It will just be disabled and deactivated.
- `status_error`, issue "UI: Show that an EVC has an error" #510
- `status_error`, issue "Error handling when installing and deleting flows" #495. More information in the next section.

API requests dealing with EVC with failed paths, issue #495
Consistency check can handle EVCs with `status_error`, but the user can do it as well. (Changes are due)

Add a POST request to schedule a consistency check for EVCs with detected errors so that these are corrected. Depending on how consistency check version 2 evolves, the implementation of this issue will differ a bit.
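One way these pieces could fit together is sketched below: a failed deletion raises `EVCPathNotDeleted`, the EVC is kept but disabled and deactivated, and a later consistency check run (periodic, or scheduled through the proposed POST request) retries it. Only the exception name comes from the proposal. The dict-based EVC, the `send_delete` callable, the `failed_circuits` set passed as an argument, and all function names are assumptions for illustration.

```python
class EVCPathNotDeleted(Exception):
    """Raised when the flows of a path could not be removed."""


def remove_path_flows(evc, path_name, send_delete):
    """Delete one path. `send_delete` is a hypothetical callable that raises on failure."""
    path = evc.get(path_name)
    if not path:
        return  # path does not exist, nothing to do
    try:
        send_delete(path)
    except Exception as exc:
        # Do not send more requests here; just report the failure upwards.
        raise EVCPathNotDeleted(f"{path_name}: {exc}") from exc
    evc[path_name] = []


def delete_evc(evc, send_delete, failed_circuits):
    """Try to delete an EVC; on error keep it, disabled and deactivated."""
    try:
        for path_name in ("current_path", "failover_path"):
            remove_path_flows(evc, path_name, send_delete)
    except EVCPathNotDeleted as exc:
        evc["enabled"] = False
        evc["active"] = False
        evc["status_reason"] = {"action": "Delete", "error": str(exc)}
        failed_circuits.add(evc["id"])
        return False
    evc.pop("status_reason", None)
    failed_circuits.discard(evc["id"])
    return True


def run_consistency_check(circuits, failed_circuits, send_delete):
    """Retry every failed EVC once; sorted() copies the set so it can be mutated."""
    for evc_id in sorted(failed_circuits):
        delete_evc(circuits[evc_id], send_delete, failed_circuits)
```

Tracking the failed IDs in a set means a consistency run only has to visit the EVCs that actually errored instead of searching all of them.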
Additional changes

- To avoid searching for EVCs, add a new variable `self.failed_circuits` to keep track of any failed EVCs.
- `.../api/kytos/mef_eline/v2/evc/?failed_circuits=true`.

Reduce the number of requests, related issue #514 (Implementing)
`Service Unavailable` is tied to the number of requests, so sending a single request per path significantly reduces the possibility of this error showing up again.

A new request in `flow_manager` -> POST `.../api/kytos/flow_manager/v2/flows/flows_by_switch` with content:

Extra issues:

- `flow_manager`, issue. Implement the receiving of flows in a bulk of switches.

Note: The scope of `epic_mef_eline_consistency_v2` is larger and can close this epic. Keeping an eye on how that epic evolves.
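The single-request idea from the "Reduce the number of requests" section can be illustrated by grouping the flows of a path by switch before sending them. The request body for `flows_by_switch` was not preserved in this thread, so the mapping below (switch ID to list of flows) and the placeholder flow dictionaries are assumptions, not the actual `flow_manager` format.

```python
from collections import defaultdict


def group_flows_by_switch(flows):
    """Group (switch_id, flow) pairs into {switch_id: [flow, ...]}."""
    grouped = defaultdict(list)
    for switch_id, flow in flows:
        grouped[switch_id].append(flow)
    return dict(grouped)


# Flows of the failover_path from the example: 1 on 00:01, 2 on 00:03, 1 on 00:02.
# The flow contents are placeholders.
failover_flows = [
    ("00:01", {"name": "flow-a"}),
    ("00:03", {"name": "flow-b"}),
    ("00:03", {"name": "flow-c"}),
    ("00:02", {"name": "flow-d"}),
]

payload = group_flows_by_switch(failover_flows)

requests_per_switch = len(payload)  # current behaviour: one request per switch
requests_per_path = 1               # proposed behaviour: one request with the whole mapping

print(requests_per_switch, requests_per_path)  # 3 1
```

For the example EVC this would bring the total from 5 requests down to 2, one for each path.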