Augmented blueprint EP031
viniarck committed Jul 26, 2023
1 parent 8851228 commit f78ce08
Showing 1 changed file with 47 additions and 20 deletions.
67 changes: 47 additions & 20 deletions docs/blueprints/EP031.rst
@@ -5,8 +5,8 @@
- Italo Valcy <idasilva AT fiu DOT edu>
- Vinicius Arcanjo <vindasil AT fiu DOT edu>
:Created: 2022-08-24
:Updated: 2022-11-07
:Kytos-Version: 2022.3
:Updated: 2023-07-26
:Kytos-Version: 2023.2
:Status: Draft

****************************************
@@ -238,7 +238,7 @@ The goal for the **telemetry_int** napp is to enable telemetry for ALL EVCs. How

1. The **telemetry_int** napp will start operating once **mef_eline** is loaded and EVCs and their flows are pushed to the data plane.

2. **telemetry_int** will listen for events *kytos/mef_eline.(redeployed_link_(up|down)|deployed)* and *kytos.mef_eline.created* issued by **mef_eline**.
2. **telemetry_int** will listen for events *kytos/mef_eline.(redeployed_link_(up|down)|deployed|undeployed)* and *kytos.mef_eline.created* issued by **mef_eline**.

3. For each EVC identified, **telemetry** will
1. use EVC's cookie to get all flow entries created by **flow_manager** IF telemetry is not already enabled.
@@ -249,7 +249,8 @@ V. Events
==========

1. Listening
1. *kytos/mef_eline.(removed|deployed)*
1. *kytos/mef_eline.(removed|deployed|undeployed)*
2. *kytos.topology.updated*

2. Issuing
1. *kytos.telemetry.enabled*
@@ -259,13 +260,10 @@ V. Events
VI. REST API
=============

- POST /telemetry_int/v1/evc/ body evc_ids: [] for bulk insertions, if empty, then enable all. If invalid or non-existing EVC_ID are provided, abort the entire operation with 4XX status code.
- POST /telemetry_int/v1/evc/<evc_id>: enable/create INT flows for an EVC_ID.
- DELETE /telemetry_int/v1/evc/ body evc_ids: [] for bulk removals, if empty, then remove all. If invalid or non-existing EVC_ID are provided, abort the entire operation with 4XX status code.
- DELETE /telemetry_int/v1/evc/<evc_id>: disable/remove INT flows for an EVC_ID.
- GET /telemetry_int/v1/evc list all INT-enabled EVCs.
- POST /telemetry_int/v1/consistency/ body evc_ids: [] - Force the consistency routine to run for the evc_ids provided. If none are provided, force it for all EVCs.

- ``POST /telemetry_int/v1/evc/enable`` with body ``evc_ids: []``: bulk enable. If the list is empty, enable all EVCs. If any invalid or non-existing EVC ID is provided, abort the entire operation with a 4XX status code.
- ``POST /telemetry_int/v1/evc/disable`` with body ``evc_ids: []``: bulk disable. If the list is empty, disable all EVCs. If any invalid or non-existing EVC ID is provided, abort the entire operation with a 4XX status code.
- ``GET /telemetry_int/v1/evc``: list all INT-enabled EVCs.
- ``GET /telemetry_int/v1/evc_compare``: list and compare which ``telemetry_int`` flows are still coherent with the EVC metadata status.
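
The all-or-nothing rule for the bulk endpoints can be sketched in Python as below. This is only an illustration under assumptions: ``resolve_evc_ids`` and ``UnknownEVCError`` are hypothetical names that do not come from the blueprint, and the mapping of the exception to a 4XX response is left to the HTTP layer.

```python
class UnknownEVCError(ValueError):
    """Hypothetical error: the request named EVC IDs that do not exist."""


def resolve_evc_ids(requested, known):
    """Return the EVC IDs a bulk enable/disable request should act on.

    An empty request selects every known EVC. Any unknown ID aborts the
    whole request, which an HTTP layer could translate into a 4XX response.
    """
    known = set(known)
    if not requested:
        return sorted(known)
    unknown = sorted(set(requested) - known)
    if unknown:
        raise UnknownEVCError(f"unknown EVC IDs: {', '.join(unknown)}")
    return list(requested)
```

Validating the full list before touching any EVC is what keeps the operation atomic from the caller's point of view: either every requested EVC is processed or none is.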

VII. Dependencies
=================
@@ -280,10 +278,10 @@ VII. New EVC attribute

The **telemetry_int** napp will leverage the EVC's metadata attribute to create a new item, called `telemetry`. This new item will be a dictionary with the following values:

* "enabled": [True|False]
* "source": dpid/name of the switch to be used as the INT Source switch (Future use).
* "sink": dpid/name of the switch to be used as the INT Sink switch (Future use).
* "last_enabled": timestamp of when the item "enabled" changed. 0 for never.
* "status": "enabled|disabled"
* "status_reason": ["some_error"]
* "last_enabled_at": datetime of when "status" last changed to "enabled". null for never.
* "last_disabled_at": datetime of when "status" last changed to "disabled". null for never.
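
As a rough illustration of the new ``telemetry`` item, the sketch below builds the dictionary in Python. The helper name ``build_telemetry_metadata``, its parameters, and the idea of carrying timestamps over from a previous value are assumptions made for this example, not part of the blueprint.

```python
from datetime import datetime, timezone


def build_telemetry_metadata(enabled, reasons=(), previous=None, now=None):
    """Return a new ``telemetry`` dict for an EVC's metadata (sketch only).

    ``previous`` is the prior telemetry dict, if any, so that the timestamp
    of the opposite transition is preserved (an assumption of this sketch).
    """
    now = now or datetime.now(timezone.utc)
    previous = previous or {}
    metadata = {
        "status": "enabled" if enabled else "disabled",
        "status_reason": list(reasons),
        "last_enabled_at": previous.get("last_enabled_at"),
        "last_disabled_at": previous.get("last_disabled_at"),
    }
    # Only the timestamp matching the new status is refreshed.
    key = "last_enabled_at" if enabled else "last_disabled_at"
    metadata[key] = now
    return metadata
```

A timestamp that was never set stays ``None``, which serializes to ``null`` as described in the list above.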

IX. Failover integration
========================
@@ -298,15 +296,44 @@ The **telemetry_int** napp must use a different cookie ID to help understanding
XI. Consistency
===============

The **telemetry_int** napp might deploy a routine to evaluate the consistency of the telemetry flows as performed by the **mef_eline** napp. This implementation will be defined via field experience with Kytos.
The **telemetry_int** napp might deploy a routine to evaluate the consistency of the telemetry flows as performed by the **mef_eline** napp. This implementation will be defined via field experience with Kytos. Ideally, the consistency check should also rely on ``sdntrace_cp`` and follow the same pattern as ``mef_eline``, except that when tracing it should test both UDP and TCP payloads; if either fails, it should eventually try to redeploy. As of ``sdntrace_cp`` version ``2023.1``, it still doesn't completely support ``goto_table`` or ``instructions``, so it needs to be augmented before ``telemetry_int`` can rely on it. In the meantime, once a telemetry INT EVC is deployed it will be considered enabled, including after a controller reload. This might not reflect the actual state, but since ``flow_manager`` also has a consistency check, even if flows are accidentally deleted on switches it will eventually push the missing flows again.

XII. Pacing
===========

The **telemetry_int** napp must wait a *settings.wait_to_deploy* interval before sending instructions to the flow_manager after EVCs are created/modified/redeployed to avoid overwhelming the switches. The goal is to create batch operations.
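
One way to picture the pacing requirement is a small batcher that collects EVC IDs and releases them together once the wait interval has elapsed. The ``DeployBatcher`` class and its method names are hypothetical, and the clock is passed in explicitly so the sketch stays deterministic; a real napp would likely use its own timer or event loop.

```python
class DeployBatcher:
    """Collect EVC IDs and release them as one batch after a wait interval."""

    def __init__(self, wait_to_deploy):
        # Seconds to wait, mirroring settings.wait_to_deploy.
        self.wait_to_deploy = wait_to_deploy
        # Dict used as an insertion-ordered set of pending EVC IDs.
        self._pending = {}
        self._deadline = None

    def add(self, evc_id, now):
        """Queue an EVC; the first queued EVC starts the wait interval."""
        if self._deadline is None:
            self._deadline = now + self.wait_to_deploy
        self._pending[evc_id] = None

    def flush(self, now):
        """Return the batch once the interval has elapsed, else an empty list."""
        if self._deadline is None or now < self._deadline:
            return []
        batch = list(self._pending)
        self._pending.clear()
        self._deadline = None
        return batch
```

Repeated events for the same EVC within the interval collapse into a single entry, so flow_manager would receive one batched request instead of one per event.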

XI. Open Questions
==================
XIII. Implementation details ``v1``
===================================

The following requirements clarify certain details and expected behavior for ``telemetry_int`` v1 that will be shipped with Kytos-ng ``2023.2``:

- ``mef_eline`` EVC ``telemetry`` metadata is managed by ``telemetry_int``: **only** ``telemetry_int`` **is supposed to write or delete it**. To enable or disable INT, call the ``POST /telemetry_int/v1/evc/enable`` or ``POST /telemetry_int/v1/evc/disable`` endpoints. ``telemetry_int`` will not listen for EVC metadata changes since it manages the metadata itself.

- Once ``mef_eline`` creates an EVC, it can optionally request that INT be provisioned. For this case, a ``telemetry_request: dict`` needs to be set in the metadata. Currently no keys are needed, but as more options are supported in the future, they can be set. If ``telemetry_int`` can't provision INT, then it'll set the ``telemetry: {"status": "disabled", "status_reason": ["<reason>"]}`` metadata, updating the status and filling out the reason accordingly.

- Currently, EVCs are always bidirectional. The ``telemetry_int`` v1 iteration will also follow the bidirectional flows as described in the prior sections. In the future, when ``mef_eline`` starts to support unidirectional flows, following the flows should be mostly seamless. This facilitates implementation and code maintenance without having to derive the direction of all flows and maintain a structure that ``mef_eline`` still doesn't support.

- ``telemetry_int`` will require a looped link on each source sink for both intra and inter EVCs. If it's not present, ``telemetry_int`` will not enable INT, which implies that in this v1 iteration you'll always need a proxy port (check out EP033 for more information) associated with both UNIs, since the EVC is bidirectional. Although the EVC is bidirectional, the looped ports are used unidirectionally for each INT source. Always knowing explicitly that both UNIs need a proxy port makes it easier to keep track of proxy port changes and to perform the corresponding side effect.

- If a UNI's proxy port value changes to another port, then ``telemetry_int`` should reinstall the specific associated EVC sink flows accordingly. Similarly, if ``proxy_port`` is removed, it should remove all associated ``telemetry_int`` flows. Essentially, changing the ``proxy_port`` metadata acts like an update as far as a telemetry-enabled EVC is concerned.

- If any other NApp or client accidentally deletes or overwrites the ``telemetry`` metadata, it might result in flows being permanently installed in the database. If this ever happens, the following approaches can be used to fix it: a) ``POST /telemetry_int/v1/evc/enable`` and ``POST /telemetry_int/v1/evc/disable`` will allow a ``force`` boolean flag, which will ignore whether an EVC exists or not, so it'll either provision or decommission accordingly. b) ``telemetry_int`` will also expose a ``GET /telemetry_int/v1/evc_compare`` endpoint, which will compare which ``telemetry_int`` flows still have the metadata enabled or not and generate a list indicating inconsistencies. You can then use that list with the option a) endpoints and the ``force`` option to either enable or disable again. It will not try to auto remediate.

- When configuring the proxy port, it always needs to be the lower interface number (which is also guaranteed by LLDP loop detection), e.g., if you have a loop between interface port numbers 5 and 6, you need to configure 5 as the proxy port. By this convention, the lower port will be the outgoing port for incoming NNI traffic.

- Once an EVC is redeployed, ``telemetry_int`` will also redeploy accordingly. Also, to ensure fast convergence when handling link down for EVCs that have failover, a typical query to stored flows is expected not to add significant latency, since it queries indexed fields. This point will be observed to determine whether it performs as expected or whether more optimization is needed from the ``telemetry_int`` perspective.

- If a proxy port source or destination status changes to DOWN or DISABLED (or if the loop stops being a loop), all of the associated EVCs should be considered not active, similar to what has been implemented when a UNI changes its UP state. No flows should be removed. Currently, ``mef_eline`` doesn't allow an EVC to be deactivated, so this will need to be published via events. If EVCs also start to use ``status`` and ``status_reason``, then interruptions can also be used.

- If an EVC is deleted or removed and it has INT enabled, the flows should be removed.

- The only supported ``table_group`` for ``of_multi_table`` will be ``base``, which represents all flows that are specified on this blueprint to be on table 2. All the other flows will follow the ``table_group`` ``mef_eline`` uses. Also, since NoviWare's INT implementation requires ``send_report`` to be executed in table 0, and ``telemetry_int`` is following ``mef_eline`` then only table 0 should be allowed on ``of_multi_table`` when setting the pipeline if ``telemetry_int`` is also being set. So, in practice, in this iteration, you'll always need to have ``telemetry_int`` on table 0 + table X, where X > 0, and by default it will be on table 2 as documented.
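
Two of the requirements above lend themselves to a short Python sketch: the proxy port convention (the lower interface number of the looped pair) and the ``evc_compare`` listing of inconsistencies. The function names, argument shapes, and the inconsistency labels below are assumptions made for illustration; the blueprint does not define the response format.

```python
def proxy_port_for_loop(port_a, port_b):
    """Return the proxy port of a looped pair: the lower interface number."""
    return min(port_a, port_b)


def compare_evcs(evc_metadata, evc_ids_with_int_flows):
    """List EVCs whose telemetry metadata disagrees with installed INT flows.

    evc_metadata: dict of evc_id -> metadata dict (may lack "telemetry").
    evc_ids_with_int_flows: EVC IDs that currently have INT flows stored.
    The function only reports; it does not try to remediate anything.
    """
    with_flows = set(evc_ids_with_int_flows)
    inconsistencies = []
    for evc_id in sorted(set(evc_metadata) | with_flows):
        telemetry = evc_metadata.get(evc_id, {}).get("telemetry", {})
        enabled = telemetry.get("status") == "enabled"
        if enabled and evc_id not in with_flows:
            inconsistencies.append((evc_id, "enabled_without_flows"))
        elif not enabled and evc_id in with_flows:
            inconsistencies.append((evc_id, "flows_without_enabled_metadata"))
    return inconsistencies
```

An operator could feed the resulting list into the enable or disable endpoints with the ``force`` flag, which matches the manual remediation flow described above.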

XIV. Open Questions / Future Work
=================================

1. Who's going to monitor status of proxy ports to remove INT flows?
2. Error codes, for instance: flows were not installed, there are no proxy ports
1. Error codes, for instance: flows were not installed, there are no proxy ports
2. Support QFactor (where INT is also extended to the hosts). In this case, the source and the sink should behave like an INT hop, only using the `add_int_metadata` action.
3. Support unidirectional EVCs
4. Potentially support a specific different "source" and "sink"
