
bug: after a link went down with 300+ static EVCs, it didn't converge to backup_path #485

Open
viniarck opened this issue Jul 25, 2024 · 3 comments
Labels: bug (Something isn't working), priority_major (Major priority)

Comments

@viniarck
Member

viniarck commented Jul 25, 2024

Scenario:

  • INT Lab ring topology with 300 static EVCs, each with a primary_path and a backup_path between Novi01 and Novi06
  • Performed a link down on interface 11 of Novi01 (forcing the primary_path to go down)
  • During network convergence, mef_eline sends many delete requests to flow_manager and doesn't handle the resulting errors completely; this is also related to issue #483 (mef_eline high concurrency EVC deletion can leak flows, i.e. leave them installed). Worse, in this case it got stuck without converging to the backup_path, which is really problematic since it means a full outage, and it would only recover on a subsequent network link up event. See below one example of a problematic EVC that didn't converge.

It's expected that increasing api_concurrency_limit will make this go away; I'll run this test again to confirm it behaves as expected. However, this is already a sign that the default value isn't suitable for handling 300+ EVCs, so this needs to be better documented. Finally, mef_eline should handle this error more completely: logging it is the minimum expected, but doing nothing after that is problematic. It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of the error handling); a rough sketch of that idea follows.
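
To make that "mark it for a next deployment" idea concrete, here's a minimal sketch, assuming hypothetical names (Evc, delete_flows, redeploy_pending) rather than mef_eline's actual classes and helpers: on an error from flow_manager the EVC id is remembered and retried later, instead of only being logged.

    # Minimal sketch only, not mef_eline's actual code: Evc, delete_flows and
    # redeploy_pending are hypothetical names used for illustration.
    import logging
    import threading
    import time

    import httpx  # mef_eline already talks to flow_manager over HTTP

    log = logging.getLogger(__name__)


    class Evc:
        def __init__(self, evc_id: str) -> None:
            self.id = evc_id


    # EVC ids whose flow removal/redeploy hit "Service Unavailable" and still need work
    _pending_redeploy: set[str] = set()
    _pending_lock = threading.Lock()


    def delete_flows(evc: Evc, dpid: str, flows: list) -> None:
        """POST a delete to flow_manager; on failure, mark the EVC for a retry."""
        url = f"http://127.0.0.1:8181/api/kytos/flow_manager/v2/delete/{dpid}"
        try:
            resp = httpx.post(url, json={"flows": flows}, timeout=10)
            resp.raise_for_status()  # a 503 ends up here as an HTTPStatusError
        except httpx.HTTPError as exc:
            log.error("Error removing flows from switch %s for EVC %s: %s", dpid, evc.id, exc)
            with _pending_lock:
                _pending_redeploy.add(evc.id)  # don't stop at logging: remember it


    def redeploy_pending(deploy_fn, interval: float = 30.0) -> None:
        """Periodically retry EVCs that failed to converge to the backup_path."""
        while True:
            time.sleep(interval)
            with _pending_lock:
                evc_ids = set(_pending_redeploy)
                _pending_redeploy.clear()
            for evc_id in evc_ids:
                deploy_fn(evc_id)  # e.g. trigger a fresh deploy to the backup_path

Logs captured during the convergence window: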

2024-07-25 13:10:58,456 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,461 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,493 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49008 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 202
2024-07-25 13:10:58,500 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49056 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 202
2024-07-25 13:10:58,508 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49526 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,509 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49558 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,511 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49588 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,513 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_82) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(d4caab37f41d43, evc89): Service Unavailable
2024-07-25 13:10:58,512 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49732 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 503
2024-07-25 13:10:58,515 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_253) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(44e9dd2ce52142, evc254): Service Unavailable
2024-07-25 13:10:58,513 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49758 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,519 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_248) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(f74e8067494148, evc247): Service Unavailable
2024-07-25 13:10:58,520 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_232) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(d97b5c7966994d, evc231): Service Unavailable
2024-07-25 13:10:58,516 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49784 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,522 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_255) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0680b6a67c6d45, evc255): Service Unavailable
2024-07-25 13:10:58,523 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_300) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0bcef7fe92bf44, evc295): Service Unavailable
2024-07-25 13:10:58,517 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49786 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,525 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_265) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(a2ffb85c69ec4d, evc264): Service Unavailable
2024-07-25 13:10:58,527 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_153) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(30b819b1572b4e, evc152): Service Unavailable
2024-07-25 13:10:58,520 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49842 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,530 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_296) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(5bcfdb8283f447, evc299): Service Unavailable
2024-07-25 13:10:58,521 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49936 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,532 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_148) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(5c168c6ec4ca4a, evc143): Service Unavailable

(attached screenshot: 20240725_131549)

{
    "active": false,
    "archived": false,
    "backup_path": [
        {
            "id": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:02:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        },
        {
            "id": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:02:1",
                "name": "novi_port_1",
                "port_number": 1,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:5",
                "name": "novi_port_5",
                "port_number": 5,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        }
    ],
    "bandwidth": 0,
    "circuit_scheduler": [],
    "current_path": [],
    "dynamic_backup_path": false,
    "enabled": true,
    "failover_path": [],
    "id": "004fdf5c1c8b41",
    "metadata": {
        "telemetry_request": {},
        "telemetry": {
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "redeployed_link_down_no_path"
            ],
            "status_updated_at": "2024-07-25T16:11:03"
        }
    },
    "name": "evc59",
    "primary_path": [
        {
            "id": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "metadata": {},
            "active": false,
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "deactivated"
            ]
        }
    ],
    "service_level": 0,
    "uni_a": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:01:15"
    },
    "uni_z": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:06:22"
    },
    "sb_priority": null,
    "execution_rounds": 0,
    "owner": null,
    "queue_id": -1,
    "primary_constraints": {},
    "secondary_constraints": {},
    "primary_links": [],
    "backup_links": [],
    "start_date": "2024-07-25T16:00:51",
    "creation_time": "2024-07-25T16:00:51",
    "request_time": "2024-07-25T16:00:51",
    "end_date": null,
    "flow_removed_at": "2024-07-25T16:11:00",
    "updated_at": "2024-07-25T16:11:03"
}
viniarck added the bug and priority_major labels on Jul 25, 2024
@viniarck
Member Author

viniarck commented Jul 25, 2024

It's expected that increasing api_concurrency_limit will make this go away

With api_concurrency_limit and the API thread pool both set to 512, it didn't result in issues.

The default value in kytos.conf as of July 25th is 160.
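
For reference, an illustrative kytos.conf excerpt with the values used in that retest; api_concurrency_limit is the option quoted above, while the thread pool option name and pool keys are assumptions that should be confirmed against the installed kytos version:

    # illustrative kytos.conf excerpt; confirm option names against your kytos version
    api_concurrency_limit = 512     # default was 160 as of Jul 25, 2024
    thread_pool_max_workers = {"sb": 256, "db": 256, "app": 512}
    # the "app" pool is the one whose threads appear as thread_pool_app_* in the logs above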

@viniarck
Member Author

viniarck commented Jul 26, 2024

It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of the error handling)

This needs to be assessed and discussed to see whether it would work reliably, and whether anything else can still be done.

@viniarck
Member Author

It's probably worth considering retries with tenacity, and pacing the method that sends the requests, too.
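
A rough sketch of that suggestion, assuming a hypothetical send_flow_mods helper rather than mef_eline's current request code: tenacity retries HTTP error responses (like the 503s above) with exponential backoff, and a semaphore paces how many requests are in flight at once.

    # Hypothetical sketch of "retries with tenacity + pacing", not mef_eline's code.
    import threading

    import httpx
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    # keep in-flight requests well below flow_manager's api_concurrency_limit
    _pacer = threading.BoundedSemaphore(value=64)


    @retry(
        retry=retry_if_exception_type(httpx.HTTPStatusError),  # e.g. 503 Service Unavailable
        wait=wait_exponential(multiplier=0.5, max=10),          # back off between attempts
        stop=stop_after_attempt(5),                             # then give up
        reraise=True,
    )
    def send_flow_mods(dpid: str, payload: dict) -> httpx.Response:
        """POST flow mods to flow_manager, pacing and retrying on error responses."""
        url = f"http://127.0.0.1:8181/api/kytos/flow_manager/v2/delete/{dpid}"
        with _pacer:  # at most 64 requests in flight at a time
            resp = httpx.post(url, json=payload, timeout=10)
            resp.raise_for_status()  # raises HTTPStatusError -> tenacity retries
            return resp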
