
bug: after a link went down with 300+ static EVCs, it didn't converge to backup_path #485

Open
viniarck opened this issue Jul 25, 2024 · 3 comments
Labels: bug (Something isn't working), priority_major (Major priority)

Comments

@viniarck
Member

viniarck commented Jul 25, 2024

Scenario:

  • INT Lab ring topology with 300 static EVCs, each with a primary_path and a backup_path between Novi01 and Novi06
  • Performed a link down on interface 11 of Novi01 (forcing the primary_path to go down)
  • During network convergence, mef_eline sends many delete requests to flow_manager and doesn't handle the resulting errors completely; this is also related to issue #483 (mef_eline high concurrency EVC deletion can leak flows, i.e. leave them installed). Worse, in this case it got stuck without converging to the backup_path, which is really problematic since it means a full outage, and it would only recover on a subsequent network link up event. See below one example of a problematic EVC that didn't converge.

It's expected that increasing api_concurrency_limit will make this go away; I'll run this test again to confirm it behaves as expected. However, this is already a sign that the default value isn't suitable for handling 300+ EVCs, so this needs to be better documented. Finally, mef_eline should handle this error more completely: logging it is the minimum expected, but doing nothing after that is problematic. It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of the error handling); a rough sketch of that idea follows.
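
To make that "mark it for a next deployment" idea concrete, here's a minimal sketch, assuming hypothetical names (Evc, delete_flows, redeploy_pending) rather than mef_eline's actual classes and helpers: on an error from flow_manager the EVC id is remembered and retried later, instead of only being logged.

    # Minimal sketch only, not mef_eline's actual code: Evc, delete_flows and
    # redeploy_pending are hypothetical names used for illustration.
    import logging
    import threading
    import time

    import httpx  # mef_eline already talks to flow_manager over HTTP

    log = logging.getLogger(__name__)


    class Evc:
        def __init__(self, evc_id: str) -> None:
            self.id = evc_id


    # EVC ids whose flow removal/redeploy hit "Service Unavailable" and still need work
    _pending_redeploy: set[str] = set()
    _pending_lock = threading.Lock()


    def delete_flows(evc: Evc, dpid: str, flows: list) -> None:
        """POST a delete to flow_manager; on failure, mark the EVC for a retry."""
        url = f"http://127.0.0.1:8181/api/kytos/flow_manager/v2/delete/{dpid}"
        try:
            resp = httpx.post(url, json={"flows": flows}, timeout=10)
            resp.raise_for_status()  # a 503 ends up here as an HTTPStatusError
        except httpx.HTTPError as exc:
            log.error("Error removing flows from switch %s for EVC %s: %s", dpid, evc.id, exc)
            with _pending_lock:
                _pending_redeploy.add(evc.id)  # don't stop at logging: remember it


    def redeploy_pending(deploy_fn, interval: float = 30.0) -> None:
        """Periodically retry EVCs that failed to converge to the backup_path."""
        while True:
            time.sleep(interval)
            with _pending_lock:
                evc_ids = set(_pending_redeploy)
                _pending_redeploy.clear()
            for evc_id in evc_ids:
                deploy_fn(evc_id)  # e.g. trigger a fresh deploy to the backup_path

Logs captured during the convergence window: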

2024-07-25 13:10:58,456 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,461 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,493 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49008 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 202
2024-07-25 13:10:58,500 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49056 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 202
2024-07-25 13:10:58,508 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49526 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,509 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49558 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,511 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49588 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,513 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_82) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(d4caab37f41d43, evc89): Service Unavailable
2024-07-25 13:10:58,512 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49732 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 503
2024-07-25 13:10:58,515 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_253) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(44e9dd2ce52142, evc254): Service Unavailable
2024-07-25 13:10:58,513 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49758 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,519 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_248) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(f74e8067494148, evc247): Service Unavailable
2024-07-25 13:10:58,520 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_232) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(d97b5c7966994d, evc231): Service Unavailable
2024-07-25 13:10:58,516 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49784 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,522 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_255) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0680b6a67c6d45, evc255): Service Unavailable
2024-07-25 13:10:58,523 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_300) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0bcef7fe92bf44, evc295): Service Unavailable
2024-07-25 13:10:58,517 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49786 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,525 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_265) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(a2ffb85c69ec4d, evc264): Service Unavailable
2024-07-25 13:10:58,527 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_153) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(30b819b1572b4e, evc152): Service Unavailable
2024-07-25 13:10:58,520 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49842 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,530 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_296) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(5bcfdb8283f447, evc299): Service Unavailable
2024-07-25 13:10:58,521 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49936 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,532 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_148) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(5c168c6ec4ca4a, evc143): Service Unavailable

(attached screenshot: 20240725_131549)

{
    "active": false,
    "archived": false,
    "backup_path": [
        {
            "id": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:02:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        },
        {
            "id": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:02:1",
                "name": "novi_port_1",
                "port_number": 1,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:5",
                "name": "novi_port_5",
                "port_number": 5,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        }
    ],
    "bandwidth": 0,
    "circuit_scheduler": [],
    "current_path": [],
    "dynamic_backup_path": false,
    "enabled": true,
    "failover_path": [],
    "id": "004fdf5c1c8b41",
    "metadata": {
        "telemetry_request": {},
        "telemetry": {
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "redeployed_link_down_no_path"
            ],
            "status_updated_at": "2024-07-25T16:11:03"
        }
    },
    "name": "evc59",
    "primary_path": [
        {
            "id": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "metadata": {},
            "active": false,
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "deactivated"
            ]
        }
    ],
    "service_level": 0,
    "uni_a": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:01:15"
    },
    "uni_z": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:06:22"
    },
    "sb_priority": null,
    "execution_rounds": 0,
    "owner": null,
    "queue_id": -1,
    "primary_constraints": {},
    "secondary_constraints": {},
    "primary_links": [],
    "backup_links": [],
    "start_date": "2024-07-25T16:00:51",
    "creation_time": "2024-07-25T16:00:51",
    "request_time": "2024-07-25T16:00:51",
    "end_date": null,
    "flow_removed_at": "2024-07-25T16:11:00",
    "updated_at": "2024-07-25T16:11:03"
}
viniarck added the bug and priority_major labels on Jul 25, 2024
@viniarck
Member Author

viniarck commented Jul 25, 2024

It's expected that increasing api_concurrency_limit will make this go away

With api_concurrency_limit and the API thread pool both set to 512, it didn't result in issues.

The default value in kytos.conf as of July 25th is 160.
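
For reference, an illustrative kytos.conf excerpt with the values used in that retest; api_concurrency_limit is the option quoted above, while the thread pool option name and pool keys are assumptions that should be confirmed against the installed kytos version:

    # illustrative kytos.conf excerpt; confirm option names against your kytos version
    api_concurrency_limit = 512     # default was 160 as of Jul 25, 2024
    thread_pool_max_workers = {"sb": 256, "db": 256, "app": 512}
    # the "app" pool is the one whose threads appear as thread_pool_app_* in the logs above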

@viniarck
Member Author

viniarck commented Jul 26, 2024

It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of the error handling)

This needs to be assessed and discussed to see whether it would work reliably, and whether anything else can still be done.

@viniarck
Member Author

It's probably worth considering retries with tenacity, and pacing the method that sends the requests, too.
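
A rough sketch of that suggestion, assuming a hypothetical send_flow_mods helper rather than mef_eline's current request code: tenacity retries HTTP error responses (like the 503s above) with exponential backoff, and a semaphore paces how many requests are in flight at once.

    # Hypothetical sketch of "retries with tenacity + pacing", not mef_eline's code.
    import threading

    import httpx
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    # keep in-flight requests well below flow_manager's api_concurrency_limit
    _pacer = threading.BoundedSemaphore(value=64)


    @retry(
        retry=retry_if_exception_type(httpx.HTTPStatusError),  # e.g. 503 Service Unavailable
        wait=wait_exponential(multiplier=0.5, max=10),          # back off between attempts
        stop=stop_after_attempt(5),                             # then give up
        reraise=True,
    )
    def send_flow_mods(dpid: str, payload: dict) -> httpx.Response:
        """POST flow mods to flow_manager, pacing and retrying on error responses."""
        url = f"http://127.0.0.1:8181/api/kytos/flow_manager/v2/delete/{dpid}"
        with _pacer:  # at most 64 requests in flight at a time
            resp = httpx.post(url, json=payload, timeout=10)
            resp.raise_for_status()  # raises HTTPStatusError -> tenacity retries
            return resp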
