-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Maximum Retry Limit for Firmware Upgrades #109
base: develop
Are you sure you want to change the base?
Maximum Retry Limit for Firmware Upgrades #109
Conversation
…ached_instances` tables to CreateTables sql script
…irmware upgrades
…X_RETRY_LIMIT" env var
… is not reachable
…firmware-upgrades
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! A lot for such a simple check (conceptually) but I think the corner cases covered are all necessary additions so I have no complaints. All unit tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me!
resources/sql_scripts/update_scripts/firmware_max_retry_limit_update.sql
Show resolved
Hide resolved
services/addons/tests/firmware_manager/test_firmware_manager.py
Outdated
Show resolved
Hide resolved
…ng upsert statements
…eries Simplified consecutive firmware upgrade failure count tracking
… retries limit for firmware upgrades
…for RSUs during firmware upgrade process
return consecutive_failures >= max_retries | ||
|
||
|
||
def log_max_retries_reached_incident_for_rsu_to_postgres(rsu_ip): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'd make sense to also log the attempted firmare version that was causing the issue? It's possible we'd see repeated failures from various versions and might be nice to have that version in here to know exactly which one failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one question, looking good!
PR Details
Description
Problem
The current firmware manager continuously attempts to upgrade the firmware for an RSU, regardless of how many failures occur. This behavior can clog the upgrade queue and waste CPU resources, especially when an RSU is unreachable.
Solution
To address this, a new environment variable has been introduced to set a maximum retry limit for firmware upgrades per RSU. Each time a firmware upgrade fails, the system will increase a 'consecutive failure count' for that RSU. Once the count reaches the maximum retry limit, the target_firmware_version in the rsus table will be updated to match the current firmware_version, effectively halting further upgrades for that RSU. The failure will also be logged in a new Postgres table.
Additionally, the firmware manager will now skip upgrade attempts for RSUs that have returned a negative ping result.
How Has This Been Tested?
Unit tests have been implemented to verify function calls and behavior of methods. Additionally, the firmware manager was spun up in a local docker network with an ubuntu container serving as a mock RSU with a
signedUpgrade.sh
script in the /bin directory that just printed a hard-coded failure alert. The IP address of the ubuntu container was set as the value for the IP address of a sample RSU and three firmware upgrades were executed manually by calling the relevant endpoint.The firmware manager modified the
target_firmware_version
tofirmware_version
as expected and logged the incident to the 'max_retry_limit_reached_instances' table.Types of changes
Checklist: