-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for persistent storage and retrieval of DPU reboot-cause #169
Conversation
scripts/process-reboot-cause
Outdated
@@ -69,6 +70,45 @@ def read_reboot_cause_files_and_save_state_db(): | |||
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i] | |||
os.remove(x) | |||
|
|||
def read_dpu_reboot_cause_files_and_save_chassis_state_db(): | |||
# Get platform using device_info.get_platform() | |||
platform_info = device_info.get_platform_info() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code to query DPUs from the platform.json should be part of src/sonic-py-common/sonic_py_common/device_info.py
. There is an implementation for getting the get_num_dpus, so we can put something like get_dpus_info
. This will prevent a code duplication in all places that need to access DPUs information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.
tests/determine-reboot-cause_test.py
Outdated
|
||
# Assert that makedirs was called for the DPU directories | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0')) | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_dpu_reboot_cause_files_and_save_chassis_state_d
is not covered by the tests. Please new test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv will do but can we defer it as the entire files doe not have a test yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I think comment is addressed and test is added, right? please confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Done. Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.
hardware implementation requirements
466b70d
to
17345aa
Compare
scripts/process-reboot-cause
Outdated
@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db(): | |||
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i] | |||
os.remove(x) | |||
|
|||
def get_dpus(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can we make use of existing APIs? @vvolam ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, get_dpu_info() defined in https://github.com/sonic-net/sonic-buildimage/pull/20724/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor done
scripts/process-reboot-cause
Outdated
try: | ||
# Assuming you have a way to list the files in the directory | ||
files = os.listdir(dpu_history_path) | ||
# Filter and sort the files based on your criteria (e.g., by modification time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy why eg? Are we not sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor cleaned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I mean why eg? You are indeed sorting by modification time so just put "sorted by modification time"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor cleaned
scripts/process-reboot-cause
Outdated
_hash = f"{REBOOT_CAUSE_TABLE_NAME}|{data['gen_time']}" | ||
state_db.set(state_db.STATE_DB, _hash, 'cause', data.get('cause', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'time', data.get('time', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'user', data.get('user', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'comment', data.get('comment', '')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy function says chassis STATE_DB but here its STATE_DB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Fixing the DB
scripts/process-reboot-cause
Outdated
# Get sorted reboot cause files for the DPU | ||
reboot_files = get_sorted_reboot_cause_files(os.path.join(dpu_history_path, "history")) | ||
|
||
for reboot_file in reboot_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy How do we handle a case where NPU comes late so that DPU to NPU mid plane is not UP by the time this process starts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor As shown in the HLD the NPU-chassisd will fetch the reboot-cause from the DPU and persist it.
1dbf35e
to
c28e29c
Compare
@oleksandrivantsiv could you please review and approve |
scripts/process-reboot-cause
Outdated
return [] | ||
|
||
|
||
def read_dpu_reboot_cause_files_and_save_chassis_state_db(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can we break this into two functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor It is already done in two functions get_sorted_reboot_cause_files and saving to chassisStateDB using this function. The name of the function is misleading, so renamed it to save_dpu_reboot_cause_files_to_chassis_state_db()
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@rameshraghupathy can you share the test result on fixed chassis (non-smartswtich). |
@prgeor Done |
Adding support for persistent storage and retrieval of DPU reboot-cause
The output with the new code on a Non-SmartSwitch
root@sonic:/usr/local/bin# show reboot-cause
Power Loss
root@sonic:/usr/local/bin# show reboot-cause all
Usage: show reboot-cause [OPTIONS] COMMAND [ARGS]...
Try "show reboot-cause -h" for help.
Error: No such command "all".
root@sonic:/usr/local/bin# show reboot-cause history
Name Cause Time User Comment
2024_12_05_15_20_16 Power Loss N/A N/A Unknown (First boot of SONiC version azure_cisco_202311.17974-dirty-20241107.074130)
root@sonic:/usr/local/bin# show reboot-cause history all
Usage: show reboot-cause history [OPTIONS]
Try "show reboot-cause history -h" for help.
Error: Got unexpected extra argument (all)
root@sonic:/usr/local/bin# timed out waiting for input: auto-logout
show plat sum
cisco@sonic:~$ sudo su
root@sonic:/home/cisco# show plat sum
Platform: x86_64-8122_64eh_o-r0
HwSKU: Cisco-8122-O128
ASIC: cisco-8000
ASIC Count: 1
Serial Number: FLM2824099A
Model Number: 8122-64EH-O
Hardware Revision: 0.23