Adding support for persistent storage and retrieval of DPU reboot-cause #169

Merged: 79 commits merged on Feb 9, 2025

@rameshraghupathy (Contributor) commented Oct 11, 2024

Adding support for persistent storage and retrieval of DPU reboot-cause

The output with the new code on a non-SmartSwitch:

root@sonic:/usr/local/bin# show reboot-cause
Power Loss
root@sonic:/usr/local/bin# show reboot-cause all
Usage: show reboot-cause [OPTIONS] COMMAND [ARGS]...
Try "show reboot-cause -h" for help.

Error: No such command "all".
root@sonic:/usr/local/bin# show reboot-cause history
Name                 Cause       Time    User    Comment
2024_12_05_15_20_16  Power Loss  N/A     N/A     Unknown (First boot of SONiC version azure_cisco_202311.17974-dirty-20241107.074130)
root@sonic:/usr/local/bin# show reboot-cause history all
Usage: show reboot-cause history [OPTIONS]
Try "show reboot-cause history -h" for help.

Error: Got unexpected extra argument (all)
root@sonic:/usr/local/bin# timed out waiting for input: auto-logout

show plat sum:

cisco@sonic:~$ sudo su
root@sonic:/home/cisco# show plat sum
Platform: x86_64-8122_64eh_o-r0
HwSKU: Cisco-8122-O128
ASIC: cisco-8000
ASIC Count: 1
Serial Number: FLM2824099A
Model Number: 8122-64EH-O
Hardware Revision: 0.23

@@ -69,6 +70,45 @@ def read_reboot_cause_files_and_save_state_db():
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)

def read_dpu_reboot_cause_files_and_save_chassis_state_db():
    # Get platform info using device_info.get_platform_info()
    platform_info = device_info.get_platform_info()

Contributor

The code that queries DPUs from platform.json should live in src/sonic-py-common/sonic_py_common/device_info.py. That module already implements get_num_dpus, so we can add something like get_dpus_info next to it. This avoids duplicating the code in every place that needs DPU information.
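To illustrate the suggestion above, a shared helper could look roughly like the following. This is a minimal sketch, not the code that was merged: `get_dpus_info` takes its name from the suggestion, while `get_dpu_names`, the `platform_json_path` parameter, and the assumption that DPUs are listed under a top-level "DPUS" key in platform.json are illustrative.

```python
import json


def get_dpus_info(platform_json_path):
    """Return the DPU table from a platform.json file, or {} if unavailable.

    Assumption: DPUs are listed under a top-level "DPUS" key, keyed by
    DPU name (for example "dpu0").
    """
    try:
        with open(platform_json_path) as f:
            platform_data = json.load(f)
    except (OSError, ValueError):
        # Missing or malformed file: treat the platform as having no DPUs.
        return {}
    if not isinstance(platform_data, dict):
        return {}
    dpus = platform_data.get("DPUS", {})
    return dpus if isinstance(dpus, dict) else {}


def get_dpu_names(platform_json_path):
    """Return the DPU names in sorted order."""
    return sorted(get_dpus_info(platform_json_path))
```

Callers that only need the names can use `get_dpu_names(path)`; a platform without a DPU section simply yields an empty list, which keeps non-SmartSwitch platforms on the same code path.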

Contributor Author

Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.


# Assert that makedirs was called for the DPU directories
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0'))
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1'))

Contributor

read_dpu_reboot_cause_files_and_save_chassis_state_db is not covered by the tests. Please add a new test for it.

Contributor Author

@oleksandrivantsiv Will do, but can we defer it, since the entire file does not have a test yet?

Contributor

@rameshraghupathy I think the comment is addressed and the test is added, right? Please confirm.

Contributor Author

@rameshraghupathy, Jan 8, 2025

@oleksandrivantsiv Done. Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.
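For readers following this thread, the style of test discussed here can be sketched as below. It is a hedged illustration only: `create_dpu_dirs` and `test_create_dpu_dirs` are hypothetical stand-ins, not the functions in process-reboot-cause_test.py, and only the `/host/reboot-cause/module` path is taken from the assertions quoted above.

```python
import os
from unittest import mock

MODULE_DIR = '/host/reboot-cause/module'  # path taken from the quoted assertions


def create_dpu_dirs(dpu_names):
    """Hypothetical code under test: create one directory per DPU."""
    for name in dpu_names:
        path = os.path.join(MODULE_DIR, name)
        if not os.path.exists(path):
            os.makedirs(path)


def test_create_dpu_dirs():
    # Patch the filesystem calls so the test never touches the real /host tree.
    with mock.patch('os.path.exists', return_value=False), \
         mock.patch('os.makedirs') as mock_makedirs:
        create_dpu_dirs(['dpu0', 'dpu1'])

    # One makedirs call is expected for each DPU directory.
    mock_makedirs.assert_any_call(os.path.join(MODULE_DIR, 'dpu0'))
    mock_makedirs.assert_any_call(os.path.join(MODULE_DIR, 'dpu1'))
    assert mock_makedirs.call_count == 2
```

Patching `os.makedirs` and `os.path.exists` keeps the test independent of the host filesystem, which matters because `/host` normally exists only on a SONiC device.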

@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db():
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)

def get_dpus():

Contributor

@rameshraghupathy can we make use of existing APIs? @vvolam ?

Contributor Author

@rameshraghupathy, Nov 25, 2024

@prgeor Sure, will do. This depends on @vvolam's commit, so I will do it once that is committed.

Contributor Author

@prgeor done

try:
    # Assuming you have a way to list the files in the directory
    files = os.listdir(dpu_history_path)
    # Filter and sort the files based on your criteria (e.g., by modification time)

Contributor

@rameshraghupathy why eg? Are we not sure?

Contributor Author

@prgeor cleaned

Contributor

@rameshraghupathy I mean why eg? You are indeed sorting by modification time so just put "sorted by modification time"

Contributor Author

@prgeor cleaned
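As a point of reference for this thread, listing a directory and sorting its files by modification time can be sketched as follows. This is an assumption-laden illustration rather than the merged code: the function name `list_files_by_mtime` is hypothetical, and the newest-first ordering is a choice made here, not something the PR states.

```python
import os


def list_files_by_mtime(directory):
    """Return full paths of regular files in directory, newest first.

    Hypothetical helper; the ordering (newest first) is an assumption.
    """
    try:
        names = os.listdir(directory)
    except OSError:
        # Directory missing or unreadable: nothing to process.
        return []
    paths = [os.path.join(directory, name) for name in names]
    files = [p for p in paths if os.path.isfile(p)]
    # Sorted by modification time, most recent first.
    return sorted(files, key=os.path.getmtime, reverse=True)
```

Returning an empty list for a missing directory lets the caller loop over the result without a separate existence check.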

Comment on lines 141 to 145
_hash = f"{REBOOT_CAUSE_TABLE_NAME}|{data['gen_time']}"
state_db.set(state_db.STATE_DB, _hash, 'cause', data.get('cause', ''))
state_db.set(state_db.STATE_DB, _hash, 'time', data.get('time', ''))
state_db.set(state_db.STATE_DB, _hash, 'user', data.get('user', ''))
state_db.set(state_db.STATE_DB, _hash, 'comment', data.get('comment', ''))

Contributor

@rameshraghupathy The function name says chassis state DB, but here it's STATE_DB?

Contributor Author

@rameshraghupathy, Nov 25, 2024

@prgeor Fixing the DB
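The point raised in this thread, that the target database should be explicit rather than hard-coded, can be sketched like this. Everything here is illustrative: `FakeConnector` is an in-memory stand-in and not the real SONiC connector, `save_reboot_cause` is a hypothetical helper, and the value `"REBOOT_CAUSE"` for `REBOOT_CAUSE_TABLE_NAME` is an assumption.

```python
REBOOT_CAUSE_TABLE_NAME = "REBOOT_CAUSE"  # assumed value of the constant quoted above
FIELDS = ("cause", "time", "user", "comment")


class FakeConnector:
    """In-memory stand-in for the DB connector; NOT the real SONiC API."""
    STATE_DB = "STATE_DB"
    CHASSIS_STATE_DB = "CHASSIS_STATE_DB"

    def __init__(self):
        self.data = {}

    def set(self, db_name, key, field, value):
        # Store the value as data[db_name][key][field].
        self.data.setdefault(db_name, {}).setdefault(key, {})[field] = value


def save_reboot_cause(db, db_name, data):
    """Write one reboot-cause record into the database named db_name."""
    key = f"{REBOOT_CAUSE_TABLE_NAME}|{data['gen_time']}"
    for field in FIELDS:
        # Missing fields are stored as empty strings, as in the quoted lines.
        db.set(db_name, key, field, data.get(field, ''))
    return key


db = FakeConnector()
record = {"gen_time": "2024_12_05_15_20_16", "cause": "Power Loss"}
key = save_reboot_cause(db, db.CHASSIS_STATE_DB, record)
```

Passing the database name as a parameter makes a mismatch like the one flagged above visible at the call site instead of being buried in the loop body.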

# Get sorted reboot cause files for the DPU
reboot_files = get_sorted_reboot_cause_files(os.path.join(dpu_history_path, "history"))

for reboot_file in reboot_files:

Contributor

@rameshraghupathy How do we handle the case where the NPU comes up late, so that the DPU-to-NPU midplane is not up by the time this process starts?

Contributor Author

@prgeor As shown in the HLD, the NPU-chassisd will fetch the reboot-cause from the DPU and persist it.

@kperumalbfn

@oleksandrivantsiv could you please review and approve

return []


def read_dpu_reboot_cause_files_and_save_chassis_state_db():

Contributor

@rameshraghupathy can we break this into two functions?

Contributor Author

@prgeor It is already split into two functions: get_sorted_reboot_cause_files, and this function, which saves to the chassis state DB. The function name was misleading, so I renamed it to save_dpu_reboot_cause_files_to_chassis_state_db().

@mssonicbld

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor (Contributor) commented Feb 7, 2025

@rameshraghupathy Can you share the test result on a fixed chassis (non-SmartSwitch)? Please update the result in the PR description.

@rameshraghupathy (Contributor Author)

@prgeor Done

@prgeor merged commit bb0a31c into sonic-net:master on Feb 9, 2025
5 checks passed
8 participants