CLI support for SmartSwitch PMON #3271

Merged
merged 181 commits into from
Jan 27, 2025

@rameshraghupathy rameshraghupathy commented Apr 14, 2024

What I did

Enhanced the following CLIs to support SmartSwitch PMON, as described in the PMON HLD: https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md

CLIs:
Added support for the new module "DPUX" to commands 1 and 2 below:
1. "config chassis module startup DPUX", where X ranges from 0 to one less than the number of DPUs in the SmartSwitch chassis
2. "config chassis module shutdown DPUX"
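A minimal sketch of the DPUX naming rule described above, for illustration only. The helper name `is_valid_dpu_name` and the `num_dpus` parameter are hypothetical and are not taken from this PR; the sketch only checks that a name has the form DPUX with X between 0 and the number of DPUs minus 1.

```python
import re

# Hypothetical pattern: "DPU" followed by a decimal index without leading zeros.
_DPU_NAME = re.compile(r"DPU(0|[1-9]\d*)")


def is_valid_dpu_name(name: str, num_dpus: int) -> bool:
    """Return True if name is DPUX with 0 <= X < num_dpus."""
    match = _DPU_NAME.fullmatch(name)
    return match is not None and int(match.group(1)) < num_dpus
```

On a chassis with four DPUs, this accepts DPU0 through DPU3 and rejects DPU4, dpu0, and DPU01.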

Extended the following CLIs to support the new module "DPUX", and added an "all" option that displays the reboot cause of the "NPU" and all "DPU" modules.
1. "show reboot-cause" is unchanged; "show reboot-cause all" is new.
2. "show reboot-cause history" is unchanged; "show reboot-cause history <module-name>" is new, where <module-name> is DPUX or all.
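One way to picture how the <module-name> argument selects devices is sketched below. This is an assumption-laden illustration, not the code in this PR: `resolve_modules` and `num_dpus` are hypothetical names, and the ordering of the returned devices is illustrative only.

```python
def resolve_modules(module_arg: str, num_dpus: int) -> list[str]:
    """Map a <module-name> argument to the devices whose reboot cause is shown."""
    dpus = [f"DPU{i}" for i in range(num_dpus)]
    if module_arg == "all":
        # "all" covers the NPU plus every DPU module.
        return ["NPU"] + dpus
    if module_arg in dpus:
        return [module_arg]
    raise ValueError(f"unknown module name: {module_arg!r}")
```

With two DPUs, "all" resolves to NPU, DPU0, and DPU1, while "DPU1" resolves to DPU1 alone. Calling the command without the argument would bypass this path entirely, which keeps the original output unchanged.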

Added a new subcommand "show system-health dpu <module-name>", where <module-name> is DPUX or all. It provides the additional DPU state details described in the HLD.

How I did it

  1. Kept the original CLI output unaltered
  2. Added subcommands to support SmartSwitch DPUs
  3. Added code in chassisd and in the platform modules.py and chassis.py to support them
  4. Updated the DB tables as described in the PMON HLD

Fixes: sonic-net/sonic-buildimage#21372

How to verify it

  1. Build an image with the required files (refer to the other upstream PRs and the platform PRs)
    Required files:
    - This PR (reboot_cause.py, chassis_modules.py, system_health.py)
    - The other PR (module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json)
    - Platform "platform-cisco-8000" with PMON support (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required DB changes)
  2. Run the CLIs and check the new output

CLI output

root@sonic:~# show reboot-cause
Unknown

root@MtFuji:/home/cisco# show reboot-cause all
Device  Name                 Cause   Time                             User
------  -------------------  ------  -------------------------------  -----
NPU     2024_12_11_01_54_02  reboot  Wed Dec 11 01:48:07 AM UTC 2024  cisco
DPU0    2024_12_11_01_56_45  reboot  Wed Dec 11 01:56:45 AM UTC 2024

cisco@MtFuji-dut:~$ show reboot-cause history
Name                 Cause   Time                             User   Comment
-------------------  ------  -------------------------------  -----  -------
2024_12_19_14_00_33  reboot  Thu Dec 19 01:55:41 PM UTC 2024  cisco  N/A
2024_12_15_08_21_38  reboot  Sun Dec 15 08:16:36 AM UTC 2024  cisco  N/A

cisco@MtFuji-dut:~$ show reboot-cause history all
Device  Name                 Cause         Time                             User   Comment
------  -------------------  ------------  -------------------------------  -----  -------------------
NPU     2024_12_19_14_00_33  reboot        Thu Dec 19 01:55:41 PM UTC 2024  cisco  N/A
NPU     2024_12_15_08_21_38  reboot        Sun Dec 15 08:16:36 AM UTC 2024  cisco  N/A
DPU1    2024_12_19_14_36_47  Non-Hardware  Thu Dec 19 02:36:47 PM UTC 2024         Switch rebooted DPU
DPU0    2024_12_19_14_03_24  reboot        Thu Dec 19 02:03:24 PM UTC 2024         N/A
DPU0    2024_12_17_15_57_29  reboot        Tue Dec 17 03:57:29 PM UTC 2024         N/A

root@sonic:~# show reboot-cause history DPU0
Device  Name                 Cause   Time                             User   Comment
------  -------------------  ------  -------------------------------  -----  -------
DPU0    2024_12_19_14_03_24  reboot  Thu Dec 19 02:03:24 PM UTC 2024         N/A
DPU0    2024_12_17_15_57_29  reboot  Tue Dec 17 03:57:29 PM UTC 2024         N/A
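
In the samples above, the Name column looks like a timestamp in YYYY_MM_DD_HH_MM_SS form. Assuming that reading is right (it is inferred from the output, not stated in the PR), a script consuming the history could order entries like this. `parse_entry_name` and the `rows` tuples are hypothetical, with values copied from the DPU0 sample.

```python
from datetime import datetime


def parse_entry_name(name: str) -> datetime:
    """Parse a history entry name such as 2024_12_19_14_03_24 (assumed format)."""
    return datetime.strptime(name, "%Y_%m_%d_%H_%M_%S")


# (device, name, cause) tuples taken from the DPU0 sample output.
rows = [
    ("DPU0", "2024_12_17_15_57_29", "reboot"),
    ("DPU0", "2024_12_19_14_03_24", "reboot"),
]

# Sort so that the most recent reboot comes first, as in the CLI samples.
newest_first = sorted(rows, key=lambda row: parse_entry_name(row[1]), reverse=True)
```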

@oleksandrivantsiv
Collaborator

Can you please add UT for the new functions?

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Contributor

vvolam commented Jan 6, 2025

@rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@vvolam
Contributor

vvolam commented Jan 7, 2025

> @rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@rameshraghupathy Except this, remaining looks good to me

@rameshraghupathy
Contributor Author

> @rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@vvolam There was a formatting issue due to which the preview wasn't showing "". Fixed it.

vvolam previously approved these changes Jan 7, 2025

LGTM

Contributor

@rameshraghupathy In all these examples, can you specify where the CLI is being executed: on the NPU or on the DPU host?

Contributor Author

@prgeor Added samples in the missing places

show reboot-cause all
```

- Example:
Contributor

@rameshraghupathy please also capture the CLI output specifically when run inside the DPU (even though it's the same as on a fixed chassis)

Contributor Author

@prgeor Added the CLI output when run on DPU

key = key + suffix
keys = chassis_state_db.keys(chassis_state_db.CHASSIS_STATE_DB, key)
if not keys:
    return
Contributor

@rameshraghupathy how will the user know the CLI failed? How will the script know if the return code is not zero?

Contributor Author

@prgeor Added an error message to indicate the DPU_STATE table doesn't exist for the module in the DB.
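
For readers following this thread, here is a rough sketch of the pattern under discussion: report the missing entry and hand a non-zero status back to the caller instead of returning silently. It is not the merged code. `FakeStateDb`, `show_dpu_state`, and the `DPU_STATE|<module>` key layout are assumptions made for illustration.

```python
import sys


class FakeStateDb:
    """Hypothetical in-memory stand-in for the chassis state DB client."""

    def __init__(self, data):
        self._data = data

    def keys(self, prefix):
        # Return every stored key that starts with the given prefix.
        return [key for key in self._data if key.startswith(prefix)]


def show_dpu_state(db, module_name, out=sys.stdout, err=sys.stderr):
    """Print matching DPU_STATE keys and return a shell-style exit code."""
    keys = db.keys(f"DPU_STATE|{module_name}")  # assumed key layout
    if not keys:
        # Tell the user what went wrong and signal failure to calling scripts.
        print(f"Error: no DPU_STATE entry found for {module_name}", file=err)
        return 1
    for key in sorted(keys):
        print(key, file=out)
    return 0
```

A command entry point could pass the returned value to `sys.exit`, so that both an interactive user (through the message) and a calling script (through the exit status) can detect the failure.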

@prgeor prgeor merged commit 97c20cc into sonic-net:master Jan 27, 2025
7 checks passed
Development

Successfully merging this pull request may close these issues.

[Smartswitch][reboot-cause] Invalid reboot cause on First boot
8 participants