CLI support for SmartSwitch PMON #3271
Can you please add UT for the new functions?
addressed. 2. The DPU reboot-cause data is fetched directly from the chassis_state_db now
temporarily bypassing the check
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@rameshraghupathy could you check the description once, as the command and sub-command mentioned are the same? Please make corrections as needed.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@rameshraghupathy Except this, remaining looks good to me
@vvolam There was a formatting issue due to which the preview wasn't showing "". Fixed it.
LGTM
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@rameshraghupathy in all these sample examples, can you specify where the CLI is being executed: on the NPU or on the DPU host?
@prgeor Added samples, in missing places
show reboot-cause all
```

- Example:
@rameshraghupathy please also capture the CLI output specifically when run inside the DPU (even though it's the same as on a fixed chassis)
@prgeor Added the CLI output when run on DPU
key = key + suffix
keys = chassis_state_db.keys(chassis_state_db.CHASSIS_STATE_DB, key)
if not keys:
    return
@rameshraghupathy how will the user know the CLI failed? How will the script know if the return code is not zero?
@prgeor Added an error message to indicate the DPU_STATE table doesn't exist for the module in the DB.
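The exchange above is about the silent `return` when the key lookup comes back empty. A minimal sketch of one way to surface that failure to both the user and a calling script follows. It assumes a hypothetical in-memory stand-in for the database (`FakeChassisStateDb`), an assumed key layout (`DPU_STATE|<module>`), and hypothetical helper names (`get_dpu_state_keys`, `run`); the code and message text actually merged in the PR may differ.

```python
import fnmatch
import sys


class FakeChassisStateDb:
    """Hypothetical in-memory stand-in for the DB connector (not the real API)."""

    CHASSIS_STATE_DB = "CHASSIS_STATE_DB"  # label only; unused by this fake

    def __init__(self, entries):
        self._entries = dict(entries)

    def keys(self, db_name, pattern):
        # Mimic a glob-style key lookup over the stored entries.
        return sorted(k for k in self._entries if fnmatch.fnmatchcase(k, pattern))


def get_dpu_state_keys(db, module_name):
    """Return the matching keys, or None after printing an error to stderr."""
    key = "DPU_STATE|" + module_name  # assumed key layout
    keys = db.keys(db.CHASSIS_STATE_DB, key)
    if not keys:
        # Report the problem instead of returning silently.
        print("DPU_STATE table is not present for module " + module_name,
              file=sys.stderr)
        return None
    return keys


def run(db, module_name):
    """Return an exit status: 0 on success, 1 when no state was found."""
    keys = get_dpu_state_keys(db, module_name)
    if keys is None:
        return 1
    for key in keys:
        print(key)
    return 0
```

A command entry point could pass the result of `run(...)` to `sys.exit`, so that a wrapping script sees a non-zero status when the table is missing.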
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
What I did
Enhanced the following CLIs to support SmartSwitch PMON as described in the PMON HLD documentation "https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md"
CLIs:
Added support for the new module "DPUX" to commands 1 and 2 below:
1. "config chassis module startup DPUX", where X ranges from 0 to one less than the number of DPUs in the SmartSwitch chassis
2. "config chassis module shutdown DPUX"
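The naming rule for "DPUX" can be illustrated with a short validation sketch. The function name `parse_dpu_index` and the `num_dpus` parameter are hypothetical and used only for illustration; this is not the validation code from the PR.

```python
import re


def parse_dpu_index(module_name, num_dpus):
    """Return X for a name of the form DPUX with 0 <= X < num_dpus.

    Raises ValueError for any other name or an out-of-range index.
    """
    match = re.fullmatch(r"DPU([0-9]+)", module_name)
    if match is None:
        raise ValueError("not a DPU module name: " + repr(module_name))
    index = int(match.group(1))
    if index >= num_dpus:
        raise ValueError(
            "DPU index %d out of range (0..%d)" % (index, num_dpus - 1))
    return index
```

For a chassis with four DPUs, `parse_dpu_index("DPU3", 4)` returns 3, while `"DPU4"` is rejected as out of range.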
Extended the following CLIs to support the new module "DPUX", and added an "all" option that displays the reboot cause of the NPU and all DPU modules:
1. "show reboot-cause" remains unchanged; added "show reboot-cause all"
2. "show reboot-cause history" remains unchanged; added "show reboot-cause history <module-name>", where <module-name> is DPUX or all
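The selection behavior of the optional `<module-name>` argument can be sketched as a simple filter. The record layout (dicts with a `device` field) and the function name `filter_reboot_history` are assumptions made for illustration; the PR reads this data from the database rather than from an in-memory list.

```python
def filter_reboot_history(records, module_name=None):
    """Select reboot-cause history records for display.

    module_name=None  -> NPU records only (the unchanged legacy command)
    module_name="all" -> NPU and every DPU
    module_name="DPUX" -> that DPU only
    """
    if module_name is None:
        return [r for r in records if r["device"] == "NPU"]
    if module_name == "all":
        return list(records)
    return [r for r in records if r["device"] == module_name]


# Illustrative records in the assumed layout.
records = [
    {"device": "NPU", "name": "2024_12_19_14_00_33", "cause": "reboot"},
    {"device": "DPU1", "name": "2024_12_19_14_36_47", "cause": "Non-Hardware"},
    {"device": "DPU0", "name": "2024_12_19_14_03_24", "cause": "reboot"},
]
```

With these records, `filter_reboot_history(records, "DPU0")` keeps only the DPU0 entry, and `filter_reboot_history(records, "all")` keeps all three.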
Added a new subcommand "show system-health dpu <module-name>", where <module-name> is DPUX or all. This subcommand provides the additional DPU state details described in the HLD.
How I did it
Fixes: sonic-net/sonic-buildimage#21372
How to verify it
Required files:
- This PR (including reboot_cause.py, chassis_modules.py, system_health.py)
- The other PR (including module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json)
- Platform "platform-cisco-8000" supporting PMON (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required DB changes)
CLI output
root@sonic:~# show reboot-cause
Unknown
root@MtFuji:/home/cisco# show reboot-cause all
Device  Name                 Cause   Time                             User
NPU     2024_12_11_01_54_02  reboot  Wed Dec 11 01:48:07 AM UTC 2024  cisco
DPU0    2024_12_11_01_56_45  reboot  Wed Dec 11 01:56:45 AM UTC 2024
cisco@MtFuji-dut:~$ show reboot-cause history
Name                 Cause   Time                             User   Comment
2024_12_19_14_00_33  reboot  Thu Dec 19 01:55:41 PM UTC 2024  cisco  N/A
2024_12_15_08_21_38  reboot  Sun Dec 15 08:16:36 AM UTC 2024  cisco  N/A
cisco@MtFuji-dut:~$ show reboot-cause history all
Device  Name                 Cause         Time                             User   Comment
NPU     2024_12_19_14_00_33  reboot        Thu Dec 19 01:55:41 PM UTC 2024  cisco  N/A
NPU     2024_12_15_08_21_38  reboot        Sun Dec 15 08:16:36 AM UTC 2024  cisco  N/A
DPU1    2024_12_19_14_36_47  Non-Hardware  Thu Dec 19 02:36:47 PM UTC 2024         Switch rebooted DPU
DPU0    2024_12_19_14_03_24  reboot        Thu Dec 19 02:03:24 PM UTC 2024         N/A
DPU0    2024_12_17_15_57_29  reboot        Tue Dec 17 03:57:29 PM UTC 2024         N/A
root@sonic:~# show reboot-cause history DPU0
Device  Name                 Cause   Time                             User   Comment
DPU0    2024_12_19_14_03_24  reboot  Thu Dec 19 02:03:24 PM UTC 2024         N/A
DPU0    2024_12_17_15_57_29  reboot  Tue Dec 17 03:57:29 PM UTC 2024         N/A