Commit 97c20cc

CLI support for SmartSwitch PMON (#3271)
* CLI support for SmartSwitch PMON * imad minor fixes * Did some cleanup for backward compatibility * removed the column wrapping * Made it backward compatible and removed textwrap and added ut to PR * 1. There was a duplication of part of a function and that has been addressed. 2. The DPU reboot-cause data is fetched directly fromn the chassis_state_db now * reboot_cause and system_health are obtained directly from chassisStateDB now * The expected and result are the same but the test is throwing an error, temporarily bypassing the check * Let us get the build going and then look into the test mockup * Implemented as per the pmon hld, also made some improvements in the implementation * Fixed the key for CHASSIS_MODULE_INFO_TABLE entries * Fixed "show reboot-cause all" and "show reboot-cause history all" * Addressing review comments * Checking if the test issue still exists * Resolving SA errors triggered due to reboot_cause_test * Resolved pre-commit issues * Resolved pre-commit issues * Improving coverage * Fixed SA related warnings * Did some cleanup * Minor improvements and fixes * Adding tests for system health * Adding more system health related tests * Fixed a minor issue * Fixed long line SA issue * Trying to please SA * Trying to improve coverage * import mock * Fixed a typo * mocking DB * Fixed syntax issues * DB mock fix * removed unused import * creating ut for dpu state * Improving coverage * Fixed a typo * Adjusted the reboot-cause key as per the updated hld * Added fix to gracefully handle sytem health DB keys not present case * Addressed minor review comments * Addressed review comments. Commented out system-health support until phase:2 * Resolved minor issues and SA failures * Added role to PORT table in config_db. Using role to differentiate npu-dpu data plane connection in SmartSwitch with Dpc being the role. Did a minor cleanup. 
* Resolving pre-commit check error related to line > 120 * Trying to avoid pre-commit issues * Testing SA and precommit checks * Making it backward compatible * Resolving column size and whitespace issue * Working on SA issue * Testing SA and UT * Added 2 spaces before inline comment * Enabling "show system-health dpu" cli alone. The rest of the dpu health is differed for now. * Fixed SA issues * Adde new line at EOF * Enabling the UT for the CLI "show system-health dpu" * Resolved SA issues * Resolved a SA issue * Added smartswitch specific "reboot-cause" and "reboot-cause history" CLI extensions * Removed the phase:2 related system-health cli extensions as a seperate PR will be raised eventually for phase:2 * Using smartswitch qualifier for the clie extensions * Fixed SA issues * mocking device_info for test cases * import patch in tests * Debugging test failure * Fixing SA issues * fixing sa issues * Debugging sa issues * trying to resolve sa issues * fixed indentation * debugging * debugging * debugging * debugging * Debugging * debugging * debugging * Debugging * Debugging * Debuggingg * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debuggingg * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Debugging * Removing the test to build an image * Removed mock import * Improving coverage * pleasing SA * Fixing tests for design changes as per review comments * Resolving test failure * fixed indentation * cleaned up the test case * Addressed review comments in Command-Reference.md and trying to improve coverage * Improving coverage * Fixed a test issue * Addressed review comments * Addressed review comment. Reading DPUs list from config_db.json * Improving coverage * Resolved SA error * Trying to improve coverage. 
Also, reading from platform.json * adding json import in the test * Fixed a test failure * Fixed SA error * Exercising the new function in test * Removed a blank line * fixing mock issue * Trying a different approach * working on coverage * debugging * debugging * Debugging * Increasing coverage * improving coverage * Adjusting the show cli implementation to align with the reboot-cause changes such as 1. STATE_DB vs CHASSIS_STATE_DB and the key info * Fixing a minor issue * Removed ID column from the "show system-health dpu DPUx" cli as per the new requirement * Addressed default dpu admin status for dark-mode and seamless migration to lightup mode * Resolving SA issue * Resolved a typo * Added checks to see if module_name is valid in the "config chassis modules startup DPUx" cli aand also moved all the required utilities to the common file * Fixed white space issues * Cleaned unwanted import * Fixed build issues * missedout the fixes in a couple of files * With the recent code the app_db multi_asic.PORT_ROLE is Dpc for DPU ports, earlier this was not the case. So removing the additional check. * As the port role issue is no longer seen in smartswitch, cleaning up the related chnages. 
* Using the verbose define for TYPE_DPC in the CLI, if there is a specific requirement to keep 'TYPE_DPC = Dpc", which is the role, then we will revert it * Reverting intfutil_test.py * Using the common API to get_dpu_list * Removed unused import json * Addressed review comments * Did some minor cleanp * Fix: SA error * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Addressed review comments * Added fix for issue:21372 - Device name column shows NPU instead of module name * Added fix for issue:21372 - Fixing the device name colum in the cli output * Added a few review comments
1 parent 752c3d4 commit 97c20cc

8 files changed: +616 -46 lines changed

config/chassis_modules.py

+24 -6
@@ -5,6 +5,7 @@
 import re
 import subprocess
 import utilities_common.cli as clicommon
+from utilities_common.chassis import is_smartswitch, get_all_dpus
 
 TIMEOUT_SECS = 10
 
@@ -27,7 +28,10 @@ def get_config_module_state(db, chassis_module_name):
     config_db = db.cfgdb
     fvs = config_db.get_entry('CHASSIS_MODULE', chassis_module_name)
     if not fvs:
-        return 'up'
+        if is_smartswitch():
+            return 'down'
+        else:
+            return 'up'
     else:
         return fvs['admin_status']
 
@@ -102,16 +106,21 @@ def fabric_module_set_admin_status(db, chassis_module_name, state):
 #
 @modules.command('shutdown')
 @clicommon.pass_db
-@click.argument('chassis_module_name', metavar='<module_name>', required=True)
+@click.argument('chassis_module_name',
+                metavar='<module_name>',
+                required=True,
+                type=click.Choice(get_all_dpus(), case_sensitive=False) if is_smartswitch() else str
+                )
 def shutdown_chassis_module(db, chassis_module_name):
     """Chassis-module shutdown of module"""
     config_db = db.cfgdb
     ctx = click.get_current_context()
 
     if not chassis_module_name.startswith("SUPERVISOR") and \
        not chassis_module_name.startswith("LINE-CARD") and \
-       not chassis_module_name.startswith("FABRIC-CARD"):
-        ctx.fail("'module_name' has to begin with 'SUPERVISOR', 'LINE-CARD' or 'FABRIC-CARD'")
+       not chassis_module_name.startswith("FABRIC-CARD") and \
+       not chassis_module_name.startswith("DPU"):
+        ctx.fail("'module_name' has to begin with 'SUPERVISOR', 'LINE-CARD', 'FABRIC-CARD', 'DPU'")
 
     # To avoid duplicate operation
     if get_config_module_state(db, chassis_module_name) == 'down':
@@ -130,7 +139,11 @@ def shutdown_chassis_module(db, chassis_module_name):
 #
 @modules.command('startup')
 @clicommon.pass_db
-@click.argument('chassis_module_name', metavar='<module_name>', required=True)
+@click.argument('chassis_module_name',
+                metavar='<module_name>',
+                required=True,
+                type=click.Choice(get_all_dpus(), case_sensitive=False) if is_smartswitch() else str
+                )
 def startup_chassis_module(db, chassis_module_name):
     """Chassis-module startup of module"""
     config_db = db.cfgdb
@@ -142,7 +155,12 @@ def startup_chassis_module(db, chassis_module_name):
         return
 
     click.echo("Starting up chassis module {}".format(chassis_module_name))
-    config_db.set_entry('CHASSIS_MODULE', chassis_module_name, None)
+    if is_smartswitch():
+        fvs = {'admin_status': 'up'}
+        config_db.set_entry('CHASSIS_MODULE', chassis_module_name, fvs)
+    else:
+        config_db.set_entry('CHASSIS_MODULE', chassis_module_name, None)
+
     if chassis_module_name.startswith("FABRIC-CARD"):
         if not check_config_module_state_with_timeout(ctx, db, chassis_module_name, 'up'):
             fabric_module_set_admin_status(db, chassis_module_name, 'up')
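
Reviewer note: the change to config/chassis_modules.py flips the implicit admin state on SmartSwitch. A module with no CHASSIS_MODULE entry is treated as 'down' there (and 'up' elsewhere), so startup has to write an explicit 'up' entry instead of deleting the entry. The sketch below models that behaviour with a plain dict standing in for config_db; `config_module_state`, `startup_entry` and `apply_startup` are hypothetical names used only for illustration, and the `smartswitch` flag replaces the real `is_smartswitch()` call.

```python
def config_module_state(entry, smartswitch):
    """Admin state for a module, given its CHASSIS_MODULE entry (None or {} if absent)."""
    if not entry:
        # No entry: SmartSwitch modules default to 'down', other chassis to 'up'.
        return "down" if smartswitch else "up"
    return entry["admin_status"]


def startup_entry(smartswitch):
    """Value passed to set_entry() on startup; None means 'delete the entry'."""
    return {"admin_status": "up"} if smartswitch else None


def apply_startup(config, module, smartswitch):
    """Apply the startup write to a dict that stands in for the CHASSIS_MODULE table."""
    value = startup_entry(smartswitch)
    if value is None:
        config.pop(module, None)  # deleting the entry falls back to the 'up' default
    else:
        config[module] = value


# SmartSwitch: a DPU without an entry is down until it is explicitly started.
smartswitch_cfg = {}
before = config_module_state(smartswitch_cfg.get("DPU0"), smartswitch=True)
apply_startup(smartswitch_cfg, "DPU0", smartswitch=True)
after = config_module_state(smartswitch_cfg.get("DPU0"), smartswitch=True)

# Modular chassis: startup removes the 'down' entry and the default 'up' applies.
chassis_cfg = {"LINE-CARD0": {"admin_status": "down"}}
apply_startup(chassis_cfg, "LINE-CARD0", smartswitch=False)
chassis_after = config_module_state(chassis_cfg.get("LINE-CARD0"), smartswitch=False)
```

Under these assumptions both platforms end in the 'up' state after startup, but only the SmartSwitch path leaves an entry behind in the table.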

doc/Command-Reference.md

+136 -6
@@ -713,11 +713,52 @@ This command displays the cause of the previous reboot
 ```
 
 - Example:
+### Shown below is the output of the CLI when executed on the NPU
 ```
 admin@sonic:~$ show reboot-cause
 User issued reboot command [User: admin, Time: Mon Mar 25 01:02:03 UTC 2019]
 ```
+### Shown below is the output of the CLI when executed on the DPU
+```
+admin@sonic:~$ show reboot-cause
+reboot
+```
+```
+Note: The CLI extensions shown in this block are applicable only to smartswitch platforms. When these extensions are used on a regular switch the extension will be ignored and the output will be the same irrespective of the options.
+
+CLI Extensions Applicable to Smartswtich
+- show reboot-cause all
+- show reboot-cause history all
+- show reboot-cause history DPUx
+```
+**show reboot-cause all**
+
+This command displays the cause of the previous reboot for the Switch and the DPUs for which the midplane interfaces are up.
+
+- Usage:
+```
+show reboot-cause all
+```
+
+- Example:
+### Shown below is the output of the CLI when executed on the NPU
+```
+root@MtFuji:/home/cisco# show reboot-cause all
+Device    Name                 Cause         Time                             User
+--------  -------------------  ------------  -------------------------------  ------
+NPU       2025_01_21_09_01_11  Power Loss    N/A                              N/A
+DPU1      2025_01_21_09_03_43  Non-Hardware  Tue Jan 21 09:03:43 AM UTC 2025
+DPU0      2025_01_21_09_03_37  Non-Hardware  Tue Jan 21 09:03:37 AM UTC 2025
 
+```
+### Shown below is the output of the CLI when executed on the DPU
+```
+root@sonic:/home/admin# show reboot-cause all
+Usage: show reboot-cause [OPTIONS] COMMAND [ARGS]...
+Try "show reboot-cause -h" for help.
+
+Error: No such command "all".
+```
 **show reboot-cause history**
 
 This command displays the history of the previous reboots up to 10 entry
@@ -728,15 +769,74 @@ This command displays the history of the previous reboots up to 10 entry
 ```
 
 - Example:
+### Shown below is the output of the CLI when executed on the NPU
 ```
-admin@sonic:~$ show reboot-cause history
-Name                 Cause        Time                          User    Comment
--------------------  -----------  ----------------------------  ------  ---------
-2020_10_09_02_33_06  reboot       Fri Oct  9 02:29:44 UTC 2020  admin
+root@MtFuji:/home/cisco# show reboot-cause history
+Name                 Cause       Time    User    Comment
+-------------------  ----------  ------  ------  ----------------------------------------------------------------------------------
+2020_10_09_02_40_11  Power Loss  Fri Oct  9 02:40:11 UTC 2020  N/A     Unknown (First boot of SONiC version azure_cisco_master.308-dirty-20250120.220704)
 2020_10_09_01_56_59  reboot       Fri Oct  9 01:53:49 UTC 2020  admin
-2020_10_09_02_00_53  fast-reboot  Fri Oct  9 01:58:04 UTC 2020  admin
-2020_10_09_04_53_58  warm-reboot  Fri Oct  9 04:51:47 UTC 2020  admin
 ```
+### Shown below is the output of the CLI when executed on the DPU
+```
+root@sonic:/home/admin# show reboot-cause history
+Name                 Cause    Time                             User    Comment
+-------------------  -------  -------------------------------  ------  ---------
+2025_01_21_16_49_20  Unknown  N/A                              N/A     N/A
+2025_01_17_11_25_58  reboot   Fri Jan 17 11:23:24 AM UTC 2025  admin   N/A
+```
+**show reboot-cause history all**
+
+This command displays the history of the previous reboots up to 10 entry of the Switch and the DPUs for which the midplane interfaces are up.
+
+- Usage:
+```
+show reboot-cause history all
+```
+
+- Example:
+### Shown below is the output of the CLI when executed on the NPU
+```
+root@MtFuji:~# show reboot-cause history all
+Device    Name                 Cause                                      Time                             User    Comment
+--------  -------------------  -----------------------------------------  -------------------------------  ------  -------
+NPU       2024_07_23_23_06_57  Kernel Panic                               Tue Jul 23 11:02:27 PM UTC 2024  N/A     N/A
+NPU       2024_07_23_11_21_32  Power Loss                                 N/A                              N/A     Unknown
+```
+### Shown below is the output of the CLI when executed on the DPU
+```
+root@sonic:/home/admin# show reboot-cause history all
+Usage: show reboot-cause history [OPTIONS]
+Try "show reboot-cause history -h" for help.
+
+Error: Got unexpected extra argument (all)
+```
+**show reboot-cause history DPU1**
+
+This command displays the history of the previous reboots up to 10 entry of DPU1. If DPU1 is powered down then there won't be any data in the DB and the "show reboot-cause history DPU1" output will be blank.
+
+- Usage:
+```
+show reboot-cause history DPU1
+```
+
+- Example:
+### Shown below is the output of the CLI when executed on the NPU
+```
+root@MtFuji:~# show reboot-cause history DPU1
+Device    Name    Cause                                      Time    User    Comment
+--------  ------  -----------------------------------------  ------  ------  ---------
+DPU1      DPU1    Software causes (Hardware watchdog reset)  N/A     N/A     N/A
+```
+### Shown below is the output of the CLI when executed on the DPU
+```
+root@sonic:/home/admin# show reboot-cause history DPU1
+Usage: show reboot-cause history [OPTIONS]
+Try "show reboot-cause history -h" for help.
+
+Error: Got unexpected extra argument (DPU1)
+```
+
 
 **show uptime**
 
@@ -11348,6 +11448,36 @@ In addition, displays a list of all current 'Services' and 'Hardware' being moni
 psu.voltage Ignored Device
 ```
 
+**show system-health dpu <option>**
+
+This is a smartswitch specific cli. This cli shows the midplane, control plane and data plane health of the DPU modules in the smartswitch.
+
+This can take two forms of "<option>" 1. DPU module name (ex: DPU0) 2. all, which will list all the DPUs in the smartswitch
+
+- Usage:
+```
+show system-health dpu DPU0
+```
+
+- Example:
+```
+root@MtFuji-dut:/home/cisco# show system-health dpu DPU0
+Name    Oper-Status    State-Detail             State-Value    Time                             Reason
+------  -------------  -----------------------  -------------  -------------------------------  ------------------------------------------------------------------------------------
+DPU0    Online         dpu_midplane_link_state  up             Mon Dec 23 05:12:17 PM UTC 2024
+                       dpu_control_plane_state  up             Mon Dec 23 05:12:17 PM UTC 2024  All containers are up and running, host-ethlink-status: Uplink1/1 is UP
+                       dpu_data_plane_state     up             Mon Dec 23 05:12:17 PM UTC 2024  DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
+
+root@MtFuji-dut:/home/cisco# show system-health dpu all
+Name    Oper-Status    State-Detail             State-Value    Time                             Reason
+------  -------------  -----------------------  -------------  -------------------------------  ------------------------------------------------------------------------------------
+DPU0    Online         dpu_midplane_link_state  up             Mon Dec 23 05:12:17 PM UTC 2024
+                       dpu_control_plane_state  up             Mon Dec 23 05:12:17 PM UTC 2024  All containers are up and running, host-ethlink-status: Uplink1/1 is UP
+                       dpu_data_plane_state     up             Mon Dec 23 05:12:17 PM UTC 2024  DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
+DPU1    Online         dpu_midplane_link_state  up             Mon Dec 23 05:12:17 PM UTC 2024
+                       dpu_control_plane_state  up             Mon Dec 23 05:12:17 PM UTC 2024  All containers are up and running, host-ethlink-status: Uplink1/1 is UP
+                       dpu_data_plane_state     up             Mon Dec 23 05:12:17 PM UTC 2024  DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
+
 Go Back To [Beginning of the document](#) or [Beginning of this section](#System-Health)
 
 ## VLAN & FDB
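
Reviewer note: the documentation added above says `show system-health dpu <option>` accepts either a DPU module name or `all`. The sketch below shows one way such an option could be expanded into a list of modules. `resolve_dpu_option` is a hypothetical helper, not a function from this commit, and the case-insensitive matching is an assumption borrowed from the `click.Choice(..., case_sensitive=False)` usage in config/chassis_modules.py.

```python
def resolve_dpu_option(option, known_dpus):
    """Expand 'all' to every known DPU, or validate a single module name.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    if option.lower() == "all":
        return list(known_dpus)
    wanted = option.upper()
    if wanted not in known_dpus:
        raise ValueError("unknown DPU module: {}".format(option))
    return [wanted]


# Hypothetical module list, as a platform might report it.
dpus = ["DPU0", "DPU1"]

every_dpu = resolve_dpu_option("all", dpus)
single_dpu = resolve_dpu_option("dpu0", dpus)

try:
    resolve_dpu_option("DPU7", dpus)
    rejected = False
except ValueError:
    rejected = True
```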

show/chassis_modules.py

+8 -4
@@ -2,6 +2,7 @@
 from natsort import natsorted
 from tabulate import tabulate
 from swsscommon.swsscommon import SonicV2Connector
+from utilities_common.chassis import is_smartswitch
 
 import utilities_common.cli as clicommon
 from sonic_py_common import multi_asic
@@ -40,11 +41,11 @@ def status(db, chassis_module_name):
     state_db = SonicV2Connector(host="127.0.0.1")
     state_db.connect(state_db.STATE_DB)
 
-    key_pattern = '*'
+    key_pattern = CHASSIS_MODULE_INFO_TABLE + '|*'
     if chassis_module_name:
-        key_pattern = '|' + chassis_module_name
+        key_pattern = CHASSIS_MODULE_INFO_TABLE + '|' + chassis_module_name
 
-    keys = state_db.keys(state_db.STATE_DB, CHASSIS_MODULE_INFO_TABLE + key_pattern)
+    keys = state_db.keys(state_db.STATE_DB, key_pattern)
     if not keys:
         print('Key {} not found in {} table'.format(key_pattern, CHASSIS_MODULE_INFO_TABLE))
         return
@@ -62,7 +63,10 @@ def status(db, chassis_module_name):
         oper_status = data_dict[CHASSIS_MODULE_INFO_OPERSTATUS_FIELD]
         serial = data_dict[CHASSIS_MODULE_INFO_SERIAL_FIELD]
 
-        admin_status = 'up'
+        if is_smartswitch():
+            admin_status = 'down'
+        else:
+            admin_status = 'up'
         config_data = chassis_cfg_table.get(key_list[1])
         if config_data is not None:
             admin_status = config_data.get(CHASSIS_MODULE_INFO_ADMINSTATUS_FIELD)
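
Reviewer note: the key-pattern change in show/chassis_modules.py anchors the wildcard behind the `|` separator. The old no-argument pattern was the table name followed by a bare `*`, which would also match any other table whose name merely starts with the same prefix. The sketch below illustrates the difference with `fnmatch` standing in for the database's glob-style key lookup. The value of `CHASSIS_MODULE_INFO_TABLE` and the extra `CHASSIS_MODULE_TABLE_EXAMPLE` key are assumptions for illustration, and `build_key_pattern` and `match_keys` are hypothetical helpers.

```python
from fnmatch import fnmatchcase

# Assumed value; the diff only shows the constant's name.
CHASSIS_MODULE_INFO_TABLE = "CHASSIS_MODULE_TABLE"


def build_key_pattern(module_name=None):
    """Return the pattern the new code builds: '<table>|<module>' or '<table>|*'."""
    suffix = module_name if module_name else "*"
    return CHASSIS_MODULE_INFO_TABLE + "|" + suffix


def match_keys(all_keys, pattern):
    """Stand-in for a glob-style KEYS lookup over an in-memory key list."""
    return sorted(k for k in all_keys if fnmatchcase(k, pattern))


# Hypothetical key space, including a table that only shares the prefix.
keys = [
    "CHASSIS_MODULE_TABLE|DPU0",
    "CHASSIS_MODULE_TABLE|DPU1",
    "CHASSIS_MODULE_TABLE_EXAMPLE|DPU0",
]

old_style = match_keys(keys, CHASSIS_MODULE_INFO_TABLE + "*")  # pre-change pattern
new_style = match_keys(keys, build_key_pattern())              # '<table>|*'
one_module = match_keys(keys, build_key_pattern("DPU1"))       # '<table>|DPU1'
```

A side effect of building the full pattern up front is that the "Key ... not found" message now prints the complete key, table name included.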
