Skip to content

CLI support for SmartSwitch PMON #3271

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 181 commits into from
Jan 27, 2025
Merged

Conversation

rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented Apr 14, 2024

What I did

Enhanced the following CLIs to support SmartSwitch PMON as described in the PMON HLD documentation "https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md"

CLIs:
Added new module "DPUX" support for 1 and 2 below
1. "config chassis module startup DPUX" , where X could be 0, to the maximum number of DPUs-1 in the SmartSwitch chassis
2. "config chassis module shutdown DPUX"

Extended the following CLIs to support the new module "DPUX" and also provided the "all" option to display the reboot-cause of the "NPU" and all "DPU" modules.
1. "show reboot-cause" will remain the same and added "show reboot-cause all"
2. "show reboot-cause history" will remain the same and added "show reboot-cause history <module-name>", where module name could be DPUX, and all.

Added a new sub command "show system-health dpu <module-name>", where module name could be DPUX, and all. This new subcommand will provide additional DPU state details as mentioned in the HLD

How I did it

  1. Kept the original CLI output unaltered
  2. Added sub command to support SmartSwitch "DPUs"
  3. Added additional code in chassisd, and in platform modules.py, chassis.py to support it
  4. Updated the DB tables as mentioned in the PMON HLD

Fixes: sonic-net/sonic-buildimage#21372

How to verify it

  1. Build an image with the required files (refer to the other upstream PRs and the platform PRs)
    Require files:
    - This PR including reboot_cause.py, chassis_modules.py, system_health.py)
    - The other PR including module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json
    - Platform "platform-cisco-8000" supporting PMON (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required DB changes)
  2. Run the CLIs and see the new output

CLI output

root@sonic:~# show reboot-cause
Unknown

root@MtFuji:/home/cisco# show reboot-cause all
Device Name Cause Time User


NPU 2024_12_11_01_54_02 reboot Wed Dec 11 01:48:07 AM UTC 2024 cisco
DPU0 2024_12_11_01_56_45 reboot Wed Dec 11 01:56:45 AM UTC 2024

cisco@MtFuji-dut:~$ show reboot-cause history
Name Cause Time User Comment


2024_12_19_14_00_33 reboot Thu Dec 19 01:55:41 PM UTC 2024 cisco N/A
2024_12_15_08_21_38 reboot Sun Dec 15 08:16:36 AM UTC 2024 cisco N/A

cisco@MtFuji-dut:~$ show reboot-cause history all
Device Name Cause Time User Comment


NPU 2024_12_19_14_00_33 reboot Thu Dec 19 01:55:41 PM UTC 2024 cisco N/A
NPU 2024_12_15_08_21_38 reboot Sun Dec 15 08:16:36 AM UTC 2024 cisco N/A
DPU1 2024_12_19_14_36_47 Non-Hardware Thu Dec 19 02:36:47 PM UTC 2024 Switch rebooted DPU
DPU0 2024_12_19_14_03_24 reboot Thu Dec 19 02:03:24 PM UTC 2024 N/A
DPU0 2024_12_17_15_57_29 reboot Tue Dec 17 03:57:29 PM UTC 2024 N/A

root@sonic:~# show reboot-cause history DPU0
Device Name Cause Time User Comment


DPU0 2024_12_19_14_03_24 reboot Thu Dec 19 02:03:24 PM UTC 2024 N/A
DPU0 2024_12_17_15_57_29 reboot Tue Dec 17 03:57:29 PM UTC 2024 N/A

@oleksandrivantsiv
Copy link
Collaborator

Can you please add UT for the new functions?

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Contributor

vvolam commented Jan 6, 2025

@rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Contributor

vvolam commented Jan 7, 2025

@rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@rameshraghupathy Except this, remaining looks good to me

@rameshraghupathy
Copy link
Contributor Author

@rameshraghupathy could you check the description once as the command and sub-command mentioned are same. Please make corrections as needed?

@vvolam There was a formatting issue due to which the preview wasn't showing "". Fixed it.

vvolam
vvolam previously approved these changes Jan 7, 2025
Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy in all these sample examples can you specify where the CLI is being executed. NPU or in the DPU host.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Added samples, in missing places

show reboot-cause all
```

- Example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy please also capture the CLI output specifically when run inside DPU (even though its same as fixed chassis)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Added the CLI output when run on DPU

key = key + suffix
keys = chassis_state_db.keys(chassis_state_db.CHASSIS_STATE_DB, key)
if not keys:
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy how will the user know the CLI failed? How will the script know if the return code is not zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Added an error message to indicate the DPU_STATE table doesn't exist for the module in the DB.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor prgeor merged commit 97c20cc into sonic-net:master Jan 27, 2025
7 checks passed
nmoray pushed a commit to nmoray/sonic-utilities that referenced this pull request Jun 25, 2025
* CLI support for SmartSwitch PMON

* imad minor fixes

* Did some cleanup for backward compatibility

* removed the column wrapping

* Made it backward compatible and removed textwrap and added ut to PR

* 1. There was a duplication of part of a function and that has been
   addressed. 2. The DPU reboot-cause data is fetched directly fromn the
chassis_state_db now

* reboot_cause and system_health are obtained directly from chassisStateDB
now

* The expected and result are the same but the test is throwing an error,
temporarily bypassing the check

* Let us get the build going and then look into the test mockup

* Implemented as per the pmon hld, also made some improvements in the
implementation

* Fixed the key for CHASSIS_MODULE_INFO_TABLE entries

* Fixed "show reboot-cause all" and "show reboot-cause history all"

* Addressing review comments

* Checking if the test issue still exists

* Resolving SA errors triggered due to reboot_cause_test

* Resolved pre-commit issues

* Resolved pre-commit issues

* Improving coverage

* Fixed SA related warnings

* Did some cleanup

* Minor improvements and fixes

* Adding tests for system health

* Adding more system health related tests

* Fixed a minor issue

* Fixed long line SA issue

* Trying to please SA

* Trying to improve coverage

* import mock

* Fixed a typo

* mocking DB

* Fixed syntax issues

* DB mock fix

* removed unused import

* creating ut for dpu state

* Improving coverage

* Fixed a typo

* Adjusted the reboot-cause key as per the updated hld

* Added fix to gracefully handle sytem health DB keys not present case

* Addressed minor review comments

* Addressed review comments.  Commented out system-health support until
phase:2

* Resolved minor issues and SA failures

* Added role to PORT table in config_db.  Using role to differentiate
npu-dpu data plane connection in SmartSwitch with Dpc being the role.
Did a minor cleanup.

* Resolving pre-commit check error related to line > 120

* Trying to avoid pre-commit issues

* Testing SA and precommit checks

* Making it backward compatible

* Resolving column size and whitespace issue

* Working on SA issue

* Testing SA and UT

* Added 2 spaces before inline comment

* Enabling "show system-health dpu" cli alone.  The rest of the dpu health
is differed for now.

* Fixed SA issues

* Adde new line at EOF

* Enabling the UT for the CLI "show system-health dpu"

* Resolved SA issues

* Resolved a SA issue

* Added smartswitch specific "reboot-cause" and "reboot-cause history" CLI
extensions

* Removed the phase:2 related system-health cli extensions as a seperate
PR will be raised eventually for phase:2

* Using smartswitch qualifier for the clie extensions

* Fixed SA issues

* mocking device_info for test cases

* import patch in tests

* Debugging test failure

* Fixing SA issues

* fixing sa issues

* Debugging sa issues

* trying to resolve sa issues

* fixed indentation

* debugging

* debugging

* debugging

* debugging

* Debugging

* debugging

* debugging

* Debugging

* Debugging

* Debuggingg

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debuggingg

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Removing the test to build an image

* Removed mock import

* Improving coverage

* pleasing SA

* Fixing tests for design changes as per review comments

* Resolving test failure

* fixed indentation

* cleaned up the test case

* Addressed review comments in Command-Reference.md and trying to improve
coverage

* Improving coverage

* Fixed a test issue

* Addressed review comments

* Addressed review comment.  Reading DPUs list from config_db.json

* Improving coverage

* Resolved SA error

* Trying to improve coverage. Also, reading from platform.json

* adding json import in the test

* Fixed a test failure

* Fixed SA error

* Exercising the new function in test

* Removed a blank line

* fixing mock issue

* Trying a different approach

* working on coverage

* debugging

* debugging

* Debugging

* Increasing coverage

* improving coverage

* Adjusting the show cli implementation to align with the reboot-cause
changes such as 1. STATE_DB vs CHASSIS_STATE_DB and the key info

* Fixing a minor issue

* Removed ID column from the "show system-health dpu DPUx" cli as per the new requirement

* Addressed default dpu admin status for dark-mode and seamless migration
to lightup mode

* Resolving SA issue

* Resolved a typo

* Added checks to see if module_name is valid in the "config chassis
modules startup DPUx" cli aand also moved all the required utilities to
the common file

* Fixed white space issues

* Cleaned unwanted import

* Fixed build issues

* missedout the fixes in a couple of files

* With the recent code the app_db multi_asic.PORT_ROLE is Dpc for DPU
ports, earlier this was not the case. So removing the additional check.

* As the port role issue is no longer seen in smartswitch, cleaning up the
related chnages.

* Using the verbose define for TYPE_DPC in the CLI, if there is a specific
requirement to keep 'TYPE_DPC = Dpc", which is the role, then we will
revert it

* Reverting intfutil_test.py

* Using the common API to get_dpu_list

* Removed unused import json

* Addressed review comments

* Did some minor cleanp

* Fix: SA error

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Added fix for issue:21372 - Device name column shows NPU instead of module name

* Added fix for issue:21372 - Fixing the device name colum in the cli output

* Added a few review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Smartswitch][reboot-cause] Invalid reboot cause on First boot
8 participants