PMON High Level Design for SmartSwitch #1538

Closed
wants to merge 2 commits into from

rameshraghupathy

This commit is the HLD document for the SmartSwitch PMON 202405 release and needs to be shared with the community. This is the initial draft.
@rameshraghupathy rameshraghupathy changed the title PMON High Level Design for SmartSwitch 202405 release PMON High Level Design for SmartSwitch Dec 4, 2023

Besides the state and previous_reboot_reason_from_host other fields will be updated by the DPU once it boots. The other fields will be updated by the switch from the information read from the hardware registers.

### 3.4. Midplane Interface

Hi @rameshraghupathy,

Smart Switch IP address assignment flow for midplane interface and DPU PCI interfaces is described here. Please check the document.

Correct, let us align with the above doc.

@rameshraghupathy rameshraghupathy Jan 8, 2024

I have replaced this with a new PR, #1584.
Please look into the new PR.
I will deprecate this PR.

* RMA
* Inventory

The purpose of this document is to provide a framework to share the state, health, alarms of the DPUs, manage the DPUs by providing support to monitor, gracefully shutdown, restart them and the associated peripherals such as thermal sensors, cooling devices, LEDs, etc.

@manapalai manapalai Dec 14, 2023

Here are my comments on the overall doc.

New APIs:
get_dpu_state (will take DPU ID or slot/offset as an argument) => use the existing API get_oper_status(self)
get_dpu_health (will take DPU ID or slot/offset as an argument) => how is it different from the above API?
get_dpu_id (will take slot/offset as an argument) => it is covered in our spec and we reviewed this
get_dpu_ip (will take DPU ID or slot/offset as an argument) => use the existing API get_midplane_ip(self)
get_dpu_mac (will take DPU ID or slot/offset as an argument) => use the existing API get_base_mac(self)

Changes to platform plugin/API:
=> Why can't we use the existing module class? Simply treat the DPU as a module. For example, this doc talks about using some existing APIs, e.g. get_name(), and now we are introducing new APIs, e.g. get_dpu_state(). This will create a lot of confusion.

Extend module.py to support the DPU module:
class DpuCardModule(Module): why can't we reuse the existing module class?

The IP address assignment:
The proposal here seems to be platform dependent, while we proposed and reviewed a platform-independent model. Let us align.

DPU state and show platform dpu state:
Why? We should keep it simple, e.g. show chassis modules status
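
To make the "treat the DPU as a module" suggestion concrete, here is a minimal sketch. The `Module` class below is a hypothetical stand-in, not the real platform class: it models only the four getters named in this comment, and the name, status, IP, and MAC values are made up for illustration.

```python
class Module:
    """Hypothetical stand-in for the existing module class.

    Only the getters named in the review comment are modeled here.
    """

    def __init__(self, name, oper_status, midplane_ip, base_mac):
        self._name = name
        self._oper_status = oper_status
        self._midplane_ip = midplane_ip
        self._base_mac = base_mac

    def get_name(self):
        return self._name

    def get_oper_status(self):
        return self._oper_status

    def get_midplane_ip(self):
        return self._midplane_ip

    def get_base_mac(self):
        return self._base_mac


def summarize_modules(modules):
    """Build one status row per module using only the generic getters."""
    return [
        {
            "name": m.get_name(),
            "oper_status": m.get_oper_status(),
            "midplane_ip": m.get_midplane_ip(),
            "base_mac": m.get_base_mac(),
        }
        for m in modules
    ]


# A DPU represented as an ordinary module (all values are illustrative).
dpu0 = Module("DPU0", "Online", "169.254.200.1", "00:11:22:33:44:55")
rows = summarize_modules([dpu0])
```

Under this assumption a caller never needs a DPU-specific getter such as get_dpu_state(); the same code path serves any module type.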

I have replaced this PR with a new one, #1584. Can you please look into #1584?

Will deprecate this PR

@prgeor prgeor left a comment

@rameshraghupathy we would need the platform.json of the NPU/switch to have the mapping below, showing the NPU port to DPU port mapping. This will be used by services early in system boot for midplane IP assignment.

"DPUs": [
    {
      "DPU0": {
        "Ethernet228": "Ethernet0",
        "Ethernet232": "Ethernet1"
      }
    },
    {
      "DPU1": {
        "Ethernet236": "Ethernet0",
        "Ethernet240": "Ethernet1"
      }
    }
]

On the DPU's platform.json, we can have:

"DPU": {
  // Anything specific to the DPU, else remains empty
}
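
As an illustration of how a boot-time service might consume the NPU-side mapping, the sketch below flattens the "DPUs" list into a lookup table keyed by NPU port. It assumes the fragment is wrapped in a top-level JSON object; the function name `npu_to_dpu_ports` is hypothetical and not part of any existing SONiC tool.

```python
import json

# The NPU-side fragment from this comment, wrapped in a top-level object
# (the wrapping is an assumption; the comment shows only the "DPUs" key).
PLATFORM_JSON = """
{
  "DPUs": [
    {"DPU0": {"Ethernet228": "Ethernet0", "Ethernet232": "Ethernet1"}},
    {"DPU1": {"Ethernet236": "Ethernet0", "Ethernet240": "Ethernet1"}}
  ]
}
"""


def npu_to_dpu_ports(platform):
    """Flatten the "DPUs" list into {npu_port: (dpu_name, dpu_port)}."""
    mapping = {}
    for entry in platform.get("DPUs", []):
        for dpu_name, ports in entry.items():
            for npu_port, dpu_port in ports.items():
                mapping[npu_port] = (dpu_name, dpu_port)
    return mapping


port_map = npu_to_dpu_ports(json.loads(PLATFORM_JSON))
print(port_map["Ethernet236"])  # ('DPU1', 'Ethernet0')
```

Keying by NPU port lets a service resolve the attached DPU in one lookup; the reverse direction would need a second table because DPU port names repeat across DPUs.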

prgeor commented Feb 13, 2024

@rameshraghupathy can we close this PR if not used?

prgeor commented Mar 1, 2024

@rameshraghupathy closing this since another PR, #1584, is in review instead.

@prgeor prgeor closed this Mar 1, 2024
prgeor pushed a commit that referenced this pull request May 12, 2024
* PMON High Level Design for SmartSwitch 202405 release

This commit is the HLD document for the SmartSwitch PMON 202405 release and needs to be shared with the community. This is the initial draft.

* Updated IP address assignment section and did some minor enhancements

* Fixed the IP address link

* Fixed the ip add scheme link

* In the process of updating

* Document update in progress

* Update in progress

* Doc update in progress

* Doc update in progress

* Doc update in progress

* Updated document

* Updated version 0.2

* Did some clean up

* Did some cleanup

* Updated the new APIs section

* Minor changes to the new APIs based on Jan25 meeting

* Updated the thermal-mgmt-seq diagram

* updated the thermal-mgmt-seq diagram

* Updated the documents based on community call agreements and reviewed slide

* Did some formatting

* Fixed some formatting issues

* Deleted unwanted image files

* Addressing review comments

* Addressing review comments

* Addressed review comments

* Create smartswitch-pmon.md.save

* Did some cleanup

* Improved formatting

* Fixed the format of the return value contents of get_system_eprom_info for DPUs

* Cleaned up unwanted files

* Addressing review comments

* Updated the CLI and API sections

* Address a review comment

* Added sequence diagrams and updated the dpu state/health sections and power sequence

* Minor cleanup

* Addressed some review comments

* Addressed review comments

* Addressed some review comments

* Minor formatting change

* fixed get_module_dpu_data_port API

* Did some cleanup

* Fixed "show chassis modules status  DPU0" cli

* Cleaned up the "show system-health .." CLIs

* Did some formatting

* Added DPU_STATE definition

* Cleaned up the soft reboot section

* Brought DPU_STATE details under system-health CLI as planned

* Added schema for reboot-cause

* Update smartswitch-pmon.md

* updated the CLI output for DPU health

* Did some cleanup

* Fixed typos

* Updated the sequence diagram and made the sequence accurate. Provided more clarity to dpu_id and dpu_name

* Addressed some comments with respect to reboot-cause

* Added an example for the health info object

* Addressed some more review comments and updated the thermal-mgmt-seq diagram

* Fixed a typo

* Fixed a typo

* DPU_STATE definition updated, "TEMPERATURE_INFO" table update in ChassisStateDB has been called out, mentioned that the console management design will be covered in another document.

* Addressed review comments on sections 3.5 and 3.6

* Addressed review comment on 3.5

4 participants