PMON High Level Design for SmartSwitch #1538
This commit is the HLD document for the SmartSwitch PMON 202405 release and needs to be shared with the community. This is the initial draft.
Except for state and previous_reboot_reason_from_host, all fields will be updated by the DPU once it boots. Those two fields will be updated by the switch from the information read from the hardware registers.
### 3.4. Midplane Interface
Smart Switch IP address assignment flow for midplane interface and DPU PCI interfaces is described here. Please check the document.
Correct, let us align with above doc
I have updated this with a new PR #1584.
Please look into this new PR.
Will deprecate this PR.
* RMA
* Inventory
The purpose of this document is to provide a framework to share the state, health, and alarms of the DPUs, and to manage the DPUs by providing support to monitor, gracefully shut down, and restart them and the associated peripherals such as thermal sensors, cooling devices, LEDs, etc.
Here are my comments on the overall doc.
New APIs:
get_dpu_state (will take DPU ID or slot/offset as an argument) => use the existing API get_oper_status(self)
get_dpu_health (will take DPU ID or slot/offset as an argument) => how is it different from the above API?
get_dpu_id (will take slot/offset as an argument) => it is covered in our spec and we reviewed this
get_dpu_ip (will take DPU ID or slot/offset as an argument) => use the existing API get_midplane_ip(self)
get_dpu_mac (will take DPU ID or slot/offset as an argument) => use the existing API get_base_mac(self)
Changes to platform plugin/API:
=> Why can't we use the existing module class? Simply treat the DPU as a module. For example, this doc talks about using some of the existing APIs, e.g. get_name(), and now we are introducing new APIs, e.g. get_dpu_state(). This will create a lot of confusion.
Extend module.py to support the DPU module:
class DpuCardModule(Module): why can't we reuse the existing module class?
The IP address assignment:
Here the proposal seems to be platform dependent, while we proposed and reviewed a platform-independent model. Let us align.
DPU state and show platform dpu state:
Why? We should keep it simple, e.g. show chassis modules status
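To make the suggestion above concrete, here is a minimal sketch of treating a DPU as an ordinary module, so that callers use only the getters named in this comment (get_name, get_oper_status, get_midplane_ip, get_base_mac). The `ModuleBase` class below is a hypothetical stand-in written for this example, not the real platform base class, and the status string, IP address, and MAC address are made-up illustrative values.

```python
class ModuleBase:
    """Hypothetical stand-in for the platform's existing module base class."""

    def get_name(self):
        raise NotImplementedError

    def get_oper_status(self):
        raise NotImplementedError

    def get_midplane_ip(self):
        raise NotImplementedError

    def get_base_mac(self):
        raise NotImplementedError


class Module(ModuleBase):
    """One module class that can represent a DPU as well (illustrative only)."""

    def __init__(self, name, oper_status, midplane_ip, base_mac):
        self._name = name
        self._oper_status = oper_status
        self._midplane_ip = midplane_ip
        self._base_mac = base_mac

    def get_name(self):
        return self._name

    def get_oper_status(self):
        # Plays the role proposed for get_dpu_state()
        return self._oper_status

    def get_midplane_ip(self):
        # Plays the role proposed for get_dpu_ip()
        return self._midplane_ip

    def get_base_mac(self):
        # Plays the role proposed for get_dpu_mac()
        return self._base_mac


def modules_status(modules):
    """Collect the rows a module status listing could display."""
    return [
        (m.get_name(), m.get_oper_status(), m.get_midplane_ip(), m.get_base_mac())
        for m in modules
    ]


# Illustrative values only; none of these come from the HLD.
dpu0 = Module("DPU0", "Online", "169.254.200.1", "00:11:22:33:44:55")
for row in modules_status([dpu0]):
    print(*row)
```

With this shape, a DPU needs no DPU-specific getters; whether that covers everything the HLD requires (for example health reporting) is the open question raised above.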
I have updated the PR with a new PR, #1584. Can you please look into #1584?
Will deprecate this PR
@rameshraghupathy we would need the platform.json of the NPU/switch to have the mapping below, showing the NPU port to DPU port mapping. This will be used by services early in the system boot for midplane IP assignment:
```json
"DPUs": [
    {
        "DPU0": {
            "Ethernet228": "Ethernet0",
            "Ethernet232": "Ethernet1"
        }
    },
    {
        "DPU1": {
            "Ethernet236": "Ethernet0",
            "Ethernet240": "Ethernet1"
        }
    }
]
```
On the DPU's platform.json, we can have
DPU: {
// Anything specific to DPU, else remain empty
}
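As a rough illustration of how a boot-time service might consume the NPU-side mapping, the sketch below flattens the "DPUs" list into a lookup keyed by NPU port. It assumes the fragment is wrapped in a top-level JSON object, as it would be inside platform.json; the function name `npu_to_dpu_ports` is hypothetical and not part of any existing SONiC service.

```python
import json

# The mapping from the comment above, wrapped in a top-level object.
PLATFORM_JSON = """
{
    "DPUs": [
        {"DPU0": {"Ethernet228": "Ethernet0", "Ethernet232": "Ethernet1"}},
        {"DPU1": {"Ethernet236": "Ethernet0", "Ethernet240": "Ethernet1"}}
    ]
}
"""


def npu_to_dpu_ports(platform):
    """Return {npu_port: (dpu_name, dpu_port)} from the "DPUs" list."""
    lookup = {}
    for entry in platform.get("DPUs", []):
        # Each list entry is a single-key object: {"DPUn": {npu_port: dpu_port}}
        for dpu_name, port_map in entry.items():
            for npu_port, dpu_port in port_map.items():
                lookup[npu_port] = (dpu_name, dpu_port)
    return lookup


lookup = npu_to_dpu_ports(json.loads(PLATFORM_JSON))
for npu_port in sorted(lookup):
    dpu_name, dpu_port = lookup[npu_port]
    print(f"{npu_port} -> {dpu_name}:{dpu_port}")
```

A DPU-side platform.json with an empty "DPU" object would simply yield an empty lookup here, since the "DPUs" key is absent.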
@rameshraghupathy can we close this PR if not used?
@rameshraghupathy closing this, since another PR (#1584) is in review instead.
* PMON High Level Design for SmartSwitch 202405 release. This commit is the HLD document for the SmartSwitch PMON 202405 release and needs to be shared with the community. This is the initial draft
* Updated IP address assignment section and did some minor enhancements
* Fixed the IP address link
* Fixed the ip add scheme link
* In the process of updating
* Document update in progress
* Update in progress
* Doc update in progress
* Doc update in progress
* Doc update in progress
* Updated document
* Updated version 0.2
* Did some clean up
* Did some cleanup
* Updated the new APIs section
* Minor changes to the new APIs based on Jan25 meeting
* Updated the thermal-mgmt-seq diagram
* updated the thermal-mgmt-seq diagram
* Updated the documents based on community call agreements and reviewed slide
* Did some formatting
* Fixed some formatting issues
* Deleted unwanted image files
* Addressing review comments
* Addressing review comments
* Addressed review comments
* Create smartswitch-pmon.md.save
* Did some cleanup
* Improved formatting
* Fixed the format of the return value contents of get_system_eprom_info for DPUs
* Cleaned up unwanted files
* Addressing review comments
* Updated the CLI and API sections
* Address a review comment
* Added sequence diagrams and updated the dpu state/health sections and power sequence
* Minor cleanup
* Addressed some review comments
* Addressed review comments
* Addressed some review comments
* Minor formatting change
* fixed get_module_dpu_data_port API
* Did some cleanup
* Fixed "show chassis modules status DPU0" cli
* Cleaned up the "show system-health .." CLIs
* Did some formatting
* Added DPU_STATE definition
* Cleaned up the soft reboot section
* Brought DPU_STATE details under system-health CLI as planned
* Added schema for reboot-cause
* Update smartswitch-pmon.md
* updated the CLI output for DPU health
* Did some cleanup
* Fixed typos
* updated the sequence diagram and made the sequence accurate. Provided more clarity to dpu_id and dpu_name
* Addressed some comments with respect to reboot-cause
* Added an example for the health info object
* Addressed some more review comments and updated the thermal-mgmt-seq diagram
* Fixed a typo
* fixed a type
* DPU_STATE definition updated, "TEMPERATURE_INFO" table update in ChassisStateDB has been called out, mentioned that the console management design will be covered in another document.
* Addressed review comments on section 3.5 and 3.6
* Addressed review comment on 3.5