-
Notifications
You must be signed in to change notification settings - Fork 187
Add new platform management API. Currently with support for chassis, PSUs, fans and watchdog #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
b039969
aa55f96
bfce7ad
d169853
bd6b6ed
8dfa89f
e026fea
d281571
7e1f1ec
7ab83cf
8e9aaaf
d605896
5c91399
5fbf599
147e830
f7c3810
9bc056f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ | |
packages=[ | ||
'sonic_eeprom', | ||
'sonic_led', | ||
'sonic_platform_base', | ||
'sonic_psu', | ||
'sonic_sfp', | ||
], | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,223 @@ | ||
# | ||
# chassis_base.py | ||
# | ||
# Base class for implementing a platform-specific class with which | ||
# to interact with a chassis device in SONiC. | ||
# | ||
|
||
import sys | ||
from . import device_base | ||
|
||
|
||
class ChassisBase(device_base.DeviceBase): | ||
""" | ||
Base class for interfacing with a platform chassis | ||
""" | ||
|
||
# Possible reboot causes | ||
REBOOT_CAUSE_POWER_LOSS = "power_loss" | ||
REBOOT_CAUSE_THERMAL_OVERLOAD_CPU = "thermal_overload_cpu" | ||
REBOOT_CAUSE_THERMAL_OVERLOAD_ASIC = "thermal_overload_asic" | ||
REBOOT_CAUSE_THERMAL_OVERLOAD_OTHER = "thermal_overload_other" | ||
REBOOT_CAUSE_INSUFFICIENT_FAN = "insufficient_fan" | ||
REBOOT_CAUSE_WATCHDOG = "watchdog" | ||
REBOOT_CAUSE_HARDWARE_OTHER = "hardware_other" | ||
REBOOT_CAUSE_NON_HARDWARE = "non_hardware" | ||
|
||
# List of ModuleBase-derived objects representing all modules | ||
# available on the chassis (for use with modular chassis) | ||
_module_list = [] | ||
|
||
# List of FanBase-derived objects representing all fans | ||
# available on the chassis | ||
_fan_list = [] | ||
|
||
# List of PsuBase-derived objects representing all power supply units | ||
# available on the chassis | ||
_psu_list = [] | ||
|
||
# Object derived from WatchdogBase for interacting with hardware watchdog | ||
_watchdog = None | ||
|
||
def get_base_mac(self): | ||
""" | ||
Retrieves the base MAC address for the chassis | ||
|
||
Returns: | ||
A string containing the MAC address in the format | ||
'XX:XX:XX:XX:XX:XX' | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_reboot_cause(self): | ||
""" | ||
Retrieves the cause of the previous reboot | ||
|
||
Returns: | ||
A tuple (string, string) where the first element is a string | ||
containing the cause of the previous reboot. This string must be | ||
one of the predefined strings in this class. If the first string | ||
is "REBOOT_CAUSE_HARDWARE_OTHER", the second string can be used | ||
to pass a description of the reboot cause. | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_component_versions(self): | ||
""" | ||
Retrieves platform-specific hardware/firmware versions for chassis | ||
componenets such as BIOS, CPLD, FPGA, etc. | ||
|
||
Returns: | ||
A string containing platform-specific component versions | ||
""" | ||
raise NotImplementedError | ||
|
||
############################################## | ||
# Module methods | ||
############################################## | ||
|
||
def get_num_modules(self): | ||
""" | ||
Retrieves the number of modules available on this chassis | ||
|
||
Returns: | ||
An integer, the number of modules available on this chassis | ||
""" | ||
return len(self._module_list) | ||
|
||
def get_all_modules(self): | ||
""" | ||
Retrieves all modules available on this chassis | ||
|
||
Returns: | ||
A list of objects derived from ModuleBase representing all | ||
modules available on this chassis | ||
""" | ||
return self._module_list | ||
|
||
def get_module(self, index): | ||
""" | ||
Retrieves module represented by (0-based) index <index> | ||
|
||
Args: | ||
index: An integer, the index (0-based) of the module to | ||
retrieve | ||
|
||
Returns: | ||
An object dervied from ModuleBase representing the specified | ||
module | ||
""" | ||
module = None | ||
|
||
try: | ||
module = self._module_list[index] | ||
except IndexError: | ||
sys.stderr.write("Module index {} out of range (0-{})\n".format( | ||
index, len(self._module_list)-1)) | ||
|
||
return module | ||
|
||
############################################## | ||
# Fan methods | ||
############################################## | ||
|
||
def get_num_fans(self): | ||
""" | ||
Retrieves the number of fans available on this chassis | ||
|
||
Returns: | ||
An integer, the number of fan modules available on this chassis | ||
""" | ||
return len(self._fan_list) | ||
|
||
def get_all_fans(self): | ||
""" | ||
Retrieves all fan modules available on this chassis | ||
|
||
Returns: | ||
A list of objects derived from FanBase representing all fan | ||
modules available on this chassis | ||
""" | ||
return self._fan_list | ||
|
||
def get_fan(self, index): | ||
""" | ||
Retrieves fan module represented by (0-based) index <index> | ||
|
||
Args: | ||
index: An integer, the index (0-based) of the fan module to | ||
retrieve | ||
|
||
Returns: | ||
An object dervied from FanBase representing the specified fan | ||
module | ||
""" | ||
fan = None | ||
|
||
try: | ||
fan = self._fan_list[index] | ||
except IndexError: | ||
sys.stderr.write("Fan index {} out of range (0-{})\n".format( | ||
index, len(self._fan_list)-1)) | ||
|
||
return fan | ||
jleveque marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
############################################## | ||
# PSU methods | ||
############################################## | ||
|
||
def get_num_psus(self): | ||
""" | ||
Retrieves the number of power supply units available on this chassis | ||
|
||
Returns: | ||
An integer, the number of power supply units available on this | ||
chassis | ||
""" | ||
return len(self._psu_list) | ||
|
||
def get_all_psus(self): | ||
""" | ||
Retrieves all power supply units available on this chassis | ||
|
||
Returns: | ||
A list of objects derived from PsuBase representing all power | ||
supply units available on this chassis | ||
""" | ||
return self._psu_list | ||
|
||
def get_psu(self, index): | ||
""" | ||
Retrieves power supply unit represented by (0-based) index <index> | ||
|
||
Args: | ||
index: An integer, the index (0-based) of the power supply unit to | ||
retrieve | ||
|
||
Returns: | ||
An object dervied from PsuBase representing the specified power | ||
supply unit | ||
""" | ||
psu = None | ||
|
||
try: | ||
psu = self._psu_list[index] | ||
except IndexError: | ||
sys.stderr.write("PSU index {} out of range (0-{})\n".format( | ||
index, len(self._psu_list)-1)) | ||
|
||
return psu | ||
|
||
############################################## | ||
# Other methods | ||
############################################## | ||
|
||
def get_watchdog(self): | ||
""" | ||
Retreives hardware watchdog device on this chassis | ||
|
||
Returns: | ||
An object derived from WatchdogBase representing the hardware | ||
watchdog device | ||
""" | ||
return _watchdog |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# | ||
# device_base.py | ||
# | ||
# Abstract base class for interfacing with a generic type of platform | ||
# peripheral device in SONiC | ||
# | ||
|
||
class DeviceBase(object): | ||
""" | ||
Abstract base class for interfacing with a generic type of platform | ||
peripheral device | ||
""" | ||
|
||
def get_presence(self): | ||
""" | ||
Retrieves the presence of the device | ||
|
||
Returns: | ||
bool: True if device is present, False if not | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_model(self): | ||
""" | ||
Retrieves the model number (or part number) of the device | ||
|
||
Returns: | ||
string: Model/part number of device | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_serial(self): | ||
""" | ||
Retrieves the serial number of the device | ||
|
||
Returns: | ||
string: Serial number of device | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_status(self): | ||
""" | ||
Retrieves the operational status of the device | ||
|
||
Returns: | ||
A boolean value, True if device is operating properly, False if not | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_change_event(self, timeout=0): | ||
""" | ||
Returns a dictionary containing all devices which have experienced a | ||
change | ||
|
||
Args: | ||
timeout: Timeout in milliseconds (optional). If timeout == 0, | ||
this method will block until a change is detected. | ||
|
||
Returns: | ||
(bool, dict): | ||
- True if call successful, False if not; | ||
- Dict where key is device ID and value is device event, | ||
status='1' represents device inserted, | ||
status='0' represents device removed. Ex. {'0': '1', '1': '0'} | ||
""" | ||
raise NotImplementedError | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just wanted to get confirmation that if a platform takes a Baseboard Management Controller (BMC) based approach (to handle thermals, events etc), these (and the other such APIs) can remain unimplemented.. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The intention here is for daemons to receive notifications of changes in platform hardware state in order to update SONiC's view of device state. For instance, if a tranceiver, PSU or fan tray is added or removed. Can you provide an example of an event handled by the BMC which SONiC would not need to be made aware of? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see that the events referred to is specifically to track the presence/absence i..e OIR of the modules (essentially this API seems to be a generalization of the get_transceiver_change_event()). If the BMC implements the thermals, an addition/removal of FAN tray would be completely handled by the BMC itself - but atleast for generating a syslog this API needs to be implemented for SONiC.. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
# | ||
# fan_base.py | ||
# | ||
# Abstract base class for implementing a platform-specific class with which | ||
# to interact with a fan module in SONiC | ||
# | ||
|
||
from . import device_base | ||
|
||
|
||
class FanBase(device_base.DeviceBase): | ||
""" | ||
Abstract base class for interfacing with a fan module | ||
""" | ||
|
||
# Possible fan directions (relative to port-side of device) | ||
FAN_DIRECTION_INTAKE = "intake" | ||
FAN_DIRECTION_EXHAUST = "exhaust" | ||
|
||
# Possible fan status LED colors | ||
STATUS_LED_COLOR_GREEN = "green" | ||
STATUS_LED_COLOR_RED = "red" | ||
STATUS_LED_COLOR_OFF = "off" | ||
|
||
def get_direction(self): | ||
""" | ||
Retrieves the direction of fan | ||
|
||
Returns: | ||
A string, either FAN_DIRECTION_INTAKE or FAN_DIRECTION_EXHAUST | ||
depending on fan direction | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_speed(self): | ||
""" | ||
Retrieves the speed of fan as a percentage of full speed | ||
|
||
Returns: | ||
An integer, the percentage of full fan speed, in the range 0 (off) | ||
to 100 (full speed) | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_target_speed(self): | ||
""" | ||
Retrieves the target (expected) speed of the fan | ||
|
||
Returns: | ||
An integer, the percentage of full fan speed, in the range 0 (off) | ||
to 100 (full speed) | ||
""" | ||
raise NotImplementedError | ||
|
||
def get_speed_tolerance(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm assuming this is only for the fan daemon logic usage? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is the initial intention. We may find other uses in the future, however. Is there a concern? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No concerns. I'm was just curious if the fand was in the roadmap. |
||
""" | ||
Retrieves the speed tolerance of the fan | ||
|
||
Returns: | ||
An integer, the percentage of variance from target speed which is | ||
considered tolerable | ||
""" | ||
raise NotImplementedError | ||
|
||
def set_speed(self, speed): | ||
""" | ||
Sets the fan speed | ||
|
||
Args: | ||
speed: An integer, the percentage of full fan speed to set fan to, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see everywhere that the fan speed is an int from 0 to 100. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would you prefer a 0.0 ... 1.0 float scale? This should allow for finer-grained control if necessary, while maintaining a more universal scale. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't mind either way, integer percentages 0 to 100 sounds good to me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If all platforms used sysfs, the 0-255 scale would be fine. However, we'd like a universal scale that will work with other interfaces (IPMI, etc.). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't have any strong preference. I just wanted to bring that topic to make sure it wasn't overlooked. |
||
in the range 0 (off) to 100 (full speed) | ||
|
||
Returns: | ||
A boolean, True if speed is set successfully, False if not | ||
""" | ||
raise NotImplementedError | ||
|
||
def set_status_led(self, color): | ||
""" | ||
Sets the state of the fan module status LED | ||
|
||
Args: | ||
color: A string representing the color with which to set the | ||
fan module status LED | ||
|
||
Returns: | ||
bool: True if status LED state is set successfully, False if not | ||
""" | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.
Some platform might also have limitation on the number of reboot causes that can be stored.
I would suggest either a default parameter
clear=False
to this method or to add a new one calledclear_reboot_cause
.Also I think this function should only be called once with
clear=True
and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the need to clear the reboot cause after reading is platform-specific functionality, I believe this logic should be left to the platform-specific implementation of the function. If the platform requires clearing the reboot cause, it should be done in the implementation and a file saved to tmpfs. The function can check for the presence of the file first before attempting to read/clear the register. I prefer this to adding a
clear=True
parameter to the base method, and I think adding aclear_reboot_cause()
method is unnecessary, because there should be no need to explicitly clear the reboot cause.If you're concerned that not all platforms will support all reboot causes, that is expected. The base class will define all possible reboot causes that SONiC understands; the platform-specific implementation will simply return the subset which is applicable to that platform. If this doesn't address your concern, please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume all the platforms to behave the same way with regard to the reboot causes. However this information would have to be crowd-sourced from other vendors as well for more relevant data.
I'm fine with letting the responsibility of clearing and caching the reboot cause to the plugin implementation.
Is there a place in SONiC where such cache files should go? Currently almost everything is backed by the flash including
/tmp
. The only place that is atmpfs
and would feel "suitable" seems to be/run
. Is there some guidelines somewhere regarding this? I would be interested to know the reason why/tmp
is not atmpfs
.My concern was a bit different. Your previous answer ended up answered my question but I'll elaborate a bit more.
Basically the concept of a reboot cause is to be stored on an external device until being read.
These devices usually have space restriction and can only save a given number of reboot cause after which they won't store anything until being cleared. For example if your device can store 2 reboot cause and you trigger 2 watchdogs and then 1 powerloss, you might not see the powerloss if you haven't cleared the reboot cause after each reboot.