doc/psud/PSU_daemon_design.md (+172 -32)
@@ -1,6 +1,6 @@
# SONiC PSU Daemon Design #

### Rev 0.4 ###

### Revision ###

@@ -10,6 +10,7 @@
| 0.1 | | Chen Junchao | Initial version |
| 0.2 | August 4th, 2022 | Stephen Sun | Update according to the current implementation |
| 0.3 | August 8th, 2022 | Or Farfara | Add input current, voltage and max power |
| 0.4 | August 18th, 2022 | Stephen Sun | PSU power threshold checking logic |


## 1. Overview
@@ -28,6 +29,17 @@ The purpose of PSU daemon is to collect platform PSU data and trigger proper act
- whether the PSU voltage is outside its minimum and maximum thresholds
- whether the PSU temperature exceeds the threshold
- whether the total PSU power consumption exceeds the budget (modular switch only)
- whether PSU power consumption exceeds the PSU threshold

### 1.1 PSU power threshold check

#### 1.1.1 Why we need it

An Ethernet switch is typically equipped with more than one PSU for redundancy. It can be deployed in different scenarios with different types of xSFP modules, traffic types, and traffic loads, and at different temperatures. All these factors affect the power consumption of an Ethernet switch.

On some platforms, the capacity of a single PSU is not large enough to power all the components and xSFP modules running at their highest performance at the same time. In this case, redundancy is lost and users should be notified. This is achieved by periodically checking the current power of each PSU against its maximum allowed power, also known as its power threshold.

On some platforms, the maximum allowed power of the PSUs is not fixed but is a dynamic value that depends on other factors, such as the temperature of certain sensors on the switch.

## 2. PSU data collection

@@ -37,40 +49,103 @@ PSU daemon data collection flow diagram:

Now psud collects PSU data via the platform API, and it also supports the platform plugin for backward compatibility. All PSU data is saved to the Redis database for further use.

### 2.1 PSU data collection specific to PSU power exceeding check

We will leverage the existing framework of the PSU daemon, adding the corresponding logic to perform the PSU power check.

Currently, the PSU daemon is woken up periodically and executes the following flows (flows in bold are newly introduced by this feature):

1. Check the PSUs' physical entity information and update it in the database
2. Check the PSUs' presence and power-good information and update it in the database
   - __Check whether the PSU supports the power check by reading the PSU power thresholds when a new PSU is detected.__
3. Check and update the PSUs' data
   - Fetch voltage, current and power by calling the platform API
   - __Perform the PSU power checking logic__
   - Update all the information in the database

We will detail the new flows in the following sections.

#### New PSU is detected

Basically, there are two scenarios in which a new PSU can be detected:

- When the PSU daemon starts, all PSUs installed on the switch are detected
- When a new PSU is plugged in, the new PSU is detected

When one or more new PSUs are detected and their power is good, the PSU daemon tries to retrieve the warning-suppress and critical thresholds for each PSU installed on the switch.

The PSU power check will not be performed for a PSU if a `NotImplementedError` exception is raised or `None` is returned while either threshold is being retrieved.
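
The capability check described above can be sketched as follows. This is a minimal illustration, not the psud implementation: `read_power_thresholds`, `FakePsu` and `LegacyPsu` are hypothetical names introduced here, and only the two threshold getters come from the platform API described in this document.

```python
def read_power_thresholds(psu):
    """Return (warning_suppress, critical) in watts, or None if the
    power check is not supported for this PSU."""
    try:
        warning_suppress = psu.get_psu_power_warning_suppress_threshold()
        critical = psu.get_psu_power_critical_threshold()
    except NotImplementedError:
        # The platform does not implement at least one of the getters.
        return None
    if warning_suppress is None or critical is None:
        # A getter exists but reports no usable value.
        return None
    return warning_suppress, critical


class FakePsu:
    """Hypothetical stand-in for a platform PSU object (illustration only)."""

    def __init__(self, warning_suppress, critical):
        self._warning_suppress = warning_suppress
        self._critical = critical

    def get_psu_power_warning_suppress_threshold(self):
        return self._warning_suppress

    def get_psu_power_critical_threshold(self):
        return self._critical


class LegacyPsu:
    """Hypothetical PSU on a platform without threshold support."""

    def get_psu_power_warning_suppress_threshold(self):
        raise NotImplementedError

    def get_psu_power_critical_threshold(self):
        raise NotImplementedError
```

A caller would treat a `None` result as "power check not supported" and skip the power checking logic for that PSU.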

#### Alarm raising and clearing threshold

We use asymmetric thresholds for raising and clearing the alarm in order to create hysteresis and avoid alarm flapping.

- an alarm will be raised when a PSU's power rises across the critical threshold
- an alarm will be cleared when a PSU's power drops across the warning-suppress threshold

If a single unified power threshold were used, the alarm status could flap when the power fluctuates around the threshold. For example, in the following picture, the alarm is cleared every time the PSU power drops across the critical threshold and raised every time the PSU power rises across the critical threshold. With two thresholds, the alarm is not cleared and raised so frequently.
2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning-suppress threshold
   - If `current power` < `warning-suppress threshold`
     - Set `PSU power exceeded threshold` to `false`
     - Message in NOTICE level should be logged: `PSU <x>: current power <power> is below the warning-suppress threshold <threshold>` where
       - `<x>` is the number of the PSU
       - `<power>` is the current power of the PSU
       - `<threshold>` is the warning-suppress threshold of the PSU
   - Otherwise: no action
3. Otherwise, compare the current power against the critical threshold
   - If `current power` >= `critical threshold`
     - Set `PSU power exceeded threshold` to `true`
     - Message in WARNING level should be logged: `PSU <x>: current power <power> is exceeding the critical threshold <threshold>` where
       - `<x>` is the number of the PSU
       - `<power>` is the current power of the PSU
       - `<threshold>` is the critical threshold of the PSU
   - Otherwise: no action
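
Steps 2 and 3 above can be sketched as a single function. This is only an illustration under stated assumptions: `check_psu_power` and the logger name are hypothetical, and because Python's `logging` module has no NOTICE level, `INFO` is used as a stand-in for it.

```python
import logging

logger = logging.getLogger("psud-sketch")  # hypothetical logger name


def check_psu_power(psu_index, power, warning_suppress, critical, exceeded):
    """Return the new value of the `PSU power exceeded threshold` flag.

    `exceeded` is the flag value left by the previous iteration.
    """
    if exceeded:
        # Alarm already raised: clear it only below the warning-suppress threshold.
        if power < warning_suppress:
            # INFO stands in for the NOTICE level required by the design.
            logger.info(
                "PSU %s: current power %s is below the warning-suppress threshold %s",
                psu_index, power, warning_suppress)
            return False
        return True
    # Alarm not raised: raise it only at or above the critical threshold.
    if power >= critical:
        logger.warning(
            "PSU %s: current power %s is exceeding the critical threshold %s",
            psu_index, power, critical)
        return True
    return False
```

With thresholds of 38 W and 58 W, a reading of 45 W leaves the flag unchanged whichever state it was in, which is the hysteresis band that prevents flapping.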

## 3. DB schema for PSU

PSU number is stored in chassis table. Please refer to this [document](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-enhancement-design.md), section 1.5.2.

PSU information is stored in PSU table:

    ; Defines information for a psu
    key                              = PSU_INFO|psu_name   ; information for the psu
    ; field                          = value
    presence                         = BOOLEAN             ; presence state of the psu
    model                            = STRING              ; model name of the psu
    serial                           = STRING              ; serial number of the psu
    revision                         = STRING              ; hardware revision of the PSU
    status                           = BOOLEAN             ; status of the psu
    change_event                     = STRING              ; change event of the psu
    fan                              = STRING              ; fan_name of the psu
    led_status                       = STRING              ; led status of the psu
    is_replaceable                   = STRING              ; whether the PSU is replaceable
    temp                             = 1*3.3DIGIT          ; temperature of the PSU
    temp_threshold                   = 1*3.3DIGIT          ; temperature threshold of the PSU
    voltage                          = 1*3.3DIGIT          ; the output voltage of the PSU
    voltage_min_threshold            = 1*3.3DIGIT          ; the minimal voltage threshold of the PSU
    voltage_max_threshold            = 1*3.3DIGIT          ; the maximum voltage threshold of the PSU
    current                          = 1*3.3DIGIT          ; the current of the PSU
    power                            = 1*4.3DIGIT          ; the power of the PSU
    input_voltage                    = 1*3.3DIGIT          ; input voltage of the psu
    input_current                    = 1*3.3DIGIT          ; input current of the psu
    max_power                        = 1*4.3DIGIT          ; power capacity of the psu
    power_overload                   = "true" / "false"    ; whether the PSU's power exceeds the threshold
    power_warning_suppress_threshold = 1*4.3DIGIT          ; The power warning-suppress threshold
    power_critical_threshold         = 1*4.3DIGIT          ; The power critical threshold
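
As an illustration of the new fields, the sketch below builds the power-related part of one `PSU_INFO` entry as a plain dictionary. The helper name `psu_power_fields` and the two-decimal string formatting are assumptions made for this example, not something the schema mandates; the sample values are taken from the PSU 1 row of the psuutil example in section 4.2.

```python
def psu_power_fields(power, warning_suppress, critical, overloaded):
    """Format the power-related PSU_INFO fields as strings (hypothetical helper)."""
    return {
        "power": f"{power:.2f}",
        # The schema stores the flag as the literal strings "true" / "false".
        "power_overload": "true" if overloaded else "false",
        "power_warning_suppress_threshold": f"{warning_suppress:.2f}",
        "power_critical_threshold": f"{critical:.2f}",
    }


# Key format follows the schema: PSU_INFO|psu_name
key = "PSU_INFO|PSU 1"
fields = psu_power_fields(43.56, 38.0, 58.0, overloaded=False)
```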

Currently psud only collects and updates the "presence" and "status" fields.

## 4. PSU command

### 4.1 show platform psustatus
There is a sub command "psustatus" under "show platform"

```
@@ -95,14 +170,41 @@ Commands:

The current output for "show platform psustatus" looks like:

```
admin@sonic:~$ show platform psustatus
PSU    Model          Serial        HW Rev    Voltage (V)    Current (A)    Power (W)    Status    LED
-----  -------------  ------------  --------  -------------  -------------  -----------  --------  -----
PSU 1  MTEF-PSF-AC-A  MT1629X14911  A3        12.08          5.19           62.62        WARNING   green
PSU 2  MTEF-PSF-AC-A  MT1629X14913  A3        12.01          4.38           52.50        OK        green
```

The field `Status` represents the status of the PSU, which can be one of the following:
- `OK`, which means that no alarm is raised due to PSU power exceeding the threshold
- `Not OK`, which can be caused by:
  - power is not good, which means the PSU is present but has no power (e.g. the power is down or the power cable is unplugged)
- `WARNING`, which can be caused by:
  - power exceeds the PSU's power critical threshold
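
The mapping from PSU state to the `Status` column can be sketched as below. The function name `psu_status_label` is hypothetical, and the precedence (a PSU without good power is reported as `Not OK` regardless of the overload flag) is an assumption of this sketch rather than something stated in the design.

```python
def psu_status_label(power_good, power_overload):
    """Map PSU state to the string shown in the Status column."""
    if not power_good:
        # Present but not delivering power, e.g. cable unplugged.
        return "Not OK"
    if power_overload:
        # Power exceeded the critical threshold and the alarm is still raised.
        return "WARNING"
    return "OK"
```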

### 4.2 psuutil

`psuutil` fetches the information by calling the platform API directly. Both the warning-suppress and critical thresholds are exposed in the output of `psuutil status`.
The "WARNING" state is not exposed because psuutil is a one-time command rather than a daemon, which means it does not store state information. Because it only reads instantaneous values from the platform API, it cannot distinguish between the following states:

1. The power exceeded the critical threshold earlier and is now in the range between the warning-suppress and critical thresholds, which means the alarm should be raised
2. The power has not exceeded the critical threshold but exceeds the warning-suppress threshold, which means the alarm should not be raised

An example of the output:
98
197
```
99
198
admin@sonic:~$ show platform psustatus
100
-
PSU Model Serial HW Rev Voltage (V) Current (A) Power (W) Status LED
PSU 1 MTEF-PSF-AC-A MT1843K17965 A4 12.02 3.62 43.56 38.00 58.00 OK green
202
+
PSU 2 MTEF-PSF-AC-A MT1843K17966 A4 12.04 4.25 51.12 38.00 58.00 OK green
203
+
104
204
```

In case neither threshold is supported on the platform, `N/A` will be displayed.

## 5. PSU LED management

The purpose of PSU LED management is to notify the user of PSU events via the PSU LED or syslog. The PSU daemon psud needs to monitor PSU events (PSU voltage out of range, PSU too hot) and trigger proper actions if necessary.
@@ -162,20 +264,58 @@ class PsuBase(device_base.DeviceBase):

    def get_voltage_low_threshold(self):
        raise NotImplementedError

    def get_input_voltage(self):
        raise NotImplementedError

    def get_input_current(self):
        raise NotImplementedError

    def get_psu_power_warning_suppress_threshold(self):
        """
        Retrieve the warning-suppress threshold of the power on this PSU
        The value can be volatile, so the caller should call the API each time it is used.

        Returns:
            A float number, the warning-suppress threshold of the PSU in watts.
        """
        raise NotImplementedError

    def get_psu_power_critical_threshold(self):
        """
        Retrieve the critical threshold of the power on this PSU
        The value can be volatile, so the caller should call the API each time it is used.

        Returns:
            A float number, the critical threshold of the PSU in watts.
        """
        raise NotImplementedError
```

### 6. PSU daemon flow

Supervisord takes charge of this daemon. The daemon loops every 3 seconds, gets the data from psuutil/platform API, and then writes it to the Redis DB.

- The psu_num is stored in the "chassis_info" table. It is retrieved only once, when the system boots up or reloads. The key is chassis_name, the field is "psu_num" and the value is from get_psu_num().
- The psu_status and psu_presence are stored in the "psu_info" table. They are updated every 3 seconds. The key is psu_name, the fields are "presence" and "status", and the values are from get_psu_presence() and get_psu_status().
- The daemon queries PSU events every 3 seconds via the platform API. If any event is detected, it sets the PSU LED color accordingly and triggers the proper syslog.

### 7. Test cases

#### 7.1 Unit test cases added for PSU power exceeding checking

1. Neither `get_psu_power_warning_suppress_threshold` nor `get_psu_power_critical_threshold` is supported by the platform API when a new PSU is identified

   In `psu_status`, power exceeding check should be stored as `not supported` and no further function calls should be made.
2. Both `get_psu_power_warning_suppress_threshold` and `get_psu_power_critical_threshold` are supported by the platform API when a new PSU is identified

   In `psu_status`, power exceeding check should be stored as `supported`
3. PSU's power was less than the warning-suppress threshold and is in the range (warning-suppress threshold, critical threshold): no action
4. PSU's power was in the range (warning-suppress threshold, critical threshold) and is greater than the critical threshold
   1. if warning was raised, no action expected
   2. if warning was not raised, a warning should be raised
5. PSU's power was less than the warning-suppress threshold and is greater than the critical threshold: a warning should be raised
6. PSU's power was greater than the critical threshold and is in the range (warning-suppress threshold, critical threshold): no action
7. PSU's power was in the range (warning-suppress threshold, critical threshold) and is less than the warning-suppress threshold:
   1. if warning was raised, the warning should be cleared
   2. if warning was not raised, no action
8. PSU's power was greater than the critical threshold and is less than the warning-suppress threshold: the warning should be cleared
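
Cases 3 to 8 can be expressed as a small table-driven check. The sketch below is an assumption-laden illustration, not the actual psud unit tests: `next_alarm_state`, `CASES` and `failing_cases` are hypothetical names, and the 38 W / 58 W thresholds are borrowed from the psuutil example in section 4.2.

```python
WARNING_SUPPRESS = 38.0  # watts, sample value
CRITICAL = 58.0          # watts, sample value


def next_alarm_state(power, raised):
    """Hysteresis rule: raise at >= critical, clear at < warning-suppress."""
    if raised:
        return power >= WARNING_SUPPRESS
    return power >= CRITICAL


# (case id, current power, warning raised before, warning raised after)
CASES = [
    ("3", 45.0, False, False),    # below -> in range: no action
    ("4.1", 60.0, True, True),    # in range, raised -> above critical: no action
    ("4.2", 60.0, False, True),   # in range, not raised -> above critical: raise
    ("5", 60.0, False, True),     # below -> above critical: raise
    ("6", 45.0, True, True),      # above critical -> in range: no action
    ("7.1", 30.0, True, False),   # in range, raised -> below: clear
    ("7.2", 30.0, False, False),  # in range, not raised -> below: no action
    ("8", 30.0, True, False),     # above critical -> below: clear
]


def failing_cases():
    """Return the ids of the cases whose outcome differs from the expectation."""
    return [case_id for case_id, power, raised, expected in CASES
            if next_alarm_state(power, raised) != expected]
```

An empty list from `failing_cases()` means every transition in the table behaves as the test plan expects.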