doc/event-alarm-framework/event-alarm-framework.md
33 additions & 20 deletions
@@ -96,10 +96,8 @@ Such a change has an important metric called *severity* to indicate how critical
96
96
After the application recovers from the condition, that alarm is *cleared* by sending an event with an action: clear.
97
97
An operator could *acknowledge* an alarm. This indicates that the user is aware of the faulty condition.
98
98
99
-

100
-
101
-
Overall system LED state is deduced from the severities of alarms.
102
-
An acknowledged alarm is taken out of consideration from deciding system LED state.
99
+
Overall system LED state can be deduced from the severities of alarms.
100
+
An acknowledged alarm should be taken out of consideration from deciding system LED state.
103
101
104
102
Both events and alarms get recorded in the EVENT_DB.
105
103
@@ -116,10 +114,19 @@ Both events and alarms get recorded in the EVENT_DB.
116
114
117
115
In effect, the ALARM table contains outstanding alarms that need to be cleared by the components that raised them.
118
116
This table is NOT persisted and its contents are cleared with a reload.
119
-
117
+
120
118
In summary, the framework provides both current and historical event status of software and physical entities of the system through ALARM and EVENT tables.
121
119
122
-
Statistics on number of alarms based on severity are maintained in ALARM_STATS table. An alarm that is cleared or acknowledged reduces the corresponding counter to be reduced by 1.
120
+
In addition to the above tables, the framework maintains various statistics.
121
+
122
+
1. Event Statistics Table
123
+
Statistics on number of events and alarms are maintained in EVENT_STATS table.
124
+
125
+
2. Alarm Statistics Table
126
+
Statistics on number of alarms based on severity are maintained in ALARM_STATS table.
127
+
When an application raises an alarm, the counter corresponding to the alarm's severity is increased by 1.
128
+
When the alarm is cleared or acknowledged, the corresponding counter will be reduced by 1.
129
+
This table categorizes "active" alarms per severity.
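The counter behaviour described above can be sketched as follows. This is a minimal illustration, not the eventd implementation: the `AlarmStats` class, its method names, and the in-memory dict are hypothetical stand-ins for the ALARM_STATS table, and the severity names are taken from the LED rule later in this document.

```python
# Severities that apply to alarms. "informational" is left out because the
# document states it is not applicable to alarms.
ALARM_SEVERITIES = ("critical", "major", "minor", "warning")


class AlarmStats:
    """Hypothetical in-memory stand-in for the ALARM_STATS table."""

    def __init__(self):
        # One counter of "active" alarms per severity, all starting at 0.
        self.counters = {severity: 0 for severity in ALARM_SEVERITIES}

    def _check(self, severity):
        if severity not in self.counters:
            raise ValueError(f"not an alarm severity: {severity!r}")

    def on_raise(self, severity):
        """An application raised an alarm: increase its severity counter by 1."""
        self._check(severity)
        self.counters[severity] += 1

    def on_clear_or_ack(self, severity):
        """The alarm was cleared or acknowledged: reduce the counter by 1."""
        self._check(severity)
        # Assumption: never let a counter go below zero.
        if self.counters[severity] > 0:
            self.counters[severity] -= 1


stats = AlarmStats()
stats.on_raise("critical")
stats.on_raise("minor")
stats.on_clear_or_ack("critical")
print(stats.counters)
```

The sketch does not track individual alarms, so it cannot tell whether an alarm that is acknowledged and later cleared has already been subtracted; a real implementation would need the ALARM table for that.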
123
130
124
131
As mentioned above, each event has an important characteristic: severity. SONiC uses the following severities for events and alarms.
125
132
@@ -134,6 +141,8 @@ As mentioned above, each event has an important characteristic: severity. SONiC
134
141
- informational : Does not impact performance. NOT applicable to alarms.
135
142
( maps to log-notice )
136
143
144
+

145
+
137
146
By default every event will have a severity assigned by the component. The framework provides Event Profiles to customize severity of an event and also disable an event.
138
147
139
148
An example of event profile is as below:
@@ -165,7 +174,7 @@ This modified file can then be uploaded to the device.
165
174
Operator can select any of these custom event profiles to change default properties of events.
166
175
The selected profile is persistent across reboots and will be in effect until operator selects either default or another custom profile.
167
176
168
-
In addition to storing events in DB, framework forwards log messages corresponding to all the events to syslog.
177
+
In addition to storing events in EVENT_DB, framework forwards log messages corresponding to all the events to syslog.
169
178
The syslog message displays the type (alarm or event), the action (raise, clear or acknowledge) when the message corresponds to an alarm, the name of the event, and a detailed message.
170
179
171
180
gNMI clients can subscribe to receive events as they are raised. Subscribing through REST is being evaluated.
@@ -386,7 +395,7 @@ If the flag is set to true, it continues to process the event as follows:
386
395
- If action is RAISE_ALARM, add the record to ALARM table
387
396
- If action is CLEAR_ALARM, remove the entry from ALARM table
388
397
- If action is ACK_ALARM, update is_acknowledged flag of the corresponding raised entry in ALARM table
389
-
-Update system health status
398
+
- Update Alarm Statistics Table
390
399
- Invoke logging API to send a formatted message to syslog
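The steps in this list can be sketched as a single dispatch function. This is an illustrative sketch under stated assumptions, not eventd code: `process_alarm_event`, the record field names (`action`, `event_id`, `source`, `severity`), the plain-dict tables, and the `log` callback are all hypothetical. It also assumes an alarm that was already acknowledged is not subtracted from the statistics a second time when it is later cleared; the document does not state this.

```python
def process_alarm_event(alarm_table, alarm_stats, record, log=print):
    """Apply one alarm event record to the ALARM table and ALARM_STATS.

    alarm_table: dict keyed by (event_id, source), a hypothetical simplification.
    alarm_stats: dict mapping severity -> count of active alarms.
    """
    key = (record["event_id"], record["source"])
    action = record["action"]

    if action == "RAISE_ALARM":
        # Add the record to the ALARM table and count it.
        severity = record["severity"]
        alarm_table[key] = {"severity": severity, "is_acknowledged": False}
        alarm_stats[severity] = alarm_stats.get(severity, 0) + 1
    elif action == "CLEAR_ALARM":
        # Remove the entry from the ALARM table.
        entry = alarm_table.pop(key, None)
        # Assumption: an acknowledged alarm was already subtracted.
        if entry is not None and not entry["is_acknowledged"]:
            alarm_stats[entry["severity"]] -= 1
    elif action == "ACK_ALARM":
        # Update the is_acknowledged flag of the raised entry.
        entry = alarm_table.get(key)
        if entry is not None and not entry["is_acknowledged"]:
            entry["is_acknowledged"] = True
            alarm_stats[entry["severity"]] -= 1
    else:
        raise ValueError(f"unknown action: {action!r}")

    # Stand-in for the logging API call that forwards a message to syslog.
    log(f"{action} {key[0]} from {key[1]}")
```

A caller would pass the same two dicts for every record, for example `process_alarm_event(table, stats, {"action": "RAISE_ALARM", "event_id": "TEMPERATURE_EXCEEDED", "source": "sensor/2", "severity": "critical"})`.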
391
400
392
401
#### 3.1.2.1 Severity
@@ -402,56 +411,60 @@ The alarm consume method on receiving the event record, verifies the event actio
402
411
The counter in ALARM_STATS corresponding to the severity of the incoming alarm is increased by 1.
403
412
404
413
Eventd maintains a lookup map of sequence-id and pair of event-id and source fields.
405
-
An entry for the newly received event with state raised is added to this look up map.
414
+
An entry for the newly received event with action raise is added to this lookup map.
406
415
407
-
If the state is ACK_ALARM, alarm consumer finds the raised record of the alarm in the ALARM table using the above lookup map and updates *is_acknowledged* flag to true.
408
-
If the state is CLEAR_ALARM, it removes the previous raised record of the alarm using above lookup map.
409
-
The counter in ALARM_STATS corresponding to the severity of the updated alarm is reduced by 1.
416
+
- If the action is ACK_ALARM, the alarm consumer finds the raised record of the alarm in the ALARM table using the above lookup map and updates the *is_acknowledged* flag to true.
417
+
- If the action is CLEAR_ALARM, it removes the previous raised record of the alarm using the above lookup map.
418
+
  The counter in ALARM_STATS corresponding to the severity of the updated alarm is reduced by 1.
419
+
- On acknowledging an alarm through CLI/REST/gNMI, ALARM_STATS is updated by reducing the corresponding severity counter by 1.
410
420
411
421
pmon can use ALARM_STATS to update system LED based on severities of outstanding alarms:
412
422
```
413
423
Red if any outstanding critical/major alarms, else Yellow if any minor/warning alarms, else Green.
414
424
```
415
425
An outstanding alarm is an alarm that has been neither cleared nor acknowledged by the user yet.
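The LED rule above can be written as a small function. This is only a sketch: `system_led` and the dict-shaped `alarm_stats` argument are hypothetical, and how pmon actually reads ALARM_STATS is not specified here. The colour names follow the rule as quoted above.

```python
def system_led(alarm_stats):
    """Return the system LED colour for the given per-severity counters.

    alarm_stats: dict mapping severity name -> number of outstanding alarms.
    Missing severities are treated as 0.
    """
    if alarm_stats.get("critical", 0) > 0 or alarm_stats.get("major", 0) > 0:
        return "Red"
    if alarm_stats.get("minor", 0) > 0 or alarm_stats.get("warning", 0) > 0:
        return "Yellow"
    return "Green"


print(system_led({"critical": 0, "major": 1, "minor": 1, "warning": 0}))
```

Because acknowledged and cleared alarms have already been subtracted from the counters, the function needs only ALARM_STATS and never has to scan the ALARM table.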
416
426
417
-
On acknowledging an alarm through CLI/REST/gNMI, ALARM_STATS is updated by reducing the corresponding severity counter by 1.
418
-
This makes acknowledged alarm is taken out of consideration.
419
-
420
-
The following illustrates how severity of alarms in the table controls system LED.
427
+
The following illustrates how pmon can use ALARM_STATS table to control system LED.
421
428
422
429
| ALARM | SEVERITY | IS_ACK |
423
430
|:-----:|:----------:|:-------:|
424
431
||||
425
432
||||
433
+
426
434
Alarm table is empty. All counters in ALARM_STATS are 0. System LED is Green.
427
435
428
436
| ALARM | SEVERITY | IS_ACK |
429
437
|:-----:|:----------:|:-------:|
430
438
| ALM-1 | critical ||
431
439
| ALM-2 | minor ||
440
+
432
441
Alarm table now has two alarms, one *critical* and the other *minor*. ALARM_STATS is updated as: Critical as 1 and Minor as 1. As there is at least one alarm with *critical/major* severity, system LED is Red.
433
442
434
443
| ALARM | SEVERITY | IS_ACK |
435
444
|:-----:|:----------:|:-------:|
436
445
| ALM-2 | minor ||
446
+
437
447
The *critical* alarm is cleared by the application, so the alarm consumer removes it from the ALARM table. ALARM_STATS is updated and now reads: Critical as 0 and Minor as 1. As there is at least one *minor/warning* alarm in the table, system LED is Amber.
438
448
439
449
| ALARM | SEVERITY | IS_ACK |
440
450
|:-----:|:----------:|:-------:|
441
451
| ALM-2 | minor ||
442
452
| ALM-9 | major ||
453
+
443
454
Now there is an alarm with *critical/major* severity. ALARM_STATS now reads as: Major as 1 and Minor as 1. So, system LED is Red.
444
455
445
456
| ALARM | SEVERITY | IS_ACK |
446
457
|:-----:|:----------:|:-------:|
447
458
| ALM-2 | minor ||
448
459
| ALM-9 | major | true |
460
+
449
461
The *major* alarm is acknowledged by the user, so the alarm consumer sets the *is_acknowledged* flag to true and reduces the Major counter in ALARM_STATS by 1. ALARM_STATS now reads: Major as 0 and Minor as 1. This particular alarm is taken out of consideration for system LED. There are no other *critical/major* alarms. There is, however, an alarm with *minor/warning* severity. System LED is Amber.
450
462
451
463
| ALARM | SEVERITY | IS_ACK |
452
464
|:-----:|:----------:|:-------:|
453
465
| ALM-2 | minor | true |
454
466
| ALM-9 | major | true |
467
+
455
468
The *minor* alarm is also acknowledged by the user, so it too is taken out of consideration for system LED. ALARM_STATS now reads: Major as 0 and Minor as 0. System LED is Green.
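The walkthrough above can be replayed in a few lines to check that each table state gives the stated LED colour. This is a sketch under assumptions: the dict-based table, the `is_ack` field name, and the helper names are hypothetical, and instead of maintaining counters incrementally it recomputes what ALARM_STATS should hold from the table (unacknowledged alarms per severity). It uses "Amber" as in the walkthrough.

```python
from collections import Counter


def alarm_stats(alarm_table):
    """Count unacknowledged alarms per severity, as ALARM_STATS should read."""
    return Counter(
        alarm["severity"] for alarm in alarm_table.values() if not alarm["is_ack"]
    )


def led(stats):
    """Map per-severity counters to a system LED colour."""
    if stats["critical"] or stats["major"]:
        return "Red"
    if stats["minor"] or stats["warning"]:
        return "Amber"
    return "Green"


table = {}
history = [led(alarm_stats(table))]            # empty table

table["ALM-1"] = {"severity": "critical", "is_ack": False}
table["ALM-2"] = {"severity": "minor", "is_ack": False}
history.append(led(alarm_stats(table)))        # critical and minor raised

del table["ALM-1"]                             # critical alarm cleared
history.append(led(alarm_stats(table)))

table["ALM-9"] = {"severity": "major", "is_ack": False}
history.append(led(alarm_stats(table)))        # major alarm raised

table["ALM-9"]["is_ack"] = True                # major alarm acknowledged
history.append(led(alarm_stats(table)))

table["ALM-2"]["is_ack"] = True                # minor alarm acknowledged
history.append(led(alarm_stats(table)))

print(history)
# ['Green', 'Red', 'Amber', 'Red', 'Amber', 'Green']
```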
@@ -491,15 +504,15 @@ Feb 09 21:44:07.487906 2021 sonic NOTICE eventd#eventd[21]: [EVENT], %TAM_SWITCH
491
504
```
492
505
Syslog message for an alarm raised by a sensor:
493
506
```
494
-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (raised), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees.
507
+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (raise), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees.
495
508
```
496
509
Syslog message when alarm is cleared is as follows:
497
510
```
498
-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (cleared), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 70 degrees. Temperature threshold is 75 degrees.
511
+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (clear), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 70 degrees. Temperature threshold is 75 degrees.
499
512
```
500
513
Syslog message when alarm is acknowledged is as follows:
501
514
```
502
-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (acknowledged), %TEMPERATURE_EXCEEDED :- acknowledgeAlarm: Alarm 'TEMPERATURE_EXCEEDED' is acknowledged by user 'admin'.
515
+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (acknowledge), %TEMPERATURE_EXCEEDED :- acknowledgeAlarm: Alarm 'TEMPERATURE_EXCEEDED' is acknowledged by user 'admin'.
503
516
```
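A log collector could pick the alarm fields out of these messages with a regular expression. The sketch below is inferred only from the updated example lines above (action shown as raise, clear or acknowledge); the pattern, the function name `parse_alarm_syslog`, and the field names are hypothetical and not part of the framework.

```python
import re

# Matches the "[ALARM] (action), %NAME :- detail" part of the example lines.
ALARM_LINE = re.compile(
    r"\[(?P<type>ALARM)\] \((?P<action>raise|clear|acknowledge)\), "
    r"%(?P<name>\w+) :- (?P<detail>.*)$"
)


def parse_alarm_syslog(line):
    """Return the alarm fields of a syslog line as a dict, or None if no match."""
    match = ALARM_LINE.search(line)
    return match.groupdict() if match else None


sample = (
    "Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: "
    "[ALARM] (raise), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: "
    "Current temperature of sensor/2 is 76 degrees. "
    "Temperature threshold is 75 degrees."
)
fields = parse_alarm_syslog(sample)
print(fields["action"], fields["name"])
```

Lines that use the older wording, such as `(raised)`, do not match this pattern and return `None`.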
504
517
Operator can configure a specific syslog host to receive either syslog messages corresponding to events or general log messages.
505
518
Through CLI, operator can use the 'logging server <ip> [log|event]' command.