Commit d72a14a

updated with few typos
Signed-off-by: Srinadh Penugonda <[email protected]>
1 parent 52c3e65 commit d72a14a

2 files changed

+33
-20
lines changed

doc/event-alarm-framework/event-alarm-framework.md

Lines changed: 33 additions & 20 deletions
@@ -96,10 +96,8 @@ Such a change has an important metric called *severity* to indicate how critical
9696
After the application recovers from the condition, that alarm is *cleared* by sending an event with an action: clear.
9797
An operator could *acknowledge* an alarm. This indicates that the user is aware of the faulty condition.
9898

99-
![Alarm Life Cycle](event-alarm-framework-alarm-lifecycle.png)
100-
101-
Overall system LED state is deduced from the severities of alarms.
102-
An acknowledged alarm is taken out of consideration from deciding system LED state.
99+
Overall system LED state can be deduced from the severities of alarms.
100+
An acknowledged alarm should be taken out of consideration from deciding system LED state.
103101

104102
Both events and alarms get recorded in the EVENT_DB.
105103

@@ -116,10 +114,19 @@ Both events and alarms get recorded in the EVENT_DB.
116114

117115
In effect, the ALARM table contains outstanding alarms that need to be cleared by the components that raised them.
118116
This table is NOT persisted and its contents are cleared with a reload.
119-
117+
120118
In summary, the framework provides both current and historical event status of software and physical entities of the system through ALARM and EVENT tables.
121119

122-
Statistics on number of alarms based on severity are maintained in ALARM_STATS table. An alarm that is cleared or acknowledged reduces the corresponding counter to be reduced by 1.
120+
In addition to the above tables, the framework maintains various statistics.
121+
122+
1. Event Statistics Table
123+
Statistics on number of events and alarms are maintained in EVENT_STATS table.
124+
125+
2. Alarm Statistics Table
126+
Statistics on number of alarms based on severity are maintained in ALARM_STATS table.
127+
When an application raises an alarm, the counter corresponding to the alarm's severity is increased by 1.
128+
When the alarm is cleared or acknowledged, the corresponding counter will be reduced by 1.
129+
This table categorizes "active" alarms per severity.
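
The counter behaviour described for ALARM_STATS can be sketched as follows. This is a minimal in-memory illustration, not the framework's implementation: the `AlarmStats` type and its method names are hypothetical, and the real table lives in the database rather than in a `std::map`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the ALARM_STATS table:
// one counter of "active" alarms per severity.
struct AlarmStats {
    std::map<std::string, int> active;

    // Raising an alarm increases the counter for its severity by 1.
    void on_raise(const std::string& severity) { ++active[severity]; }

    // Clearing or acknowledging an alarm reduces the counter by 1.
    void on_clear_or_ack(const std::string& severity) {
        assert(active[severity] > 0 && "no active alarm of this severity");
        --active[severity];
    }

    // Number of active alarms for a severity (0 if none were ever raised).
    int count(const std::string& severity) const {
        auto it = active.find(severity);
        return it == active.end() ? 0 : it->second;
    }
};
```

Raising one critical and one minor alarm and then clearing the critical one leaves `count("critical")` at 0 and `count("minor")` at 1.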
123130

124131
As mentioned above, each event has an important characteristic: severity. SONiC uses the following severities for events and alarms.
125132

@@ -134,6 +141,8 @@ As mentioned above, each event has an important characteristic: severity. SONiC
134141
- informational : Does not impact performance. NOT applicable to alarms.
135142
( maps to log-notice )
136143

144+
![Alarm Life Cycle](event-alarm-framework-alarm-lifecycle.png)
145+
137146
By default every event will have a severity assigned by the component. The framework provides Event Profiles to customize severity of an event and also disable an event.
138147

139148
An example of event profile is as below:
@@ -165,7 +174,7 @@ This modified file can then be uploaded to the device.
165174
Operator can select any of these custom event profiles to change default properties of events.
166175
The selected profile is persistent across reboots and will be in effect until operator selects either default or another custom profile.
167176

168-
In addition to storing events in DB, framework forwards log messages corresponding to all the events to syslog.
177+
In addition to storing events in EVENT_DB, framework forwards log messages corresponding to all the events to syslog.
169178
The syslog message displays the type (alarm or event), the action (raise, clear or acknowledge) when the message corresponds to an event of an alarm, the name of the event, and the detailed message.
170179

171180
gNMI clients can subscribe to receive events as they are raised. Subscribing through REST is being evaluated.
@@ -386,7 +395,7 @@ If the flag is set to true, it continues to process the event as follows:
386395
- If action is RAISE_ALARM, add the record to ALARM table
387396
- If action is CLEAR_ALARM, remove the entry from ALARM table
388397
- If action is ACK_ALARM, update is_acknowledged flag of the corresponding raised entry in ALARM table
389-
- Update system health status
398+
- Update Alarm Statistics Table
390399
- Invoke logging API to send a formatted message to syslog
391400

392401
#### 3.1.2.1 Severity
@@ -402,56 +411,60 @@ The alarm consume method on receiving the event record, verifies the event actio
402411
The counter in ALARM_STATS corresponding to the severity of the incoming alarm is increased by 1.
403412

404413
Eventd maintains a lookup map of sequence-id and pair of event-id and source fields.
405-
An entry for the newly received event with state raised is added to this look up map.
414+
An entry for the newly received event with action raise is added to this lookup map.
406415

407-
If the state is ACK_ALARM, alarm consumer finds the raised record of the alarm in the ALARM table using the above lookup map and updates *is_acknowledged* flag to true.
408-
If the state is CLEAR_ALARM, it removes the previous raised record of the alarm using above lookup map.
409-
The counter in ALARM_STATS corresponding to the severity of the updated alarm is reduced by 1.
416+
. If the action is ACK_ALARM, alarm consumer finds the raised record of the alarm in the ALARM table using the above lookup map and updates *is_acknowledged* flag to true.
417+
. If the action is CLEAR_ALARM, it removes the previous raised record of the alarm using above lookup map.
418+
The counter in ALARM_STATS corresponding to the severity of the updated alarm is reduced by 1.
419+
. On acknowledging an alarm through CLI/REST/gNMI, ALARM_STATS is updated by reducing the corresponding severity counter by 1.
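
The handling of raise, acknowledge and clear actions described above can be sketched as below. This is an illustration under stated assumptions, not eventd's code: `AlarmConsumer`, `AlarmRecord` and all member names are hypothetical, the lookup map is assumed to go from the (event-id, source) pair to the sequence-id of the raised record, and an alarm that was already acknowledged is assumed not to reduce the counter a second time when it is later cleared (the text does not specify this case).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

enum class Action { RAISE_ALARM, ACK_ALARM, CLEAR_ALARM };

// Hypothetical shape of a raised entry in the ALARM table.
struct AlarmRecord {
    std::string severity;
    bool is_acknowledged;
};

class AlarmConsumer {
public:
    using Key = std::pair<std::string, std::string>;  // (event-id, source)

    void consume(Action action, std::uint64_t seq_id, const std::string& event_id,
                 const std::string& source, const std::string& severity = "") {
        const Key key{event_id, source};
        if (action == Action::RAISE_ALARM) {
            lookup_[key] = seq_id;  // remember which record was raised
            alarms_[seq_id] = AlarmRecord{severity, false};
            ++stats_[severity];
            return;
        }
        auto found = lookup_.find(key);
        if (found == lookup_.end()) return;  // nothing raised for this key
        auto rec = alarms_.find(found->second);
        assert(rec != alarms_.end());
        if (action == Action::ACK_ALARM) {
            if (!rec->second.is_acknowledged) {
                rec->second.is_acknowledged = true;
                --stats_[rec->second.severity];
            }
        } else {  // CLEAR_ALARM
            // Assumption: an acknowledged alarm was already taken out of the counter.
            if (!rec->second.is_acknowledged) --stats_[rec->second.severity];
            alarms_.erase(rec);
            lookup_.erase(found);
        }
    }

    // Active (outstanding) alarm count for a severity.
    int active(const std::string& severity) const {
        auto it = stats_.find(severity);
        return it == stats_.end() ? 0 : it->second;
    }

    bool is_raised(const std::string& event_id, const std::string& source) const {
        return lookup_.count(Key{event_id, source}) > 0;
    }

    bool is_acknowledged(const std::string& event_id, const std::string& source) const {
        auto found = lookup_.find(Key{event_id, source});
        return found != lookup_.end() && alarms_.at(found->second).is_acknowledged;
    }

private:
    std::map<Key, std::uint64_t> lookup_;          // (event-id, source) -> sequence-id
    std::map<std::uint64_t, AlarmRecord> alarms_;  // stand-in for the ALARM table
    std::map<std::string, int> stats_;             // stand-in for ALARM_STATS
};
```

Keying the lookup map on (event-id, source) lets a later clear or acknowledge event, which carries its own sequence-id, find the record that was stored when the alarm was raised.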
410420

411421
pmon can use ALARM_STATS to update system LED based on severities of outstanding alarms:
412422
```
413423
Red if any outstanding critical/major alarms, else Yellow if any minor/warning alarms, else Green.
414424
```
415425
An outstanding alarm is an alarm that has been neither cleared nor acknowledged by the user yet.
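
The LED rule quoted above can be expressed as a small function over the per-severity counters. This is only a sketch: `SeverityCounters`, `count_of` and `system_led` are hypothetical names, and how pmon actually reads ALARM_STATS and drives the LED is outside what this document specifies.

```cpp
#include <cassert>
#include <map>
#include <string>

// Counters keyed by severity name, mirroring ALARM_STATS (hypothetical layout).
using SeverityCounters = std::map<std::string, int>;

// Returns the counter for a severity, or 0 when the severity has no entry.
int count_of(const SeverityCounters& stats, const std::string& severity) {
    auto it = stats.find(severity);
    if (it == stats.end()) return 0;
    assert(it->second >= 0 && "counters are never negative");
    return it->second;
}

// Red if any outstanding critical/major alarms,
// else Yellow if any minor/warning alarms, else Green.
std::string system_led(const SeverityCounters& stats) {
    if (count_of(stats, "critical") > 0 || count_of(stats, "major") > 0) return "Red";
    if (count_of(stats, "minor") > 0 || count_of(stats, "warning") > 0) return "Yellow";
    return "Green";
}
```

Because acknowledged alarms have already been subtracted from the counters, the function needs no knowledge of the is_acknowledged flag.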
416426

417-
On acknowledging an alarm through CLI/REST/gNMI, ALARM_STATS is updated by reducing the corresponding severity counter by 1.
418-
This makes acknowledged alarm is taken out of consideration.
419-
420-
The following illustrates how severity of alarms in the table controls system LED.
427+
The following illustrates how pmon can use ALARM_STATS table to control system LED.
421428

422429
| ALARM | SEVERITY | IS_ACK |
423430
|:-----:|:----------:|:-------:|
424431
| | | |
425432
| | | |
433+
426434
Alarm table is empty. All counters in ALARM_STATS are 0. System LED is Green.
427435

428436
| ALARM | SEVERITY | IS_ACK |
429437
|:-----:|:----------:|:-------:|
430438
| ALM-1 | critical | |
431439
| ALM-2 | minor | |
440+
432441
Alarm table now has two alarms, one with *critical* severity and the other with *minor*. ALARM_STATS is updated as: Critical as 1 and Minor as 1. As there is at least one alarm with *critical/major* severity, system LED is Red.
433442

434443
| ALARM | SEVERITY | IS_ACK |
435444
|:-----:|:----------:|:-------:|
436445
| ALM-2 | minor | |
446+
437447
The *critical* alarm is cleared by the application, so alarm consumer removes it from the ALARM table. ALARM_STATS is updated and now reads: Critical as 0 and Minor as 1. As there is at least one *minor/warning* alarm in the table, system LED is Amber.
438448

439449
| ALARM | SEVERITY | IS_ACK |
440450
|:-----:|:----------:|:-------:|
441451
| ALM-2 | minor | |
442452
| ALM-9 | major | |
453+
443454
Now there is an alarm with *critical/major* severity. ALARM_STATS now reads as: Major as 1 and Minor as 1. So, system LED is Red.
444455

445456
| ALARM | SEVERITY | IS_ACK |
446457
|:-----:|:----------:|:-------:|
447458
| ALM-2 | minor | |
448459
| ALM-9 | major | true |
460+
449461
The *major* alarm is acknowledged by the user, so alarm consumer sets the *is_acknowledged* flag to true and reduces the Major counter in ALARM_STATS by 1. ALARM_STATS now reads: Major as 0 and Minor as 1. This particular alarm is taken out of consideration for system LED. There are no other *critical/major* alarms; there is, however, an alarm with *minor/warning* severity. System LED is Amber.
450462

451463
| ALARM | SEVERITY | IS_ACK |
452464
|:-----:|:----------:|:-------:|
453465
| ALM-2 | minor | true |
454466
| ALM-9 | major | true |
467+
455468
The *minor* alarm is also acknowledged by the user, so it too is taken out of consideration for system LED. ALARM_STATS now reads: Major as 0 and Minor as 0. System LED is Green.
456469

457470
### 3.1.4 Event Receivers
@@ -481,7 +494,7 @@ if (ev_state.empty()) {
481494
// raise a syslog message
482495
syslog(LOG_MAKEPRI(ev_sev, SYSLOG_FACILITY),
483496
LOG_FORMAT,
484-
ev_type.c_str(), ev_state.c_str(), ev_id.c_str(), ev_msg.c_str(), ev_static_msg.c_str());
497+
ev_type.c_str(), ev_action.c_str(), ev_id.c_str(), ev_msg.c_str(), ev_static_msg.c_str());
485498
}
486499
487500
```
@@ -491,15 +504,15 @@ Feb 09 21:44:07.487906 2021 sonic NOTICE eventd#eventd[21]: [EVENT], %TAM_SWITCH
491504
```
492505
Syslog message for an alarm raised by a sensor:
493506
```
494-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (raised), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees.
507+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (raise), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees.
495508
```
496509
Syslog message when alarm is cleared is as follows:
497510
```
498-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (cleared), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 70 degrees. Temperature threshold is 75 degrees.
511+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (clear), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 70 degrees. Temperature threshold is 75 degrees.
499512
```
500513
Syslog message when alarm is acknowledged is as follows:
501514
```
502-
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (acknowledged), %TEMPERATURE_EXCEEDED :- acknowledgeAlarm: Alarm 'TEMPERATURE_EXCEEDED' is acknowledged by user 'admin'.
515+
Feb 10 16:24:42.148610 2021 sonic ALERT eventd#eventd[125]: [ALARM] (acknowledge), %TEMPERATURE_EXCEEDED :- acknowledgeAlarm: Alarm 'TEMPERATURE_EXCEEDED' is acknowledged by user 'admin'.
503516
```
504517
Operator can configure a specific syslog host to receive either syslog messages corresponding to events or general log messages.
505518
Through CLI, operator can use the 'logging server <ip> [log|event]' command.
