doc/event-alarm-framework/event-alarm-framework.md
Lines changed: 17 additions & 16 deletions
@@ -87,16 +87,16 @@ Such a change has an important metric called *severity* to indicate how critical
87
87
Out of memory, temperature crossing a threshold, and so on, are examples of conditions when the alarms are raised.
88
88
Such conditions are dynamic: a faulty software/hardware component encounters such a condition and **may** come out of it when the condition is resolved.
89
89
90
-
Events are sent as the condition progresses through being raised and cleared in addition to operator acknowledging the condition.
90
+
Events are sent as the condition progresses through being raised and cleared in addition to operator acknowledging it.
91
91
So, these events have a field called *action*: raise, clear OR acknowledge.
92
92
93
93
Each such event for an alarm is characterized by "action" in addition to "severity".
94
94
95
-
An application *raises* an alarm when it encounters a faulty condition by sending an event with a action: raise.
96
-
After the application recovers from the condition, that alarm is *cleared* by sending an event with a action: clear.
95
+
An application *raises* an alarm when it encounters a faulty condition by sending an event with action: *raise*.
96
+
After the application recovers from the condition, that alarm is *cleared* by sending an event with action: *clear*.
97
97
An operator could *acknowledge* an alarm. This indicates that the user is aware of the faulty condition.
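The raise, clear, and acknowledge life cycle described above can be sketched as a small data model. This is only an illustration under stated assumptions: the `Action` enum, the `Event` class, its field names, and the alarm name `TEMPERATURE_HIGH` are hypothetical and are not taken from the framework's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """The three actions an alarm event can carry."""
    RAISE = "raise"
    CLEAR = "clear"
    ACKNOWLEDGE = "acknowledge"


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions for illustration."""
    name: str
    severity: str
    action: Optional[Action] = None  # None models a plain event with no alarm state


# One alarm's life cycle, expressed as the events sent for it.
lifecycle = [
    Event("TEMPERATURE_HIGH", "critical", Action.RAISE),        # application hits the condition
    Event("TEMPERATURE_HIGH", "critical", Action.ACKNOWLEDGE),  # operator is aware of it
    Event("TEMPERATURE_HIGH", "critical", Action.CLEAR),        # application has recovered
]

actions = [e.action.value for e in lifecycle]
print(actions)
```

An event that is not tied to an alarm simply leaves `action` unset in this model.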
98
98
99
-
Overall system LED state can be deduced from the severities of alarms.
99
+
System LED state can be deduced from the severities of alarms.
100
100
An acknowledged alarm should be taken out of consideration when deciding system LED state.
101
101
102
102
Both events and alarms get recorded in the EVENT_DB.
@@ -109,10 +109,10 @@ Both events and alarms get recorded in the EVENT_DB.
109
109
2. Current Alarm Table
110
110
111
111
All events with an action field of *raise* get recorded in a table named "ALARM", in addition to getting recorded in the Event History Table (only events corresponding to an alarm have state).
112
-
When a component that raised the alarm clears it ( by sending an event with action *clear* ), the alarm record is removed from ALARM table.
112
+
When the application that raised the alarm clears it (by sending an event with action *clear*), the alarm record is removed from the ALARM table.
113
113
A user acknowledging a particular alarm will NOT remove that alarm record from this table.
114
114
115
-
In effect, ALARM table contains outstanding alarms that need to be cleared by those components who raised them.
115
+
In effect, the ALARM table contains outstanding alarms that need to be cleared by the applications that raised them.
116
116
This table is NOT persisted and its contents are cleared with a reload.
117
117
118
118
In summary, the framework provides both current and historical event status of software and physical entities of the system through ALARM and EVENT tables.
@@ -125,7 +125,7 @@ In addition to the above tables, the framework maintains various statistics.
125
125
126
126
2. Alarm Statistics Table
127
127
128
-
Statistics on number of alarms based on severity are maintained in ALARM_STATS table.
128
+
Statistics on number of alarms per severity are maintained in ALARM_STATS table.
129
129
When an application raises an alarm, the counter corresponding to the alarm's severity is increased by 1.
130
130
When the alarm is cleared or acknowledged, the corresponding counter will be reduced by 1.
131
131
This table categorizes "active" alarms per severity.
@@ -173,7 +173,7 @@ An example of event profile is as below:
173
173
The framework maintains default event profile at /etc/sonic/evprofile/default.json.
174
174
Operator can download default event profile to a remote host.
175
175
This downloaded file can be modified by changing the severity or enable flag of event(s).
176
-
This modified file can then be uploaded to the device.
176
+
This modified file can then be uploaded to /etc/sonic/evprofile/ on the device.
177
177
Operator can select any of these custom event profiles to change default properties of events.
178
178
The selected profile is persistent across reboots and will be in effect until operator selects either default or another custom profile.
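As a sketch of the edit-and-upload step, the following checks a modified profile before it is uploaded. It assumes each entry under `events` carries `name`, `severity`, and `enable` fields, with `enable` stored as a string. The field names, the `validate_profile` helper, and the sample event names are assumptions made for illustration; only the allowed severity and enable values come from the `__README__` text of default.json.

```python
import json

SUPPORTED_SEVERITIES = {"critical", "major", "minor", "warning", "informational"}
SUPPORTED_ENABLE = {"true", "false"}


def validate_profile(profile: dict) -> list:
    """Return a list of problems found in a profile; an empty list means none were found."""
    problems = []
    for entry in profile.get("events", []):
        name = entry.get("name", "<unnamed>")
        if entry.get("severity") not in SUPPORTED_SEVERITIES:
            problems.append(f"{name}: unsupported severity {entry.get('severity')!r}")
        if entry.get("enable") not in SUPPORTED_ENABLE:
            problems.append(f"{name}: unsupported enable flag {entry.get('enable')!r}")
    return problems


# Hypothetical profile contents; the event names are made up.
custom = json.loads("""
{
  "events": [
    {"name": "EXAMPLE_EVENT_A", "severity": "minor", "enable": "true"},
    {"name": "EXAMPLE_EVENT_B", "severity": "fatal", "enable": "yes"}
  ]
}
""")

problems = validate_profile(custom)
for problem in problems:
    print(problem)
```

Running a check like this on the remote host catches an unsupported value before the file reaches the device.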
179
179
@@ -239,14 +239,15 @@ Application owners need to identify various conditions that would be of interest
239
239
240
240
### 1.2.1 Basic Approach
241
241
The feature involves new development.
242
-
A new DB by name - EVENT_DB - is created using redis2 instance to "house" various tables used by the framework.
242
+
A new DB named EVENT_DB is created using the redis2 instance to host the various tables of the framework.
243
243
Applications act as producers by writing to a table in EVENT_DB with the help of event notify library.
244
244
Eventd reads new record in the table and processes it:
245
245
It saves the entry in the event history table; if the event has an action and the action is *raise*, the record is added to the ALARM table and the severity counter in ALARM_STATS is increased by 1.
246
246
If the received event action is *clear*, the record in the ALARM table is removed and the severity counter of that alarm in ALARM_STATS is reduced by 1.
247
247
If eventd receives an event with action *acknowledge*, severity counter in ALARM_STATS is reduced by 1.
248
248
Eventd then informs logging API to format the log message and send the message to syslog.
249
-
Any applications like pmon can subscribe to tables like ALARM_STATS to update its state.
249
+
250
+
Applications such as pmon can subscribe to tables such as ALARM_STATS and act on changes.
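The processing steps above can be sketched with in-memory stand-ins for the EVENT_DB tables. This is a simplified model, not eventd itself: the class name, the record field names, and the dictionary layout are assumptions. The guard that avoids decrementing a counter twice when an acknowledged alarm is later cleared is also an assumption, since the text does not say how that case is handled.

```python
from collections import Counter


class EventdSketch:
    """In-memory stand-in for the EVENT_DB tables; not the real eventd."""

    def __init__(self):
        self.history = []             # event history table
        self.alarms = {}              # ALARM table, keyed by alarm name
        self.alarm_stats = Counter()  # ALARM_STATS: active alarms per severity

    def process(self, record: dict) -> None:
        self.history.append(record)   # every event is saved to history
        action = record.get("action")
        name = record["name"]
        if action == "raise":
            self.alarms[name] = {"severity": record["severity"], "acknowledged": False}
            self.alarm_stats[record["severity"]] += 1
        elif action == "clear":
            alarm = self.alarms.pop(name, None)
            # Assumption: skip the decrement if an acknowledge already applied it.
            if alarm is not None and not alarm["acknowledged"]:
                self.alarm_stats[alarm["severity"]] -= 1
        elif action == "acknowledge":
            alarm = self.alarms.get(name)
            if alarm is not None and not alarm["acknowledged"]:
                alarm["acknowledged"] = True  # the record stays in the ALARM table
                self.alarm_stats[alarm["severity"]] -= 1


eventd = EventdSketch()
eventd.process({"name": "ALM-1", "severity": "critical", "action": "raise"})
eventd.process({"name": "ALM-2", "severity": "minor", "action": "raise"})
eventd.process({"name": "ALM-1", "action": "clear"})
print(dict(eventd.alarm_stats))
```

A subscriber such as pmon would only need to read the per-severity counters, not the individual alarm records.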
250
251
251
252
### 1.2.2 Container
252
253
A new container by name, eventd, is created to hold event consumer logic.
@@ -295,7 +296,7 @@ Developers of new events or alarms need to update this file by declaring name an
295
296
296
297
```
297
298
{
298
-
"__README__": "This is default map of events that eventd uses. Developer can modify this file and send SIGINT to eventd to make it read and use the updated file. Alternatively developer can test the new event by adding it to a custom event profile and use 'event profile <filename>' command to apply that profile without a eventd restart. Developer need to commit default.json file with the new event after testing it out. Supported severities are: 'critical', 'major', 'minor', 'warning' and 'informational'. Supported enable flag values are: 'true' and 'false'.",
299
+
"__README__": "This is default map of events that eventd reads on bootup and uses while events are raised. Developer can modify this file and send SIGINT to eventd during run-time to read and use the updated file. Alternatively developer can test the new event by adding it to a custom event profile and use 'event profile <filename>' command. This apples that profile without a eventd restart. Developer need to commit default.json file with the new event after testing it out. Supported severities are: 'critical', 'major', 'minor', 'warning' and 'informational'. Supported enable flag values are: 'true' and 'false'.",
299
300
300
301
"events": [
301
302
{
@@ -384,8 +385,8 @@ any conflicts across multiple applications trying to write to this table.
384
385
### 3.1.2 Event Consumer
385
386
The event consumer is a class in sonic-eventd container that processes the incoming record.
386
387
387
-
On bootup, event consumer reads */etc/sonic/evprofile/default.json* and builds an internal map of events, called *static_event_map*.
388
-
It then reads from EVENTPUBSUB table. This table contains records that are published by applications and waiting to be received by event consumer.
388
+
On initialization, event consumer reads */etc/sonic/evprofile/default.json* and builds an internal map of events, called *static_event_map*.
389
+
It then reads from EVENTPUBSUB table. This table contains records that are published by applications and waiting to be received by eventd.
389
390
Whenever there is a new record, event consumer reads the record, processes and deletes it.
390
391
391
392
On reading the field value tuple, using the event-id in the record, event consumer fetches static information from *static_event_map*.
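A minimal sketch of that lookup follows, assuming the profile's `events` entries are keyed by a `name` field and that the incoming record carries its event-id under an `id` key. Both key names, the helper functions, and the sample event are assumptions for illustration, not the event consumer's actual code.

```python
import json


def build_static_event_map(profile_text: str) -> dict:
    """Index the profile's events by name so each lookup is a single dict access."""
    profile = json.loads(profile_text)
    return {entry["name"]: entry for entry in profile.get("events", [])}


def enrich(record: dict, static_event_map: dict) -> dict:
    """Return a copy of the record with static info for its event-id merged in."""
    static_info = static_event_map.get(record["id"])
    if static_info is None:
        raise KeyError(f"unknown event-id: {record['id']}")
    return {**record, "severity": static_info["severity"], "enable": static_info["enable"]}


# Hypothetical profile contents standing in for default.json.
PROFILE_TEXT = """
{
  "events": [
    {"name": "EXAMPLE_EVENT", "severity": "warning", "enable": "true"}
  ]
}
"""

static_event_map = build_static_event_map(PROFILE_TEXT)
enriched = enrich({"id": "EXAMPLE_EVENT", "resource": "example"}, static_event_map)
print(enriched["severity"])
```

Building the map once at initialization keeps the per-record work down to one dictionary lookup.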
@@ -428,7 +429,7 @@ pmon can use ALARM_STATS to update system LED based on severities of outstanding
428
429
```
429
430
An outstanding alarm is an alarm that has been neither cleared nor acknowledged by the user yet.
430
431
431
-
The following illustrates how ALARM table is updated as alarms are raised and how pmon can use ALARM_STATS table to control system LED.
432
+
The following illustrates how ALARM table is updated as alarms go through their life cycle and how pmon can use ALARM_STATS table to control system LED.
432
433
433
434
| ALARM | SEVERITY | IS_ACK |
434
435
|:-----:|:----------:|:-------:|
@@ -448,7 +449,7 @@ Alarm table now has two alarms. One with *critical* and other with *minor*. ALAR
448
449
|:-----:|:----------:|:-------:|
449
450
| ALM-2 | minor ||
450
451
451
-
The *critical* alarm is cleared by the application, so alarm consumer removes it from ALARM table, ALARM_STATS is updated and it reads: Critical as 0 and Minor as 1. As there is at least one *minor/warning* alarms in the table, system LED is Amber.
452
+
The *critical* alarm is cleared by the application, so alarm consumer removes it from ALARM table and ALARM_STATS is updated to: Critical 0 and Minor 1. As there is at least one *minor/warning* alarm in the table, system LED is Amber.
452
453
453
454
| ALARM | SEVERITY | IS_ACK |
454
455
|:-----:|:----------:|:-------:|
@@ -462,7 +463,7 @@ Now there is an alarm with *critical/major* severity. ALARM_STATS now reads as:
462
463
| ALM-2 | minor ||
463
464
| ALM-9 | major | true |
464
465
465
-
The *major* alarm is acknowledged by user, alarm consumer sets *is_acknolwedged* flag to true and reduces Major counter in ALARM_STATS by 1, ALARM_STATS now reads as: Major 0 and Minor 1. The acknowledged major alarm is taken out of consideration for system LED. There are no other *critical/major* alarms. There however, exists an alarm with *minor/warning* severity. System LED is Amber.
466
+
The *major* alarm is acknowledged by the user, so alarm consumer sets the *is_acknowledged* flag to true and reduces the Major counter in ALARM_STATS by 1. ALARM_STATS now reads as: Major 0 and Minor 1. This way, the acknowledged major alarm has no effect on system LED. There are no other *critical/major* alarms, but there is an alarm with *minor/warning* severity, so system LED is Amber.
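The walk-through can be condensed into a small function over the ALARM_STATS counters. Only the Amber case is stated in this section; the `"red"` and `"green"` results, the function name, and the dictionary shape are assumptions made for the sketch.

```python
def system_led(alarm_stats: dict) -> str:
    """Pick an LED color from per-severity counts of outstanding alarms."""
    if alarm_stats.get("critical", 0) > 0 or alarm_stats.get("major", 0) > 0:
        return "red"    # assumption: the color for critical/major is not given here
    if alarm_stats.get("minor", 0) > 0 or alarm_stats.get("warning", 0) > 0:
        return "amber"  # stated in the walk-through above
    return "green"      # assumption: no outstanding alarms


# Counts after the major alarm is acknowledged: Major 0 and Minor 1.
led = system_led({"critical": 0, "major": 0, "minor": 1})
print(led)  # prints "amber"
```

Because acknowledged alarms are already subtracted from the counters, the function never needs to inspect the ALARM table itself.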