Commit 11a93d2

[system-health] No longer check critical process/service status via monit (#9068)
HLD updated here: sonic-net/SONiC#887

#### Why I did it
The command `monit summary -B` can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor critical processes. This PR addresses that. monit is still used by system-health for the file system check and for customized checks.

#### How I did it
1. Get container names from the FEATURE table.
2. For each container, collect critical process names from the file critical_processes.
3. Use `docker exec -it <container_name> bash -c 'supervisorctl status'` to get the status of the processes inside the container, parse the output, and check whether any critical process has exited.

#### How to verify it
1. Add unit test cases to cover it.
2. Adjust sonic-mgmt cases to cover it.
3. Manual test.
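The steps in "How I did it" can be sketched roughly as follows. This is a minimal illustration, not the code added by this commit: it works on captured text rather than invoking docker, the function names are hypothetical, and it assumes that critical_processes holds one `program:<name>` or `group:<name>` entry per line and that each `supervisorctl status` line starts with the process name followed by its state.

```python
def parse_critical_processes(text):
    """Collect program names from the contents of a critical_processes file.

    Assumed format: one 'program:<name>' or 'group:<name>' entry per line.
    Only 'program' entries are collected in this sketch.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, sep, name = line.partition(':')
        if sep and kind.strip() == 'program' and name.strip():
            names.add(name.strip())
    return names


def parse_supervisorctl_status(text):
    """Map each process name to its state, taken from the first two columns."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def find_not_running(critical, states):
    """Return critical processes that are missing or not in the RUNNING state."""
    return sorted(name for name in critical if states.get(name) != 'RUNNING')


# Sample data made up for illustration.
sample_critical = "program:orchagent\nprogram:portsyncd\n"
sample_status = (
    "orchagent                        RUNNING   pid 42, uptime 1:02:03\n"
    "portsyncd                        EXITED    Jan 01 12:00 AM\n"
    "rsyslogd                         RUNNING   pid 17, uptime 1:02:10\n"
)
failed = find_not_running(
    parse_critical_processes(sample_critical),
    parse_supervisorctl_status(sample_status),
)
print(failed)  # ['portsyncd']
```

Keeping the parsing separate from the command invocation makes it straightforward to cover with unit tests, since no container needs to be running.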
1 parent 240596e commit 11a93d2

7 files changed, +624 -29 lines changed

rules/system-health.mk

+1
@@ -4,6 +4,7 @@ SYSTEM_HEALTH = system_health-1.0-py3-none-any.whl
 $(SYSTEM_HEALTH)_SRC_PATH = $(SRC_PATH)/system-health
 $(SYSTEM_HEALTH)_PYTHON_VERSION = 3
 $(SYSTEM_HEALTH)_DEPENDS = $(SONIC_PY_COMMON_PY3) $(SONIC_CONFIG_ENGINE_PY3)
+$(SYSTEM_HEALTH)_DEBS_DEPENDS = $(LIBSWSSCOMMON) $(PYTHON3_SWSSCOMMON)
 SONIC_PYTHON_WHEELS += $(SYSTEM_HEALTH)
 
 export system_health_py3_wheel_path="$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))"

src/system-health/health_checker/manager.py

+12 -9
@@ -1,3 +1,11 @@
+from . import utils
+from .config import Config
+from .health_checker import HealthChecker
+from .service_checker import ServiceChecker
+from .hardware_checker import HardwareChecker
+from .user_defined_checker import UserDefinedChecker
+
+
 class HealthCheckerManager(object):
     """
     Manage all system health checkers and system health configuration.
@@ -10,7 +18,6 @@ def __init__(self):
         self._checkers = []
         self._state = self.STATE_BOOTING
 
-        from .config import Config
         self.config = Config()
         self.initialize()
 
@@ -19,8 +26,6 @@ def initialize(self):
         Initialize the manager. Create service checker and hardware checker by default.
         :return:
         """
-        from .service_checker import ServiceChecker
-        from .hardware_checker import HardwareChecker
         self._checkers.append(ServiceChecker())
         self._checkers.append(HardwareChecker())
 
@@ -31,7 +36,6 @@ def check(self, chassis):
         :return: A tuple. The first element indicate the status of the checker; the second element is a dictionary that
                  contains the status for all objects that was checked.
         """
-        from .health_checker import HealthChecker
         HealthChecker.summary = HealthChecker.STATUS_OK
         stats = {}
         self.config.load_config()
@@ -45,7 +49,6 @@ def check(self, chassis):
             self._do_check(checker, stats)
 
         if self.config.user_defined_checkers:
-            from .user_defined_checker import UserDefinedChecker
             for udc in self.config.user_defined_checkers:
                 checker = UserDefinedChecker(udc)
                 self._do_check(checker, stats)
@@ -71,20 +74,20 @@ def _do_check(self, checker, stats):
             else:
                 stats[category].update(info)
         except Exception as e:
-            from .health_checker import HealthChecker
+            HealthChecker.summary = HealthChecker.STATUS_NOT_OK
             error_msg = 'Failed to perform health check for {} due to exception - {}'.format(checker, repr(e))
             entry = {str(checker): {
                 HealthChecker.INFO_FIELD_OBJECT_STATUS: HealthChecker.STATUS_NOT_OK,
-                HealthChecker.INFO_FIELD_OBJECT_MSG: error_msg
+                HealthChecker.INFO_FIELD_OBJECT_MSG: error_msg,
+                HealthChecker.INFO_FIELD_OBJECT_TYPE: "Internal"
             }}
             if 'Internal' not in stats:
                 stats['Internal'] = entry
             else:
                 stats['Internal'].update(entry)
 
     def _is_system_booting(self):
-        from .utils import get_uptime
-        uptime = get_uptime()
+        uptime = utils.get_uptime()
         if not self.boot_timeout:
             self.boot_timeout = self.config.get_bootup_timeout()
         booting = uptime < self.boot_timeout
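
The effect of the `_do_check` change can be seen in a small stand-alone sketch. It is only an approximation under stated assumptions: the `HealthChecker` class below is a stand-in whose constant values are guesses, and `BrokenChecker` and `do_check` are hypothetical names introduced for illustration. The point it shows is that a checker raising an exception now sets the overall summary to not OK and is recorded under the 'Internal' category with a type field.

```python
class HealthChecker(object):
    # Stand-in for health_checker.HealthChecker; the values are assumptions.
    STATUS_OK = 'OK'
    STATUS_NOT_OK = 'Not OK'
    INFO_FIELD_OBJECT_STATUS = 'status'
    INFO_FIELD_OBJECT_MSG = 'message'
    INFO_FIELD_OBJECT_TYPE = 'type'
    summary = STATUS_OK


class BrokenChecker(object):
    # Hypothetical checker that always fails, used to trigger the except path.
    def check(self, config):
        raise RuntimeError('boom')

    def get_category(self):
        return 'Services'

    def get_info(self):
        return {}

    def __str__(self):
        return 'BrokenChecker'


def do_check(checker, stats, config=None):
    """Run one checker and merge its result, or an error entry, into stats."""
    try:
        checker.check(config)
        category = checker.get_category()
        info = checker.get_info()
        if category not in stats:
            stats[category] = info
        else:
            stats[category].update(info)
    except Exception as e:
        # Mark the overall summary as failed, as the new line in the diff does.
        HealthChecker.summary = HealthChecker.STATUS_NOT_OK
        error_msg = 'Failed to perform health check for {} due to exception - {}'.format(checker, repr(e))
        entry = {str(checker): {
            HealthChecker.INFO_FIELD_OBJECT_STATUS: HealthChecker.STATUS_NOT_OK,
            HealthChecker.INFO_FIELD_OBJECT_MSG: error_msg,
            HealthChecker.INFO_FIELD_OBJECT_TYPE: 'Internal'
        }}
        if 'Internal' not in stats:
            stats['Internal'] = entry
        else:
            stats['Internal'].update(entry)


stats = {}
do_check(BrokenChecker(), stats)
print(HealthChecker.summary)
print(stats['Internal']['BrokenChecker']['type'])
```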
