system_health/test_system_health.py::test_service_checker_with_process_exit fails with "AssertionError: ... is not recorded" #7832

Closed
@kartik-arista

Description

`system_health/test_system_health.py::test_service_checker_with_process_exit` has started to fail in the latest sonic-mgmt runs. This seems to be fallout from sonic-net/sonic-buildimage#13497.

Steps to reproduce the issue:

Just run the test.

Describe the results you received:

duthosts = [<MultiAsicSonicHost cmp210-3>, <MultiAsicSonicHost cmp210-4>, <MultiAsicSonicHost cmp210-5>, <MultiAsicSonicHost cmp210>], enum_rand_one_per_hwsku_hostname = 'cmp210'

    @pytest.mark.disable_loganalyzer
    def test_service_checker_with_process_exit(duthosts, enum_rand_one_per_hwsku_hostname):
        duthost = duthosts[enum_rand_one_per_hwsku_hostname]
        wait_system_health_boot_up(duthost)
        with ConfigFileContext(duthost, os.path.join(FILES_DIR, IGNORE_DEVICE_CHECK_CONFIG_FILE)):
            processes_status = duthost.all_critical_process_status()
            containers = [x for x in list(processes_status.keys()) if "syncd" not in x and "database" not in x and
                          "bgp" not in x and "swss" not in x]
            logging.info('Test containers: {}'.format(containers))
            random.shuffle(containers)
            for container in containers:
                running_critical_process = processes_status[container]['running_critical_process']
                if not running_critical_process:
                    continue

                critical_process = random.sample(running_critical_process, 1)[0]
                with ProcessExitContext(duthost, container, critical_process):
                    # use wait_until to check if SYSTEM_HEALTH_INFO has expected content
                    # avoid waiting for too long or DEFAULT_INTERVAL is not long enough to refresh db
                    category = '{}:{}'.format(container, critical_process)
                    expected_value = "'{}' is not running".format(critical_process)
                    result = wait_until(WAIT_TIMEOUT, 10, 2, check_system_health_info, duthost, category, expected_value)
>                   assert result == True, '{} is not recorded'.format(critical_process)
E                   AssertionError: tlm_teamd is not recorded

Describe the results you expected:

Test should pass.

The root cause is that the expected string

expected_value = "'{}' is not running".format(critical_process)

no longer matches the string stored by the service health checker in STATE_DB. Adjusting the expected string to match the new one gets the test passing again.
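
While the message format is in transition, one option might be to accept more than one wording in the comparison. The sketch below is only an illustration of that idea: `status_matches` and `NEW_TEMPLATE_PLACEHOLDER` are hypothetical names, and the placeholder text is invented for the example. It is not the actual string written after sonic-net/sonic-buildimage#13497. Only the old template comes from the test above.

```python
# Old wording, as used by the test shown above.
OLD_TEMPLATE = "'{}' is not running"

# Placeholder only: NOT the real new wording. Substitute the actual
# string that the service checker writes to STATE_DB.
NEW_TEMPLATE_PLACEHOLDER = "<new wording for {}>"


def status_matches(recorded, process, templates):
    """Return True if `recorded` equals any template filled in with `process`."""
    return any(recorded == template.format(process) for template in templates)


templates = [OLD_TEMPLATE, NEW_TEMPLATE_PLACEHOLDER]
print(status_matches("'tlm_teamd' is not running", "tlm_teamd", templates))
print(status_matches("tlm_teamd is up", "tlm_teamd", templates))
```

If only new images need to be supported, simply replacing the expected string in the test is the smaller change. A list of templates is mainly useful when the same test has to run against images from before and after the change.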

