
[SNMP-SUBAGENT] [202012] Observed an unusual spike in CPU usage because of blocking=True argument #8202

@vivekrnv

Description


Steps to reproduce the issue:

  1. Load any 202012 image which was built with this change PR
  2. Disable counterpoll (if not already disabled) and reload the switch (example commands below)
  3. Wait for the SNMP Container to be up.
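
For step 2, counterpoll is disabled per counter group from the SONiC CLI. The exact set of counter groups varies by image, so the commands below are only an illustrative sketch, not the exact sequence used here:

admin@sonic:~$ counterpoll show                # list counter groups and their status
admin@sonic:~$ sudo counterpoll port disable   # disable the port counter group
admin@sonic:~$ sudo counterpoll queue disable  # disable the queue counter group
admin@sonic:~$ sudo config reload -y           # reload the switch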

Describe the results you received:

Check the CPU usage of the SNMP docker and snmp-subagent process
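
The container-level numbers below come from docker stats, presumably gathered with something like:

admin@sonic:~$ docker stats --no-stream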

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
24b286433cb7        mgmt-framework      0.01%               108.2MiB / 7.689GiB   1.37%               0B / 0B             26.5MB / 69.6kB     19
78e05cfe9404        snmp                100.66%             68.76MiB / 7.689GiB   0.87%               0B / 0B             0B / 139kB          9
f88c433f5867        telemetry           0.89%               141.4MiB / 7.689GiB   1.80%               0B / 0B             36MB / 86kB         34
36437a3448cd        dhcp_relay          0.03%               30.47MiB / 7.689GiB   0.39%               0B / 0B             795kB / 73.7kB      6
ea34908916a6        radv                0.03%               29.96MiB / 7.689GiB   0.38%               0B / 0B             561kB / 77.8kB      6
dcdd0f45efb8        pmon                0.94%               212MiB / 7.689GiB     2.69%               0B / 0B             72.4MB / 1.21MB     16
8fa306e016cc        lldp                6.52%               59.86MiB / 7.689GiB   0.76%               0B / 0B             7.14MB / 127kB      11
e010a19edc0b        syncd               4.44%               813.2MiB / 7.689GiB   10.33%              0B / 0B             34.9MB / 69.6kB     41
6fc14c869be8        teamd               0.04%               33MiB / 7.689GiB      0.42%               0B / 0B             5.78MB / 90.1kB     12
fedda16a902e        swss                0.13%               54.29MiB / 7.689GiB   0.69%               0B / 0B             17.4MB / 246kB      37
84aaea73fd3f        bgp                 0.05%               71.48MiB / 7.689GiB   0.91%               0B / 0B             16.8MB / 5.83MB     22
29fbbe40ea00        database            5.78%               93.08MiB / 7.689GiB   1.18%               0B / 0B             40.8MB / 98.3kB     12

admin@sonic:~$ top
top - 08:06:58 up  4:47,  2 users,  load average: 2.00, 1.67, 1.13
Tasks: 306 total,   2 running, 302 sleeping,   0 stopped,   2 zombie
%Cpu(s): 15.9 us,  2.4 sy,  0.0 ni, 81.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7873.9 total,   2911.0 free,   3284.7 used,   1678.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   4232.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  74572 root      20   0  153692  57784  13900 R 100.0   0.7  12:56.35 python3
   7819 root      20   0 1330816 822260  46216 S  24.2  10.2  49:31.42 sx_sdk

admin@sonic:~$ ps -aux | grep 74572
root       74572 99.1  0.7 153692 57784 pts/0    Rl   07:53  13:29 python3 -m sonic_ax_impl
admin      75908  0.0  0.0   6692   828 pts/1    S+   08:07   0:00 grep 74572

After profiling the snmp-subagent process:

GIL: 0.00%, Active: 100.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   319.0s    319.1s   get_all (swsscommon/swsscommon.py:1661)
  0.00%   0.00%    2.89s     2.89s   update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:542)
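
For reference, the profile above resembles py-spy top output; assuming that tool was used, it should be reproducible against the running subagent with:

admin@sonic:~$ sudo py-spy top --pid 74572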

It is clear that the get_all method of swsscommon.SonicV2Connector takes a significant amount of time when called with blocking set to True: it waits for around 60 seconds if the data is not present in Redis. As inferred from the logic here LINK, that wait is a busy wait, which is the reason for the CPU usage spike.
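
For illustration, a minimal sketch of the two call patterns through the swsscommon Python bindings (the constructor arguments and the COUNTERS_PORT_NAME_MAP key are only an example of data that is absent until counterpoll populates the COUNTERS DB):

from swsscommon.swsscommon import SonicV2Connector

db = SonicV2Connector(use_unix_socket_path=True)
db.connect(db.COUNTERS_DB)

# blocking=True: if the hash is not present yet, get_all keeps retrying
# internally until data shows up or its wait (~60 s here) expires,
# keeping the calling thread on-CPU the whole time.
port_map = db.get_all(db.COUNTERS_DB, 'COUNTERS_PORT_NAME_MAP', blocking=True)

# blocking=False: returns immediately (empty if the hash is absent),
# so the caller can simply retry on its next scheduled update.
port_map = db.get_all(db.COUNTERS_DB, 'COUNTERS_PORT_NAME_MAP', blocking=False)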

This behavior shows up specifically after that commit because the commit essentially delays the initialization of FLEX_COUNTERS, and thereby of the COUNTERS:oid* entries, until counterpoll is enabled.

But the MIB classes in the snmp-subagent periodically poll these DBs with blocking set to True.

As a fix, we can set blocking to False in all of these MIB classes. This shouldn't affect how they operate: they poll Redis periodically anyway, so data missed on one poll is fetched within the next 3 to 7 seconds (the default period is 5 seconds with a random jitter of -2 to +2 seconds). A sketch of the intended change is shown below.
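
A minimal sketch of the kind of change intended, assuming a typical call site in the sonic-snmpagent MIB update code (the function name and key below are illustrative, not the exact call sites):

# Illustrative sketch only -- the real call sites are spread across the
# sonic_ax_impl MIB implementations; db_conn stands for the SonicV2Connector
# instance each MIB class already holds.
def reinit_data(db_conn):
    # Before the fix (busy-waits up to ~60 s when counters are missing):
    #   port_map = db_conn.get_all(db_conn.COUNTERS_DB, 'COUNTERS_PORT_NAME_MAP', blocking=True)

    # After the fix: return immediately and rely on the next periodic update
    # (default period 5 s with a random -2/+2 s jitter) to pick the data up.
    port_map = db_conn.get_all(db_conn.COUNTERS_DB, 'COUNTERS_PORT_NAME_MAP', blocking=False)
    if not port_map:
        # Counters not populated yet (e.g. counterpoll disabled); skip this cycle.
        return
    # ... continue building the MIB entries from port_map ...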

Describe the results you expected:

Shouldn't see the CPU usage spike.

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

Labels: Triaged (this issue has been triaged)
