-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Description
Steps to reproduce the issue:
- Load any 202012 image which was was built with this change PR
- Disable counterpoll, if not already disabled and reload the switch
- Wait for the SNMP Container to be up.
Describe the results you received:
Check the CPU usage of the SNMP docker and snmp-subagent process
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
24b286433cb7 mgmt-framework 0.01% 108.2MiB / 7.689GiB 1.37% 0B / 0B 26.5MB / 69.6kB 19
78e05cfe9404 snmp 100.66% 68.76MiB / 7.689GiB 0.87% 0B / 0B 0B / 139kB 9
f88c433f5867 telemetry 0.89% 141.4MiB / 7.689GiB 1.80% 0B / 0B 36MB / 86kB 34
36437a3448cd dhcp_relay 0.03% 30.47MiB / 7.689GiB 0.39% 0B / 0B 795kB / 73.7kB 6
ea34908916a6 radv 0.03% 29.96MiB / 7.689GiB 0.38% 0B / 0B 561kB / 77.8kB 6
dcdd0f45efb8 pmon 0.94% 212MiB / 7.689GiB 2.69% 0B / 0B 72.4MB / 1.21MB 16
8fa306e016cc lldp 6.52% 59.86MiB / 7.689GiB 0.76% 0B / 0B 7.14MB / 127kB 11
e010a19edc0b syncd 4.44% 813.2MiB / 7.689GiB 10.33% 0B / 0B 34.9MB / 69.6kB 41
6fc14c869be8 teamd 0.04% 33MiB / 7.689GiB 0.42% 0B / 0B 5.78MB / 90.1kB 12
fedda16a902e swss 0.13% 54.29MiB / 7.689GiB 0.69% 0B / 0B 17.4MB / 246kB 37
84aaea73fd3f bgp 0.05% 71.48MiB / 7.689GiB 0.91% 0B / 0B 16.8MB / 5.83MB 22
29fbbe40ea00 database 5.78% 93.08MiB / 7.689GiB 1.18% 0B / 0B 40.8MB / 98.3kB 12
admin@sonic:~$ top
top - 08:06:58 up 4:47, 2 users, load average: 2.00, 1.67, 1.13
Tasks: 306 total, 2 running, 302 sleeping, 0 stopped, 2 zombie
%Cpu(s): 15.9 us, 2.4 sy, 0.0 ni, 81.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 7873.9 total, 2911.0 free, 3284.7 used, 1678.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 4232.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
74572 root 20 0 153692 57784 13900 R 100.0 0.7 12:56.35 python3
7819 root 20 0 1330816 822260 46216 S 24.2 10.2 49:31.42 sx_sdk
admin@sonic:~$ ps -aux | grep 74572
root 74572 99.1 0.7 153692 57784 pts/0 Rl 07:53 13:29 python3 -m sonic_ax_impl
admin 75908 0.0 0.0 6692 828 pts/1 S+ 08:07 0:00 grep 74572
After doing some profiling,
GIL: 0.00%, Active: 100.00%, Threads: 2
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 319.0s 319.1s get_all (swsscommon/swsscommon.py:1661)
0.00% 0.00% 2.89s 2.89s update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:542)
It is very clear that the get_all method of swsscommon.Sonicv2Connector is taking a significant amount of time, when used with blocking set to True (waits for around 60 sec if the data is not present in the Redis). And this seems to busy wait as inferred from the logic here LINK and thus the reason for CPU Usage spike.
This behavior is particular seen after that commit because, that commit essentially delays the initialization of FLEX_COUNTERS and thereby the COUNTERS:oid* until the counterpoll is enabled.
But the MIB classes in the snmp-subagent periodically poll these DB with blocking set to True.
As a fix, we can set the blocking set to False in all these MIB classes. This shouldn't affect how they operate as they periodically poll the redis anyway and if the data is missed at this instant, that can eventually be fetched within the next 3 to 7 second. {5 is the default period with a random delay of -2,2}
Describe the results you expected:
Shouldn't see the CPU Usage spike.
Output of show version
:
(paste your output here)
Output of show techsupport
:
(paste your output here or download and attach the file here )