Description
Description
Recently, an extremely similar bug LINK is seen in the snmp-subagent process which was related to blocking calls and fix is ported already.
This is of similar nature i.e very high CPU Usage, but this occurs occasionally i.e around 20-30 % of the time.
Steps to reproduce the issue:
- Load an image which has the fix for LINK
- Monitor the CPU Usage for the snmp-subagent process
Describe the results you received:
Snapshot of where the process spends the time
Collecting samples from 'python3 -m sonic_ax_impl' (python v3.7.3)
Total Samples 466600
GIL: 100.00%, Active: 100.00%, Threads: 2
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 3862s 3862s update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:539)
0.00% 0.00% 585.3s 585.3s <listcomp> (sonic_ax_impl/mibs/ieee802_1ab.py:555)
0.00% 0.00% 109.8s 695.1s _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:555)
0.00% 0.00% 32.10s 33.01s get_all (swsscommon/swsscommon.py:1663)
0.00% 0.00% 9.04s 11.21s get_message (swsscommon/swsscommon.py:1726)
0.00% 0.00% 4.14s 5.57s match (re.py:173)
0.00% 0.00% 3.51s 3.51s _parse_octet (ipaddress.py:1159)
0.00% 0.00% 3.05s 3.05s _worker (concurrent/futures/thread.py:78)
0.00% 0.00% 2.84s 2.84s __getitem__ (swsscommon/swsscommon.py:345)
0.00% 0.00% 2.76s 37.93s get_all (swsscommon/swsscommon.py:1706)
0.00% 0.00% 1.59s 26.93s _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:546)
0.00% 0.00% 1.47s 6.11s _ip_int_from_string (ipaddress.py:1138)
0.00% 0.00% 1.41s 6.93s ip2byte_tuple (ax_interface/util.py:109)
0.00% 100.00% 1.31s 3477s _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:552)
0.00% 0.00% 1.22s 2.26s poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:82)
0.00% 0.00% 1.14s 1.14s _swig_setattr_nondynamic (swsscommon/swsscommon.py:55)
0.00% 0.00% 1.11s 1.11s get_index_from_str (swsssdk/port_util.py:58)
0.00% 0.00% 1.06s 41.10s update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:504)
0.00% 0.00% 0.940s 12.15s poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:77)
0.00% 0.00% 0.910s 38.85s dbs_get_all (sonic_ax_impl/mibs/__init__.py:628)
0.00% 0.00% 0.860s 0.860s _compile (re.py:273)
0.00% 0.00% 0.730s 0.730s _swig_setattr_nondynamic (swsscommon/swsscommon.py:56)
0.00% 0.00% 0.710s 0.710s __bool__ (swsscommon/swsscommon.py:331)
0.00% 0.00% 0.630s 0.630s get_index_from_str (swsssdk/port_util.py:55)
0.00% 0.00% 0.590s 7.17s __init__ (ipaddress.py:1301)
0.00% 0.00% 0.560s 0.560s _compile (re.py:276)
0.00% 0.00% 0.540s 9.13s poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:90)
0.00% 0.00% 0.530s 0.530s keys (swsscommon/swsscommon.py:354)
0.00% 0.00% 0.480s 4.02s get_subtype (sonic_ax_impl/mibs/ieee802_1ab.py:595)
0.00% 0.00% 0.460s 2.59s _swig_setattr (swsscommon/swsscommon.py:71)
0.00% 0.00% 0.460s 3.08s <lambda> (swsscommon/swsscommon.py:317)
0.00% 0.00% 0.450s 0.450s _parse_octet (ipaddress.py:1168)
0.00% 0.00% 0.410s 1.12s poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:78)
0.00% 0.00% 0.360s 5.94s get_index_from_str (swsssdk/port_util.py:56)
0.00% 0.00% 0.340s 4.44s ip_address (ipaddress.py:44)
0.00% 0.00% 0.330s 0.330s dbs_get_all (sonic_ax_impl/mibs/__init__.py:630)
0.00% 0.00% 0.310s 0.310s get_index_from_str (swsssdk/port_util.py:48)
0.00% 0.00% 0.300s 0.300s _ip_int_from_string (ipaddress.py:1133)
0.00% 0.00% 0.300s 0.300s _parse_octet (ipaddress.py:1172)
0.00% 0.00% 0.290s 0.290s lldp_entry_table (sonic_ax_impl/mibs/__init__.py:143)
0.00% 0.00% 0.290s 0.580s packed (ipaddress.py:1306)
0.00% 0.00% 0.270s 0.270s v4_int_to_packed (ipaddress.py:137)
0.00% 0.00% 0.270s 0.270s get_index_from_str (swsssdk/port_util.py:52)
0.00% 0.00% 0.270s 0.270s dbs_get_all (sonic_ax_impl/mibs/__init__.py:622)
Traige:
On further troubleshooting, issue was traced down to LLDPRemManAddrUpdater MIB inside the snmp-subagent. This MIB unlike most of the others uses redis key_space notifications to update it's internal cache. It subscribes to the notifications for "LLDP_ENTRY_TABLE*". These keys on the other hand are del & set for every 10 secs. Check lldp_syncd logic for more info.
Each of the set operation on these keys are result in 11 notifications, These (12xnum_interfacesxtime/10) notifications accumulate inside the buffer and once the asyncio schedules the LLDPRemManAddrUpdater MIB, the update_date logic fetches the key-value pairs from redis and updates the internal cache once for every notification seen resulting in the spike
All of these older notifications are redundant and can be skipped.
I've thus tested this prototype, which seemed to get down the CPU usage. If deemed fit, this can be used as a good starting point towards the full solution.
diff --git a/src/sonic_ax_impl/mibs/ieee802_1ab.py b/src/sonic_ax_impl/mibs/ieee802_1ab.py
index 8526352..6d31b97 100644
--- a/src/sonic_ax_impl/mibs/ieee802_1ab.py
+++ b/src/sonic_ax_impl/mibs/ieee802_1ab.py
@@ -96,6 +96,19 @@ def poll_lldp_entry_updates(pubsub):
return ret
return data, interface, if_index
+def get_latest_update(pubsub):
+ """
+ Fetches the latest notification recorded on a lldp entry
+ """
+ latest_update_map = {}
+ while True:
+ data, interface, if_index = poll_lldp_entry_updates(pubsub)
+ if not data:
+ break
+ latest_update_map[interface] = (data, if_index)
+ return latest_update_map
+
def parse_sys_capability(sys_cap):
return bytearray([int (x, 16) for x in sys_cap.split()])
@@ -542,18 +555,16 @@ class LLDPRemManAddrUpdater(MIBUpdater):
"""
Listen to updates in APP DB, update local cache
"""
- while True:
- data, interface, if_index = poll_lldp_entry_updates(pubsub)
-
- if not data:
- break
-
+ event_cache = get_latest_update(pubsub)
+ for interface in event_cache.keys():
+ data = event_cache[interface][0]
+ if_index = event_cache[interface][1]
+
if "set" in data:
self.update_rem_if_mgmt(if_index, interface)
elif "del" in data:
- # some remote data about that neighbor is gone, del it and try to query again
+ # if del is the latest notification, then just delete it from the local cache
self.if_range = [sub_oid for sub_oid in self.if_range if sub_oid[0] != if_index]
- self.update_rem_if_mgmt(if_index, interface)
def update_data(self):
for i in range(len(self.db_conn)):
Note: This event-notification logic is also seen in LocPortUpdater MIB but this issue is not seen there because the PORT table is not updated with the frequency that the LLDP_ENTRY is updated with. So, having this fixed as well might also help.
Describe the results you expected:
Output of show version
:
(paste your output here)
Output of show techsupport
:
(paste your output here or download and attach the file here )