Skip to content

Fix swsscommon psubscribe code break in frrcfgd #13836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 1, 2023

Conversation

liuh-80
Copy link
Contributor

@liuh-80 liuh-80 commented Feb 16, 2023

Fix swsscommon psubscribe code break in frrcfgd

Why I did it

Fix frrcfgd psubscribe code break: #13109
The code issue caused by API change when migrate from swsssdk to swsscommon

How I did it

Fix frrcfgd code to use swsscommon psubscribe API.

How to verify it

Pass all UT.
Manually check fixed code work correctly.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Fix frrcfgd psubscribe code break: #13109

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liuh-80 liuh-80 marked this pull request as ready for review February 21, 2023 07:13
@liuh-80 liuh-80 requested a review from lguohan as a code owner February 21, 2023 07:13
@liuh-80 liuh-80 requested a review from qiluo-msft February 21, 2023 07:13
@liuh-80 liuh-80 changed the title [POC] Fix swsscommon psubscribe code break in frrcfgd Fix swsscommon psubscribe code break in frrcfgd Feb 21, 2023
@@ -1445,6 +1445,7 @@ class ExtConfigDBConnector(ConfigDBConnector):
def __init__(self, ns_attrs = None):
super(ExtConfigDBConnector, self).__init__()
self.nosort_attrs = ns_attrs if ns_attrs is not None else {}
self.sub_thread_running = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sub_thread_running

The member is expected to be private, follow python convention, use prefix _ or __ to indicate the intention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if msg:
self.sub_msg_handler(msg)

self.pubsub.punsubscribe(sub_key_space)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@liuh-80 liuh-80 Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After investigation the code of ConfigDBConnector, we don't need punsubscribe there because inside the listen() there is a infinity loop. as my understand the design is listen() will stop when process exit, so not necessary to unsub.

@qiluo-msft
Copy link
Collaborator

@zhaozhenhong Could you help review?

@jeff-yin
Copy link
Collaborator

Please also tag @mhampasagar and @rathnasabapathyv for review, thanks.

@qiluo-msft qiluo-msft merged commit fabb30f into sonic-net:master Mar 1, 2023
xumia pushed a commit to xumia/sonic-buildimage-1 that referenced this pull request Mar 10, 2023
Fix swsscommon psubscribe code break in frrcfgd

#### Why I did it
Fix frrcfgd psubscribe code break: sonic-net#13109
The code issue caused by API change when migrate from swsssdk to swsscommon

#### How I did it
Fix frrcfgd code to use swsscommon psubscribe API.

#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
@puffc
Copy link
Contributor

puffc commented Sep 7, 2023

@qiluo-msft Would you please merge this PR to 202211? Thanks!

bluecmd added a commit to kamelnetworks/sonic-buildimage that referenced this pull request Feb 22, 2024
@bluecmd
Copy link
Contributor

bluecmd commented Feb 28, 2024

This change caused my frrcfgd to spike to 100% CPU usage on my devices. I changed the default timeout to 2 seconds instead of 1 ms and that seemed to solve things.

diff --git a/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py b/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
index bc53e38cf..6c18008f1 100755
--- a/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
+++ b/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
@@ -1471,12 +1471,12 @@ class ExtConfigDBConnector(ConfigDBConnector):
                 syslog.syslog(syslog.LOG_ERR, '[bgp cfgd] Failed handling config DB update with exception:' + str(e))
                 logging.exception(e)
 
-    def listen_thread(self, timeout):
+    def listen_thread(self):
         self.__listen_thread_running = True
         sub_key_space = "__keyspace@{}__:*".format(self.get_dbid(self.db_name))
         self.pubsub.psubscribe(sub_key_space)
         while self.__listen_thread_running:
-            msg = self.pubsub.get_message(timeout, True)
+            msg = self.pubsub.get_message(timeout=2.0, interrupt_on_signal=True)
             if msg:
                 self.sub_msg_handler(msg)
 
@@ -1486,7 +1486,7 @@ class ExtConfigDBConnector(ConfigDBConnector):
         """Start listen Redis keyspace events and will trigger corresponding handlers when content of a table changes.
         """
         self.pubsub = self.get_redis_client(self.db_name).pubsub()
-        self.sub_thread = threading.Thread(target=self.listen_thread, args=(0.01,))
+        self.sub_thread = threading.Thread(target=self.listen_thread)
         self.sub_thread.start()
 
     def stop_listen(self):

@liuh-80
Copy link
Contributor Author

liuh-80 commented Feb 29, 2024

This change caused my frrcfgd to spike to 100% CPU usage on my devices. I changed the default timeout to 2 seconds instead of 1 ms and that seemed to solve things.

diff --git a/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py b/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
index bc53e38cf..6c18008f1 100755
--- a/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
+++ b/src/sonic-frr-mgmt-framework/frrcfgd/frrcfgd.py
@@ -1471,12 +1471,12 @@ class ExtConfigDBConnector(ConfigDBConnector):
                 syslog.syslog(syslog.LOG_ERR, '[bgp cfgd] Failed handling config DB update with exception:' + str(e))
                 logging.exception(e)
 
-    def listen_thread(self, timeout):
+    def listen_thread(self):
         self.__listen_thread_running = True
         sub_key_space = "__keyspace@{}__:*".format(self.get_dbid(self.db_name))
         self.pubsub.psubscribe(sub_key_space)
         while self.__listen_thread_running:
-            msg = self.pubsub.get_message(timeout, True)
+            msg = self.pubsub.get_message(timeout=2.0, interrupt_on_signal=True)
             if msg:
                 self.sub_msg_handler(msg)
 
@@ -1486,7 +1486,7 @@ class ExtConfigDBConnector(ConfigDBConnector):
         """Start listen Redis keyspace events and will trigger corresponding handlers when content of a table changes.
         """
         self.pubsub = self.get_redis_client(self.db_name).pubsub()
-        self.sub_thread = threading.Thread(target=self.listen_thread, args=(0.01,))
+        self.sub_thread = threading.Thread(target=self.listen_thread)
         self.sub_thread.start()
 
     def stop_listen(self):

Hi @bluecmd, can you share how to reproduce this issue and what's the hardware SKU and OS version?

@bluecmd
Copy link
Contributor

bluecmd commented Feb 29, 2024

Hello @liuh-80

Sure - I applied this PR on our 202211 branch and installed the resulting image on both a Celestica Questone 2 and a Celestica Silverstone-X. When running I looked at top and saw that frrcfgd was almost always at 100% CPU.

@liuh-80
Copy link
Contributor Author

liuh-80 commented Feb 29, 2024

Hello @liuh-80

Sure - I applied this PR on our 202211 branch and installed the resulting image on both a Celestica Questone 2 and a Celestica Silverstone-X. When running I looked at top and saw that frrcfgd was almost always at 100% CPU.

Hi @bluecmd , I check on following celestica hardware and not found this issue, I will create some test code and give update later.

SONiC Software Version: SONiC.20230531.19
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 2504c06bcc
Build date: Sun Feb 25 03:18:23 UTC 2024
Built by: cloudtest@361280d0c000000

Platform: x86_64-cel_e1031-r0
HwSKU: Celestica-E1031-T48S4
ASIC: broadcom
ASIC Count: 1
Serial Number: R0882F2B128709BY000006
Model Number: R0882-F4111-01
Hardware Revision: N/A
Uptime: 07:21:38 up 11:34, 1 user, load average: 5.45, 7.10, 5.57
Date: Thu 29 Feb 2024 07:21:38

~$ top
top - 07:21:28 up 11:34, 1 user, load average: 5.53, 7.18, 5.57
Tasks: 275 total, 8 running, 266 sleeping, 0 stopped, 1 zombie
%Cpu(s): 78.3 us, 21.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 1948.3 total, 91.3 free, 1327.9 used, 529.0 buff/cache
MiB Swap: 2048.0 total, 1253.5 free, 794.5 used. 430.9 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

917654 root 20 0 65316 41512 17584 R 33.1 2.1 0:01.03 show
910644 root 20 0 1869700 162264 23540 S 19.9 8.1 1:10.52 syncd
917684 root 20 0 29200 22260 9176 R 17.4 1.1 0:00.54 sonic-cfggen
917616 root 20 0 71844 44428 20000 S 13.5 2.2 0:01.14 python3

@bluecmd
Copy link
Contributor

bluecmd commented Feb 29, 2024

Interesting. Let me know if you need me to test anything.

That said, if I may suggest, I believe that the 0.01s timeout is too low even on platforms where it does not show up as high CPU - since the subscribe is running in its own thread it only needs to wake up occasionally to check if the thread should terminate, and that does not need to happen often at all. I have not checked, but my assumption is that creating and destroying PubSub subscriptions incur a cost, so we should probably not do it needlessly. That's my analysis anyway. EDIT: The subscribe is outside the loop, ignore the last part

@liuh-80
Copy link
Contributor Author

liuh-80 commented Feb 29, 2024

Interesting. Let me know if you need me to test anything.

That said, if I may suggest, I believe that the 0.01s timeout is too low even on platforms where it does not show up as high CPU - since the subscribe is running in its own thread it only needs to wake up occasionally to check if the thread should terminate, and that does not need to happen often at all. I have not checked, but my assumption is that creating and destroying PubSub subscriptions incur a cost, so we should probably not do it needlessly. That's my analysis anyway. EDIT: The subscribe is outside the loop, ignore the last part

I can reproduce this issue by a test script, fix PR is here: #18225

bluecmd added a commit to kamelnetworks/sonic-buildimage that referenced this pull request Mar 3, 2024
bluecmd added a commit to kamelnetworks/sonic-buildimage that referenced this pull request Mar 6, 2024
bradh352 pushed a commit to bradh352/sonic-buildimage that referenced this pull request Dec 3, 2024
Fix swsscommon psubscribe code break in frrcfgd

#### Why I did it
Fix frrcfgd psubscribe code break: sonic-net#13109
The code issue caused by API change when migrate from swsssdk to swsscommon

#### How I did it
Fix frrcfgd code to use swsscommon psubscribe API.

#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
bradh352 pushed a commit to bradh352/sonic-buildimage that referenced this pull request Dec 3, 2024
Fix swsscommon psubscribe code break in frrcfgd

#### Why I did it
Fix frrcfgd psubscribe code break: sonic-net#13109
The code issue caused by API change when migrate from swsssdk to swsscommon

#### How I did it
Fix frrcfgd code to use swsscommon psubscribe API.

#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
bradh352 pushed a commit to bradh352/sonic-buildimage that referenced this pull request Dec 3, 2024
Fix swsscommon psubscribe code break in frrcfgd

#### Why I did it
Fix frrcfgd psubscribe code break: sonic-net#13109
The code issue caused by API change when migrate from swsssdk to swsscommon

#### How I did it
Fix frrcfgd code to use swsscommon psubscribe API.

#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants