Skip to content

Fix hostcfgd does not react to SIGINT issue. #10977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

liuh-80
Copy link
Contributor

@liuh-80 liuh-80 commented May 30, 2022

Why I did it

There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
https://github.com/Azure/sonic-swss-common/issues/603

How I did it

Register swsscommon event handler in hostcfgd.

How to verify it

Pass all E2E test case.
Manually check hostcfgd can be killed with signal.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Fix hostcfgd does not react to SIGINT issue by register swsscommon signal handler in hostcfgd.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liuh-80 liuh-80 requested a review from lguohan as a code owner May 30, 2022 06:15
@liuh-80
Copy link
Contributor Author

liuh-80 commented May 30, 2022

@stepanblyschak, could you please review this PR?
Also in the original issue, your example is ctrl+c which means SIGINT, so I fix the SIGINT signal here, could you please also confirm we only need fix SIGINT?


# Initialize swsscommon signal helper, because python signal handler will be blocked by long running c++ code.
# For more information please check: https://docs.python.org/3/library/signal.html
SignalHandlerHelper.registerSignalHandler(SIGNAL_INT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this overrides signal.signal(signal.SIGINT, signal_handler), we should probably remove signal.SIGINT processing from hostcfgd.

Copy link
Contributor

@akokhan akokhan May 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, on reboot, by default systemd sends SIGTERM but not SIGINT. So, this will not handle delay on reboot. In addition to this, we should probably extend hostcfgd.service with KillSignal=SIGINT or ExecStop= that pkill hostcfgd with SIGTERM and then with SIGINT.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either

[Service]
Type=simple
ExecStart=/usr/local/bin/hostcfgd
KillSignal=SIGINT

or

[Service]
Type=simple
ExecStart=/usr/local/bin/hostcfgd
ExecStop=/bin/bash -c '/usr/bin/killall -15 hostcfgd && /usr/bin/killall -2 hostcfgd'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, currently swsscommon only handle SIGINT, so will add SIGTERM support with this PR: sonic-net/sonic-swss-common#626

I will update this PR after swsscommon updated.

@stepanblyschak
Copy link
Collaborator

stepanblyschak commented May 31, 2022

@liuh-80 I agree with @akokhan, we need to handle SIGTERM and remove signal handlers defined in python since they are overwritten by swss-common.

Although, I have some concerns on usability of such approach. IMO we have lots of python daemons that use ConfigDBConnector.listen() and usually these daemons use custom signal handlers in order to perform cleanup. We then need to review this python code and rewrite their cleanup procedures.

I prepared the following change - sonic-net/sonic-swss-common#624 which is based on your original PR but does not use signal helpers. The idea is that we can pass interrupt_on_signal=True to select() which will then return immidiatelly once errno == EINTR. The ConfigDBConnector.listen() method will pass interrupt_on_signal=True to PubSub.listen_messages() which will pass it to select(). This means when signal occurs we will exit the C++ .so code and enter python world where python can run his signal handlers and then (if signal handler didn't exit) repeat PubSub.listen_messages() in the loop, so that we don't exit the ConfigDBConnector.listen() on signals that we don't want to exit on.

With such solution we don't need to modify hostcfgd and potentially many other daemons

@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 1, 2022

@liuh-80 I agree with @akokhan, we need to handle SIGTERM and remove signal handlers defined in python since they are overwritten by swss-common.

Although, I have some concerns on usability of such approach. IMO we have lots of python daemons that use ConfigDBConnector.listen() and usually these daemons use custom signal handlers in order to perform cleanup. We then need to review this python code and rewrite their cleanup procedures.

I prepared the following change - Azure/sonic-swss-common#624 which is based on your original PR but does not use signal helpers. The idea is that we can pass interrupt_on_signal=True to select() which will then return immidiatelly once errno == EINTR. The ConfigDBConnector.listen() method will pass interrupt_on_signal=True to PubSub.listen_messages() which will pass it to select(). This means when signal occurs we will exit the C++ .so code and enter python world where python can run his signal handlers and then (if signal handler didn't exit) repeat PubSub.listen_messages() in the loop, so that we don't exit the ConfigDBConnector.listen() on signals that we don't want to exit on.

With such solution we don't need to modify hostcfgd and potentially many other daemons

@stepanblyschak , according to your PR any signal will break ConfigDBConnector.listen() method. that maybe not a expected behavior.
for example, in hostcfgd, the SIGHUP been ignored by desigm, but with your PR, the SIGHUP will break the listen() method and hostcfgd will exit when receive SIGHUP.

@stepanblyschak
Copy link
Collaborator

@liuh-80 The idea is that PubSub.listen_message() will exit on signal, however ConfigDBConnect.liste() will call PubSub.listen_message() in a while loop, so listen() does not exit. In my testing I was sending SIGHUP to hostcfgd, I see that hostcfgd's signal handler prints a message about received signal but the daemon still runs. When I send SIGTERM I see hostcfgd's message in the syslog about received signal and hostcfgd exits.

@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 5, 2022

SIGHUP

I'm working on a change in sonic-swss-common to support register python signal handler to swsscommon, and will update this PR after that:

import signal
import os
import time
import threading

from swsscommon.signal import SignalHandlerHelper, RegisterSignalHandler, SIGNAL_INT

def dummy_signal_handler(signum):
... # ignore signal so UT will not break
... print("SIGNAL {} happen".format(signum))
...

RegisterSignalHandler(signal.SIGINT, dummy_signal_handler) <== register signal handler to swsscommon

from swsscommon.swsscommon import SonicV2Connector, ConfigDBConnector

conn = ConfigDBConnector()
conn.connect()
conn.subscribe('A', lambda a: None)
conn.listen()
^CSIGNAL 2 happen

@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 16, 2022

Close this PR because Stepan has a better solution without change any code here: https://github.com/Azure/sonic-swss-common/pull/624/files

@liuh-80 liuh-80 closed this Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants