-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[Mellanox] Always restart thermalctld on Mellanox platform when it exits #5375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Mellanox] Always restart thermalctld on Mellanox platform when it exits #5375
Conversation
What is wrong with |
For example, user issue an kill command in pmon docker, it will send a terminate signal to thermalctld, thermalctld handles such signal and will clear all resources and exit with code 0. This won't trigger an auto restart. |
Certainly, this makes sense. However, this is not "normal operation". If we are this concerned, then I would prefer reverting back to the previous solution (i.e., not creating a new variable/option -- make it |
Actually, we have a regression test case which will kill thermalctld and wait for it to auto restart, and this case failed on master recently. We can also change the test case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not introducing a new variable here. We should decide on a universal solution.
How about we put a exit(-1) to the end of thermalctld? |
As long as we have a non-zero exit code at every exit path, there should be no problem. FYI, exit codes in Linux are non-negative short integers, so |
@jleveque Could you please take a look at sonic-net/sonic-platform-daemons#94? With this new PR, we can drop the current PR. |
Closing in favor of sonic-net/sonic-platform-daemons#94 |
- Why I did it
On mellanox paltform, part of thermalctld function is to handle user space thermal policies for events like fan/PSU removing, it works together with kernel thermal algorithm to make sure the switch won't be overheat.
Recently, we found that commit cbc75fe changes its autorestart configuration in supervisord, and it won't be auto restarted after being killed. This PR is to make sure that thermalctld will be always restarted on mellanox platform when it is killed.
- How I did it
- How to verify it
Manual test
- Which release branch to backport (provide reason below if selected)
Depends on where cbc75fe is going to merge to.
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)