Skip to content

[Mellanox] Always restart thermalctld on Mellanox platform when it exits #5375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Junchao-Mellanox
Copy link
Collaborator

@Junchao-Mellanox Junchao-Mellanox commented Sep 15, 2020

- Why I did it

On mellanox paltform, part of thermalctld function is to handle user space thermal policies for events like fan/PSU removing, it works together with kernel thermal algorithm to make sure the switch won't be overheat.

Recently, we found that commit cbc75fe changes its autorestart configuration in supervisord, and it won't be auto restarted after being killed. This PR is to make sure that thermalctld will be always restarted on mellanox platform when it is killed.

- How I did it

  1. Add a variable "always_restart_thermalctld" in pmon_daemon_control.json
  2. In docker-pmon.supervisord.conf.j2, it checks variable "always_restart_thermalctld" and set autorestart configuration for thermalctld accordingly.

- How to verify it

Manual test

- Which release branch to backport (provide reason below if selected)

Depends on where cbc75fe is going to merge to.

  • 201811
  • 201911
  • 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@jleveque
Copy link
Contributor

What is wrong with autorestart=unexpected? Under what circumstances the daemon will exit "expectedly" according to supervisor and thus not get restarted?

@Junchao-Mellanox
Copy link
Collaborator Author

Junchao-Mellanox commented Sep 15, 2020

For example, user issue an kill command in pmon docker, it will send a terminate signal to thermalctld, thermalctld handles such signal and will clear all resources and exit with code 0. This won't trigger an auto restart.

@jleveque
Copy link
Contributor

jleveque commented Sep 15, 2020

For example, user issue an kill command in pmon docker, it will send a terminate signal to thermalctld, thermalctld handls such signal and will clear all resources and exit with code 0. This won't trigger an auto restart.

Certainly, this makes sense. However, this is not "normal operation". If we are this concerned, then I would prefer reverting back to the previous solution (i.e., not creating a new variable/option -- make it autorestart=true always. However, I must ask, if we are concerned about user error, what would prevent a privileged user from stopping the PMon container altogether?

@Junchao-Mellanox
Copy link
Collaborator Author

Junchao-Mellanox commented Sep 15, 2020

Actually, we have a regression test case which will kill thermalctld and wait for it to auto restart, and this case failed on master recently. We can also change the test case.

Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not introducing a new variable here. We should decide on a universal solution.

@Junchao-Mellanox
Copy link
Collaborator Author

How about we put a exit(-1) to the end of thermalctld?

@jleveque
Copy link
Contributor

How about we put a exit(-1) to the end of thermalctld?

As long as we have a non-zero exit code at every exit path, there should be no problem. FYI, exit codes in Linux are non-negative short integers, so -1 will get converted to 255. Prefer using a positive value so that it's easy to track down.

@Junchao-Mellanox
Copy link
Collaborator Author

@jleveque Could you please take a look at sonic-net/sonic-platform-daemons#94? With this new PR, we can drop the current PR.

@jleveque
Copy link
Contributor

Closing in favor of sonic-net/sonic-platform-daemons#94

@jleveque jleveque closed this Sep 18, 2020
@Junchao-Mellanox Junchao-Mellanox deleted the restart_thermalctld_for_mlnx branch December 15, 2020 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants