Skip to content

[generic-config-updater] Handle failed service restarts #2020

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jan 20, 2022

Conversation

renukamanavalan
Copy link
Contributor

@renukamanavalan renukamanavalan commented Jan 18, 2022

What I did

During config update, update of certain tables do demand service restart.
With multiple related updates are not grouped together, this might result in too many service restarts, which could fail with "hitting start limit". When that happens, call reset-failed, try to restart. If it fails again, take a pause and try to restart again.

How I did it

When service restart fails, call reset-failed, try, pause and then call service restart again.

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@lgtm-com
Copy link

lgtm-com bot commented Jan 18, 2022

This pull request introduces 1 alert when merging 3bd91b5 into 5cc9dd5 - view on LGTM.com

new alerts:

  • 1 for Syntax error

@lgtm-com
Copy link

lgtm-com bot commented Jan 19, 2022

This pull request introduces 1 alert when merging f520752 into d9f3afe - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@renukamanavalan renukamanavalan changed the title generic_updater: Handle failed service restarts [generic-config-updater] Handle failed service restarts Jan 19, 2022
@@ -17,13 +17,42 @@ def set_verbose(verbose=False):

def _service_restart(svc_name):
rc = os.system(f"systemctl restart {svc_name}")
logger.log(logger.LOG_PRIORITY_NOTICE,
f"Restarted {svc_name}", print_to_console)
if rc != 0:
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rc

Is the rc a stable code when too many restarts happen? If yes, just check if rc == <code>: #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rc == or != 0 is stable comparison for success/failure

Copy link
Contributor

@qiluo-msft qiluo-msft Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I mean is there a specific exit code for "too many restarts" failure? if yes, we can check that code specifically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no specific error code. 1 is a generic error code, which is most commonly we see. But this could change in future. Moreover if there is any inherent service related error, the monitor catches it and leads to ICM.

In our case, for any failure, we try our best, before giving up.

ref: link

"In case of an error while processing any init-script action except for status, the init script shall print an error message and exit with a non-zero status code:"

"1 generic or unspecified error (current practice)"

print_to_console)

rc = os.system(f"systemctl restart {svc_name}")
if rc != 0:
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rc

Is there a specific exit code for 2nd failure? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get your comments. But rc is the absolute way to check success / failure.

return rc == 0


def rsyslog_validator(old_config, upd_config, keys):
return _service_restart("rsyslog-config")
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_service_restart

With _service_restart fixed, do we still need this fix? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we need is to run /usr/bin/rsyslog-config.sh.
What it does is, update /etc/rsyslog.conf from CONFG-DB & restart rsyslog.
We do the same and additionally handle the possibility of rsyslog restart failure.

rsyslog-config is not a service, but a one shot wrapper to run /usr/bin/rsyslog-config.sh, after updategraph.service, at the startup.

@renukamanavalan renukamanavalan merged commit 4f2773c into sonic-net:master Jan 20, 2022
renukamanavalan added a commit that referenced this pull request Jan 20, 2022
What I did
Missed update from review comments in PR #2020
s/os.system("sleep 10s")/time.sleep(10)/
judyjoseph pushed a commit that referenced this pull request Jan 31, 2022
What I did
During config update, update of certain tables do demand service restart.
With multiple related updates are not grouped together, this might result in too many service restarts, which could fail with "hitting start limit". When that happens, call reset-failed, try to restart. If it fails again, take a pause and try to restart again.

How I did it
When service restart fails, call reset-failed, try, pause and then call service restart again.
judyjoseph pushed a commit that referenced this pull request Jan 31, 2022
What I did
Missed update from review comments in PR #2020
s/os.system("sleep 10s")/time.sleep(10)/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants