-
Notifications
You must be signed in to change notification settings - Fork 711
[generic-config-updater] Handle failed service restarts #2020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2) Read conf file from install dir 3) Drop empty keys & tables upon jsonpatch.JsonPatch.apply to be in sync with redis update 4) Prefix service_validator module path with "generic_updater"
2) Added vlan validator 3) Added test code for vlan validator
This pull request introduces 1 alert when merging 3bd91b5 into 5cc9dd5 - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging f520752 into d9f3afe - view on LGTM.com new alerts:
|
@@ -17,13 +17,42 @@ def set_verbose(verbose=False): | |||
|
|||
def _service_restart(svc_name): | |||
rc = os.system(f"systemctl restart {svc_name}") | |||
logger.log(logger.LOG_PRIORITY_NOTICE, | |||
f"Restarted {svc_name}", print_to_console) | |||
if rc != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rc == or != 0 is stable comparison for success/failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I mean is there a specific exit code for "too many restarts" failure? if yes, we can check that code specifically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no specific error code. 1 is a generic error code, which is most commonly we see. But this could change in future. Moreover if there is any inherent service related error, the monitor catches it and leads to ICM.
In our case, for any failure, we try our best, before giving up.
ref: link
"In case of an error while processing any init-script action except for status, the init script shall print an error message and exit with a non-zero status code:"
"1 generic or unspecified error (current practice)"
print_to_console) | ||
|
||
rc = os.system(f"systemctl restart {svc_name}") | ||
if rc != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get your comments. But rc is the absolute way to check success / failure.
return rc == 0 | ||
|
||
|
||
def rsyslog_validator(old_config, upd_config, keys): | ||
return _service_restart("rsyslog-config") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we need is to run /usr/bin/rsyslog-config.sh.
What it does is, update /etc/rsyslog.conf from CONFG-DB & restart rsyslog.
We do the same and additionally handle the possibility of rsyslog restart failure.
rsyslog-config is not a service, but a one shot wrapper to run /usr/bin/rsyslog-config.sh, after updategraph.service, at the startup.
What I did Missed update from review comments in PR #2020 s/os.system("sleep 10s")/time.sleep(10)/
What I did During config update, update of certain tables do demand service restart. With multiple related updates are not grouped together, this might result in too many service restarts, which could fail with "hitting start limit". When that happens, call reset-failed, try to restart. If it fails again, take a pause and try to restart again. How I did it When service restart fails, call reset-failed, try, pause and then call service restart again.
What I did Missed update from review comments in PR #2020 s/os.system("sleep 10s")/time.sleep(10)/
What I did
During config update, update of certain tables do demand service restart.
With multiple related updates are not grouped together, this might result in too many service restarts, which could fail with "hitting start limit". When that happens, call reset-failed, try to restart. If it fails again, take a pause and try to restart again.
How I did it
When service restart fails, call reset-failed, try, pause and then call service restart again.
How to verify it
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)