[201811] Check platform reboot cause to see if any reset happened during fast/warm-reboot #8912
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,6 +36,23 @@ function preStartAction() | |
echo -n > /tmp/dump.rdb | ||
docker cp /tmp/dump.rdb database:/var/lib/redis/ | ||
fi | ||
{%- elif docker_container_name == "swss" %} | ||
if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then | ||
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then | ||
REG_BOOT_TYPE="fast*" | ||
CAUSE_NO_AVAIL="\"N/A\"" | ||
REBOOT_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.cause')" | ||
Review comment: On testing, observed process-reboot-cause is running later than swss
Review comment: determine-reboot-cause is also running later than swss.
Review comment: @santhosh-kt If that is the case, should we go with the script approach to check the reboot-cause? I think it's too hard to fix the process starting order with platform api dependency. To understand the reboot-cause, we can keep the determine-reboot-cause changes with this PR.
Review comment: @sujinmkang : Already tested this with a dedicated script(A part of the commit - https://github.com/Azure/sonic-buildimage/pull/8024/files -
||
EXTRA_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.comment')" | ||
|
||
# Clear the FAST_REBOOT|system db setting if EXTRA_CAUSE is not "N/A" before starting swss | ||
if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then | ||
if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then | ||
# Delete the FAST_REBOOT|system db setting | ||
$SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null | ||
Review comment: This delete could be too late: syncd might have read it and proceeded with fast reboot recovery.
Review comment: @yxieca With the platform API approach to determining the hardware reboot-cause, it's hard to get the actual hardware reboot-cause before syncd or swss starts. I think it's better to use a platform script to determine the hardware reboot. What do you think?
Review comment: The point of this change is to determine that the reboot cause is not a fast/warm reboot before syncd/swss starts, so that we don't try to start the system with fast/warm recovery.
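To make the check under discussion easier to reason about, here is a minimal Python sketch of the decision the shell snippet makes. It assumes the JSON record has the `cause` and `comment` keys written by `get_reboot_cause_dict` in this PR; the function name `should_clear_fast_reboot_flag` and the sample values are hypothetical. It does not address the ordering concern raised above, since it only shows the decision, not when it runs.

```python
import json
import re

# Assumption: "N/A" marks a missing comment, as in get_reboot_cause_dict.
NOT_AVAILABLE = "N/A"


def should_clear_fast_reboot_flag(record):
    """Return True when the recorded cause is a fast reboot but the record
    also carries extra cause information (for example a hardware cause)."""
    cause = str(record.get("cause", ""))
    comment = record.get("comment", NOT_AVAILABLE)
    # The shell pattern "fast*" used with =~ is a regex meaning "fas" followed
    # by zero or more "t". This sketch deliberately matches "fast" instead.
    is_fast_reboot = re.search(r"fast", cause) is not None
    return is_fast_reboot and comment != NOT_AVAILABLE


if __name__ == "__main__":
    # Hypothetical sample record; the comment value is made up for illustration.
    sample = json.loads('{"cause": "fast-reboot", "comment": "Watchdog"}')
    print(should_clear_fast_reboot_flag(sample))
```

Note that `jq '.cause'` prints the value with its surrounding double quotes, which is why the shell version compares against `"\"N/A\""`; parsing the JSON directly avoids that quoting step.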
||
fi | ||
fi | ||
fi | ||
fi | ||
{%- else %} | ||
: # nothing | ||
{%- endif %} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -138,6 +138,35 @@ def find_hardware_reboot_cause(): | |
return hardware_reboot_cause | ||
|
||
|
||
def get_reboot_cause_dict(previous_reboot_cause, comment, gen_time): | ||
"""Store the key information of device reboot into a dictionary by parsing the string in | ||
previous_reboot_cause. | ||
If user issued a command to reboot device, then user, command and time will be | ||
stored into a dictionary. | ||
If device was rebooted due to the kernel panic, then the string `Kernel Panic` | ||
and time will be stored into a dictionary. | ||
""" | ||
reboot_cause_dict = {} | ||
reboot_cause_dict['gen_time'] = gen_time | ||
reboot_cause_dict['cause'] = previous_reboot_cause | ||
reboot_cause_dict['user'] = "N/A" | ||
reboot_cause_dict['time'] = "N/A" | ||
reboot_cause_dict['comment'] = comment if comment is not None else "N/A" | ||
if re.search(r'User issued', previous_reboot_cause): | ||
# Match with "User issued '{}' command [User: {}, Time: {}]" | ||
match = re.search(r'User issued \'(.*)\' command \[User: (.*), Time: (.*)\]', previous_reboot_cause) | ||
if match is not None: | ||
reboot_cause_dict['cause'] = match.group(1) | ||
reboot_cause_dict['user'] = match.group(2) | ||
reboot_cause_dict['time'] = match.group(3) | ||
elif re.search(r'Kernel Panic', previous_reboot_cause): | ||
match = re.search(r'Kernel Panic \[Time: (.*)\]', previous_reboot_cause) | ||
if match is not None: | ||
reboot_cause_dict['cause'] = "Kernel Panic" | ||
reboot_cause_dict['time'] = match.group(1) | ||
|
||
return reboot_cause_dict | ||
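As an illustration of the two patterns used in `get_reboot_cause_dict`, the following standalone sketch applies them to sample strings. The helper name `parse_cause` and the sample user and timestamp are hypothetical; only the regular expressions are taken from the diff.

```python
import re

# Patterns copied from the diff above.
USER_ISSUED_RE = re.compile(r"User issued '(.*)' command \[User: (.*), Time: (.*)\]")
KERNEL_PANIC_RE = re.compile(r"Kernel Panic \[Time: (.*)\]")


def parse_cause(text):
    """Split a reboot-cause string into cause, user and time fields."""
    match = USER_ISSUED_RE.search(text)
    if match is not None:
        return {"cause": match.group(1), "user": match.group(2), "time": match.group(3)}
    match = KERNEL_PANIC_RE.search(text)
    if match is not None:
        return {"cause": "Kernel Panic", "user": "N/A", "time": match.group(1)}
    # Anything else is kept verbatim as the cause.
    return {"cause": text, "user": "N/A", "time": "N/A"}


if __name__ == "__main__":
    # Hypothetical sample strings in the formats named in the comments above.
    print(parse_cause("User issued 'fast-reboot' command [User: admin, Time: Thu 07 Oct 2021 10:00:00 AM UTC]"))
    print(parse_cause("Kernel Panic [Time: Thu 07 Oct 2021 10:00:00 AM UTC]"))
```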
|
||
def main(): | ||
log_info("Starting up...") | ||
|
||
|
@@ -167,6 +196,8 @@ def main(): | |
# reboot info. We will use it as the previous cause. | ||
software_reboot_cause = find_software_reboot_cause() | ||
|
||
additional_reboot_info = None | ||
|
||
# The main decision logic of the reboot cause: | ||
# If there is a reboot cause indicated by /proc/cmdline, it should be warmreboot/fastreboot | ||
# the software_reboot_cause which is the content of /hosts/reboot-cause/reboot-cause.txt | ||
|
@@ -176,11 +207,34 @@ def main(): | |
# Else the software_reboot_cause will be treated as the reboot cause | ||
if proc_cmdline_reboot_cause is not None: | ||
previous_reboot_cause = software_reboot_cause | ||
if not hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE): | ||
# Add the hardware_reboot_cause into additional_reboot_info | ||
additional_reboot_info = hardware_reboot_cause | ||
elif hardware_reboot_cause is not None: | ||
previous_reboot_cause = hardware_reboot_cause | ||
# Check if any software reboot was issued before this hardware reboot happened | ||
if software_reboot_cause != REBOOT_CAUSE_UNKNOWN: | ||
additional_reboot_info = software_reboot_cause | ||
else: | ||
previous_reboot_cause = software_reboot_cause | ||
|
||
# Current time | ||
reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) | ||
Review comment: Import Error!
|
||
|
||
# Save the previous cause info into its history file as json format | ||
reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time) | ||
|
||
# Create reboot-cause-#time#.json under history directory | ||
REBOOT_CAUSE_HISTORY_FILE_JSON = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time)) | ||
Review comment: REBOOT_CAUSE_HISTORY_DIR is not defined.
||
|
||
# Create REBOOT_CAUSE_HISTORY_DIR if it doesn't exist | ||
if not os.path.exists(REBOOT_CAUSE_HISTORY_DIR): | ||
os.makedirs(REBOOT_CAUSE_HISTORY_DIR) | ||
|
||
# Write the previous reboot cause to REBOOT_CAUSE_HISTORY_FILE_JSON as a JSON format | ||
with open(REBOOT_CAUSE_HISTORY_FILE_JSON, "w") as reboot_cause_history_file: | ||
json.dump(reboot_cause_dict, reboot_cause_history_file) | ||
Review comment: import error for json
||
|
||
# Write the previous reboot cause to PREVIOUS_REBOOT_CAUSE_FILE | ||
with open(PREVIOUS_REBOOT_CAUSE_FILE, "w") as prev_cause_file: | ||
prev_cause_file.write(previous_reboot_cause) | ||
|
Review comment: Please use white space consistently.
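The review comments on the history-file code flag missing `datetime` and `json` imports and an undefined `REBOOT_CAUSE_HISTORY_DIR`. A minimal, self-contained sketch of that step with those pieces supplied might look like the following. The helper name `write_reboot_cause_history` is hypothetical, the directory is passed in as a parameter rather than guessed, and `exist_ok=True` assumes Python 3, which may not match the interpreter used on the 201811 branch.

```python
import datetime
import json
import os
import tempfile


def write_reboot_cause_history(history_dir, reboot_cause_dict):
    """Write reboot_cause_dict to reboot-cause-<timestamp>.json in history_dir."""
    gen_time = datetime.datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    # Create the history directory if it doesn't exist (Python 3 only).
    os.makedirs(history_dir, exist_ok=True)
    path = os.path.join(history_dir, "reboot-cause-{}.json".format(gen_time))
    with open(path, "w") as history_file:
        json.dump(reboot_cause_dict, history_file)
    return path


if __name__ == "__main__":
    # A temporary directory stands in for the real history directory.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_reboot_cause_history(
            os.path.join(tmp, "history"),
            {"cause": "reboot", "user": "N/A", "time": "N/A", "comment": "N/A"},
        )
        print(os.path.basename(written))
```

Passing the directory in as an argument keeps the function testable without touching the host filesystem; in the script itself a module-level constant would serve the same purpose.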