
[201811] Check platform reboot cause to see if any reset happened during fast/warm-reboot #8912

Merged: 8 commits, Dec 1, 2021
17 changes: 17 additions & 0 deletions files/build_templates/docker_image_ctl.j2
@@ -36,6 +36,23 @@ function preStartAction()
        echo -n > /tmp/dump.rdb
        docker cp /tmp/dump.rdb database:/var/lib/redis/
    fi
{%- elif docker_container_name == "swss" %}
    if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then
        if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
            REG_BOOT_TYPE="fast*"
            CAUSE_NO_AVAIL="\"N/A\""
Contributor

Please use white space consistently.

            REBOOT_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.cause')"
Contributor

In testing, I observed that process-reboot-cause runs later than swss.

Contributor

determine-reboot-cause is also running later than swss.

Collaborator Author

@santhosh-kt If that is the case, should we go with the script approach to check the reboot-cause? I think it's too hard to fix the process starting order with platform api dependency. To understand the reboot-cause, we can keep the determine-reboot-cause changes with this PR.

Contributor

@sujinmkang: I already tested this with a dedicated script (part of the commit https://github.com/Azure/sonic-buildimage/pull/8024/files, see _is_software_reboot() in track_reboot_reason.sh) that is called inside the preStartAction() script, and it is able to identify the CPU reset cases.

            EXTRA_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.comment')"

            # Clear the FAST_REBOOT|system db setting if EXTRA_CAUSE is not "N/A" before starting swss
            if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then
                if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then
                    # Delete the FAST_REBOOT|system db setting
                    $SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null
Contributor

This delete could be too late: syncd might have read it and proceeded with fast reboot recovery.

Collaborator Author

@yxieca With the platform api approach to determine the hardware reboot-cause, it's hard to get the actual hardware reboot-cause before syncd or swss starts. I think it's better to use a platform script to determine the hardware reboot. What do you think?

Contributor

The point of this change is to determine that the reboot cause is not a fast/warm reboot before syncd/swss starts, so that we don't try to start the system with fast/warm recovery.

                fi
            fi
        fi
    fi
{%- else %}
    : # nothing
{%- endif %}
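
The swss branch above boils down to two checks on previous-reboot-cause.json. As a rough illustration only (the PR implements this in shell with jq, not Python), the same decision could be sketched as follows. The helper name `should_clear_fast_reboot_flag` and the sample records are hypothetical, and both keys are assumed to be present in the file. Note that `jq '.comment'` without `-r` prints the JSON string including its quotes, which is why the shell code compares against `"\"N/A\""`; `json.loads` returns the unquoted value instead.

```python
import json
import re

# Mirrors REG_BOOT_TYPE="fast*" from the shell snippet. Bash's =~ performs an
# unanchored regex search, which re.search reproduces.
REG_BOOT_TYPE = r"fast*"
CAUSE_NOT_AVAILABLE = "N/A"


def should_clear_fast_reboot_flag(previous_cause_json):
    """Return True when the recorded cause is a fast reboot but extra reset info exists.

    Hypothetical helper: it only restates the two nested conditions of the
    shell snippet and does not touch STATE_DB. Assumes the "cause" and
    "comment" keys are both present.
    """
    record = json.loads(previous_cause_json)
    if not re.search(REG_BOOT_TYPE, record["cause"]):
        return False
    return record["comment"] != CAUSE_NOT_AVAILABLE


# Illustrative records, not captured from a real device.
clean = json.dumps({"cause": "fast-reboot", "comment": "N/A"})
interrupted = json.dumps({"cause": "fast-reboot", "comment": "Watchdog"})
cold = json.dumps({"cause": "reboot", "comment": "N/A"})

for name, record in (("clean", clean), ("interrupted", interrupted), ("cold", cold)):
    print(name, should_clear_fast_reboot_flag(record))
```

Only the "interrupted" record asks for the flag to be cleared. As the review thread points out, this check is only useful if the JSON file has already been written for the current boot by the time swss starts.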
54 changes: 54 additions & 0 deletions files/image_config/process-reboot-cause/process-reboot-cause
@@ -138,6 +138,35 @@ def find_hardware_reboot_cause():
    return hardware_reboot_cause


def get_reboot_cause_dict(previous_reboot_cause, comment, gen_time):
    """Store the key information of a device reboot in a dictionary by parsing the string in
       previous_reboot_cause.
       If the user issued a command to reboot the device, then the user, command and time
       are stored in the dictionary.
       If the device was rebooted due to a kernel panic, then the string `Kernel Panic`
       and the time are stored in the dictionary.
    """
    reboot_cause_dict = {}
    reboot_cause_dict['gen_time'] = gen_time
    reboot_cause_dict['cause'] = previous_reboot_cause
    reboot_cause_dict['user'] = "N/A"
    reboot_cause_dict['time'] = "N/A"
    reboot_cause_dict['comment'] = comment if comment is not None else "N/A"
    if re.search(r'User issued', previous_reboot_cause):
        # Match with "User issued '{}' command [User: {}, Time: {}]"
        match = re.search(r'User issued \'(.*)\' command \[User: (.*), Time: (.*)\]', previous_reboot_cause)
        if match is not None:
            reboot_cause_dict['cause'] = match.group(1)
            reboot_cause_dict['user'] = match.group(2)
            reboot_cause_dict['time'] = match.group(3)
    elif re.search(r'Kernel Panic', previous_reboot_cause):
        match = re.search(r'Kernel Panic \[Time: (.*)\]', previous_reboot_cause)
        if match is not None:
            reboot_cause_dict['cause'] = "Kernel Panic"
            reboot_cause_dict['time'] = match.group(1)

    return reboot_cause_dict
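
To make the two string formats handled by get_reboot_cause_dict() concrete, here is a small standalone sketch that applies the same two regular expressions to sample inputs. The helper `parse_cause` and the sample strings (user name, timestamps) are made up for illustration; only the patterns come from the diff.

```python
import re

# Same patterns as in get_reboot_cause_dict(), precompiled for reuse.
USER_ISSUED = re.compile(r"User issued '(.*)' command \[User: (.*), Time: (.*)\]")
KERNEL_PANIC = re.compile(r"Kernel Panic \[Time: (.*)\]")


def parse_cause(previous_reboot_cause):
    """Return (cause, user, time) extracted from a reboot-cause string.

    Hypothetical helper for illustration; fields that cannot be extracted
    fall back to "N/A", as in the function from the diff.
    """
    match = USER_ISSUED.search(previous_reboot_cause)
    if match is not None:
        return match.group(1), match.group(2), match.group(3)
    match = KERNEL_PANIC.search(previous_reboot_cause)
    if match is not None:
        return "Kernel Panic", "N/A", match.group(1)
    # Any other string is kept verbatim as the cause.
    return previous_reboot_cause, "N/A", "N/A"


# Sample inputs in the documented formats; the values are invented.
user_sample = "User issued 'fast-reboot' command [User: admin, Time: Sun 17 Oct 2021 07:20:01 PM UTC]"
panic_sample = "Kernel Panic [Time: Sun 17 Oct 2021 07:20:01 PM UTC]"

print(parse_cause(user_sample))
print(parse_cause(panic_sample))
print(parse_cause("Unknown"))
```

Because the capture groups are greedy, a user name or timestamp that itself contained `, Time: ` or `]` would shift the group boundaries; the sample strings avoid that.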

def main():
    log_info("Starting up...")

@@ -167,6 +196,8 @@ def main():
    # reboot info. We will use it as the previous cause.
    software_reboot_cause = find_software_reboot_cause()

    additional_reboot_info = None

    # The main decision logic of the reboot cause:
    # If there is a reboot cause indicated by /proc/cmdline, it should be warmreboot/fastreboot
    # the software_reboot_cause which is the content of /host/reboot-cause/reboot-cause.txt
@@ -176,11 +207,34 @@ def main():
    # Else the software_reboot_cause will be treated as the reboot cause
    if proc_cmdline_reboot_cause is not None:
        previous_reboot_cause = software_reboot_cause
        if hardware_reboot_cause is not None and not hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
            # Add the hardware_reboot_cause into additional_reboot_info
            additional_reboot_info = hardware_reboot_cause
    elif hardware_reboot_cause is not None:
        previous_reboot_cause = hardware_reboot_cause
        # Check if any software reboot was issued before this hardware reboot happened
        if software_reboot_cause != REBOOT_CAUSE_UNKNOWN:
            additional_reboot_info = software_reboot_cause
    else:
        previous_reboot_cause = software_reboot_cause

    # Current time
    reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Contributor

Import Error!

Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Starting up...
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from /proc/cmdline
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from platform api
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Reboot cause file /host/reboot-cause/reboot-cause.txt not found
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Traceback (most recent call last):
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 255, in <module>
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: main()
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 221, in main
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: NameError: global name 'datetime' is not defined


    # Save the previous cause info into its history file in JSON format
    reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

    # Create reboot-cause-#time#.json under the history directory
    REBOOT_CAUSE_HISTORY_FILE_JSON = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))
Contributor

REBOOT_CAUSE_HISTORY_DIR is not defined.


    # Create REBOOT_CAUSE_HISTORY_DIR if it doesn't exist
    if not os.path.exists(REBOOT_CAUSE_HISTORY_DIR):
        os.makedirs(REBOOT_CAUSE_HISTORY_DIR)

    # Write the previous reboot cause to REBOOT_CAUSE_HISTORY_FILE_JSON in JSON format
    with open(REBOOT_CAUSE_HISTORY_FILE_JSON, "w") as reboot_cause_history_file:
        json.dump(reboot_cause_dict, reboot_cause_history_file)
Contributor

import error for json


    # Write the previous reboot cause to PREVIOUS_REBOOT_CAUSE_FILE
    with open(PREVIOUS_REBOOT_CAUSE_FILE, "w") as prev_cause_file:
        prev_cause_file.write(previous_reboot_cause)
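
The three review comments on this hunk (the `datetime` NameError, the undefined REBOOT_CAUSE_HISTORY_DIR, and the missing `json` import) all concern names the new code uses without defining. A minimal standalone sketch of the history-file step with those dependencies made explicit might look like the following. It assumes Python 3 (for `exist_ok`), the function name `write_reboot_cause_history` is hypothetical, and the history directory is passed in as a parameter because its actual value is not shown in this diff.

```python
import datetime
import json
import os
import tempfile


def write_reboot_cause_history(reboot_cause_dict, history_dir, now=None):
    """Write reboot_cause_dict to <history_dir>/reboot-cause-<timestamp>.json.

    Hypothetical helper for illustration. Returns the path of the written file.
    """
    now = now or datetime.datetime.now()
    gen_time = now.strftime('%Y_%m_%d_%H_%M_%S')
    # Create the history directory if it doesn't exist yet.
    os.makedirs(history_dir, exist_ok=True)
    path = os.path.join(history_dir, "reboot-cause-{}.json".format(gen_time))
    with open(path, "w") as history_file:
        json.dump(reboot_cause_dict, history_file)
    return path


# A temporary directory stands in for REBOOT_CAUSE_HISTORY_DIR here.
history_dir = os.path.join(tempfile.mkdtemp(), "history")
# Invented sample record using the keys from get_reboot_cause_dict().
sample = {
    "gen_time": "2021_10_17_19_26_27",
    "cause": "fast-reboot",
    "user": "admin",
    "time": "N/A",
    "comment": "N/A",
}
fixed_now = datetime.datetime(2021, 10, 17, 19, 26, 27)
path = write_reboot_cause_history(sample, history_dir, now=fixed_now)
print(path)
```

Passing `now` explicitly keeps the file name deterministic in tests; in the script itself the current time would be used.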