[warm-reboot] Add new preboot health check: verify database integrity #1785

Merged: 8 commits, Sep 13, 2021

@vaibhavhd vaibhavhd commented Aug 27, 2021

What I did

Verify database integrity before proceeding with warm reboot or fast reboot.

This integrity check validates the databases against a JSON schema. Initially, only the presence of the COUNTERS_PORT_NAME_MAP table in counters_db is verified, but the list can grow in the future.

The check logic is designed to be generic: more databases, or more tables within them, can be added to the schema list without changing the verification logic.

How I did it

Added a JSON schema and generic schema validation logic.
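
To illustrate the idea of a schema list driving a generic check, here is a minimal sketch. It is not the code from this PR: it hand-rolls only the `required` keyword of JSON Schema using the standard library, whereas the real script presumably relies on a JSON schema validator. The names `DB_SCHEMAS`, `missing_required`, and `check_db_integrity`, and the shape of the database dumps, are assumptions made for the example.

```python
# Hypothetical schema list: one JSON-Schema-style entry per database.
# Covering another database or table means adding an entry here only.
DB_SCHEMAS = {
    "COUNTERS_DB": {
        "type": "object",
        "required": ["COUNTERS_PORT_NAME_MAP"],
    },
}


def missing_required(db_dump, schema):
    """Return one error string per required key absent from db_dump."""
    return [
        f"{key!r} is a required property"
        for key in schema.get("required", [])
        if key not in db_dump
    ]


def check_db_integrity(dumps, schemas=DB_SCHEMAS):
    """Check every database dump against its schema and collect all errors.

    `dumps` maps a database name to a dict of its top-level keys (an
    assumed in-memory representation, not the real Redis access path).
    """
    errors = []
    for db_name, schema in schemas.items():
        errors.extend(missing_required(dumps.get(db_name, {}), schema))
    return errors


# Illustrative data only; the port name and OID are made up.
good = {"COUNTERS_DB": {"COUNTERS_PORT_NAME_MAP": {"Ethernet0": "oid:0x1"}}}
bad = {"COUNTERS_DB": {}}

if __name__ == "__main__":
    print(check_db_integrity(good))
    print(check_db_integrity(bad))
```

Because the loop iterates over the schema list, the verification function stays unchanged when new entries are added, which is the property described above.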

How to verify it

Bad case:

root@str-s6100-acs-2:/usr/local/bin# warm-reboot -vvv
Fri 10 Sep 2021 06:47:00 PM UTC Saving counters folder before warmboot...
Failed to validate DB's integrity. Exit code: 1. Use '-d' option to force ignore this check.
Fri 10 Sep 2021 06:47:02 PM UTC warm-reboot failure (15) cleanup ...
Fri 10 Sep 2021 06:47:03 PM UTC Cancel warm-reboot: code (0)
root@str-s6100-acs-2:/usr/local/bin#

SYSLOG:

root@str-s6100-acs-2:/usr/local/bin# tail /var/log/syslog
Sep 10 18:47:00.082206 str-s6100-acs-2 NOTICE admin: Saving counters folder before warmboot...
Sep 10 18:47:02.635871 str-s6100-acs-2 ERR /check_db_integrity.py: Database is missing tables/entries needed for reboot procedure. DB integrity check failed with:#012'COUNTERS_PORT_NAME_MAP' is a required property
Sep 10 18:47:02.646312 str-s6100-acs-2 NOTICE admin: warm-reboot failure (15) cleanup ...   <<<--------------------------------------- Check failed
Sep 10 18:47:03.685792 str-s6100-acs-2 NOTICE admin: Cancel warm-reboot: code (0)
root@str-s6100-acs-2:/usr/local/bin# 

Bad case with forced ignore:

root@str-s6100-acs-2:~# warm-reboot -vvvd
Fri 10 Sep 2021 08:32:04 PM UTC Saving counters folder before warmboot...
Fri 10 Sep 2021 08:32:06 PM UTC Ignoring Database integrity checks...   <<<--------------------------------------- IGNORED failure
Fri 10 Sep 2021 08:32:08 PM UTC Pausing orchagent ...
Fri 10 Sep 2021 08:32:08 PM UTC Collecting logs to check ssd health before warm-reboot...
Fri 10 Sep 2021 08:32:09 PM UTC Stopping lldp ...
Fri 10 Sep 2021 08:32:11 PM UTC Stopped lldp
Fri 10 Sep 2021 08:32:11 PM UTC Stopping nat ...
...

SYSLOG:

Sep 10 20:32:04.202216 str-s6100-acs-2 NOTICE admin: Saving counters folder before warmboot...
Sep 10 20:32:06.630111 str-s6100-acs-2 ERR /check_db_integrity.py: Database is missing tables/entries needed for reboot procedure. DB integrity check failed with:#012'COUNTERS_PORT_NAME_MAP' is a required property
Sep 10 20:32:06.638531 str-s6100-acs-2 NOTICE admin: Ignoring Database integrity checks...  <<<--------------------------------------- IGNORED failure
Sep 10 20:32:06.987446 str-s6100-acs-2 INFO admin: Checking that ASIC configuration has not changed
Sep 10 20:32:07.647537 str-s6100-acs-2 INFO admin: ASIC config unchanged, current and destination SONiC version are the same
Sep 10 20:32:08.739484 str-s6100-acs-2 NOTICE admin: Pausing orchagent ...
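
The two bad-case transcripts above suggest a simple gate: a non-zero result from the integrity check aborts the reboot unless '-d' is given, in which case the failure is logged and ignored. The sketch below models that decision only. The function name `preboot_db_gate` and the constant name are hypothetical, the value 15 is inferred from the "warm-reboot failure (15)" log line, and the real reboot script is not reproduced here.

```python
# Assumed from the "warm-reboot failure (15)" line in the transcript above.
EXIT_DB_INTEGRITY_FAILURE = 15


def preboot_db_gate(check_exit_code, ignore_db_check=False):
    """Decide whether the reboot may proceed after the DB integrity check.

    Returns (exit_code, messages): exit_code 0 means "continue with the
    reboot"; any other value means "abort with this code".
    """
    if check_exit_code == 0:
        # Check passed: nothing to report, carry on.
        return 0, []
    if ignore_db_check:
        # Corresponds to running with '-d': note the failure and continue.
        return 0, ["Ignoring Database integrity checks..."]
    # Check failed and was not overridden: abort before stopping services.
    message = (
        f"Failed to validate DB's integrity. Exit code: {check_exit_code}. "
        "Use '-d' option to force ignore this check."
    )
    return EXIT_DB_INTEGRITY_FAILURE, [message]


if __name__ == "__main__":
    print(preboot_db_gate(1))
    print(preboot_db_gate(1, ignore_db_check=True))
    print(preboot_db_gate(0))
```

Placing the gate before any service is paused or stopped matches the transcripts, where the failure is reported immediately after the counters folder is saved.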

Good case:

root@str-s6100-acs-2:~# warm-reboot -vvvv
Fri 10 Sep 2021 06:56:20 PM UTC Saving counters folder before warmboot...
Fri 10 Sep 2021 06:56:29 PM UTC Pausing orchagent ...
Fri 10 Sep 2021 06:56:29 PM UTC Collecting logs to check ssd health before warm-reboot...
Fri 10 Sep 2021 06:56:29 PM UTC Stopping lldp ...
Fri 10 Sep 2021 06:56:32 PM UTC Stopped lldp
Fri 10 Sep 2021 06:56:32 PM UTC Stopping nat ...
Dumping conntrack entries failed
Warning: The unit file, source configuration file or drop-ins of nat.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Fri 10 Sep 2021 06:56:33 PM UTC Stopped nat
Fri 10 Sep 2021 06:56:33 PM UTC Stopping radv ...
Fri 10 Sep 2021 06:56:33 PM UTC Stopped radv
Fri 10 Sep 2021 06:56:34 PM UTC Stopping sflow ...
Warning: The unit file, source configuration file or drop-ins of sflow.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Fri 10 Sep 2021 06:56:34 PM UTC Stopped sflow
Fri 10 Sep 2021 06:56:34 PM UTC Stopping bgp ...
Fri 10 Sep 2021 06:56:38 PM UTC Stopped bgp
Fri 10 Sep 2021 06:56:38 PM UTC Stopping swss ...
Fri 10 Sep 2021 06:56:48 PM UTC Stopped swss
Fri 10 Sep 2021 06:56:48 PM UTC Initialize pre-shutdown ...
Fri 10 Sep 2021 06:56:48 PM UTC Requesting pre-shutdown ...
Fri 10 Sep 2021 06:56:49 PM UTC Waiting for pre-shutdown ...
Fri 10 Sep 2021 06:56:52 PM UTC Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri 10 Sep 2021 06:56:52 PM UTC Backing up database ...
Fri 10 Sep 2021 06:56:53 PM UTC Stopping teamd ...
Fri 10 Sep 2021 06:57:00 PM UTC Stopped teamd
Fri 10 Sep 2021 06:57:00 PM UTC Stopping syncd ...
Fri 10 Sep 2021 06:57:13 PM UTC Stopped syncd
Fri 10 Sep 2021 06:57:13 PM UTC Stopping all remaining containers ...
Fri 10 Sep 2021 06:57:17 PM UTC Stopped all remaining containers ...
Fri 10 Sep 2021 06:57:18 PM UTC updating ssd fw forwarm-reboot
Fri 10 Sep 2021 06:57:18 PM UTC Enabling Watchdog before warm-reboot
Watchdog armed for 180 seconds
Fri 10 Sep 2021 06:57:19 PM UTC Running x86_64-dell_s6100_c2538-r0 specific plugin...
Fri 10 Sep 2021 06:57:19 PM UTC Rebooting with /sbin/kexec -e to SONiC-OS-master.34161-dirty-20210907.143917 ...

SYSLOG:

Sep 10 18:56:20.820803 str-s6100-acs-2 NOTICE admin: Saving counters folder before warmboot...
Sep 10 18:56:27.269754 str-s6100-acs-2 DEBUG /check_db_integrity.py: Database integrity checks passed.    <<<--------------------------------------- Check passed

lgtm-com bot commented Aug 27, 2021

This pull request introduces 1 alert when merging 096061e into f5ce87a - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

lgtm-com bot commented Aug 31, 2021

This pull request introduces 1 alert when merging 948f587 into 720b650 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

lgtm-com bot commented Sep 9, 2021

This pull request introduces 1 alert when merging 04b5889 into 2b12aad - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

@vaibhavhd

/Azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@vaibhavhd vaibhavhd merged commit c007d65 into sonic-net:master Sep 13, 2021
@vaibhavhd vaibhavhd deleted the warmboot-db-integrity branch September 13, 2021 19:45
judyjoseph pushed a commit that referenced this pull request Sep 14, 2021
…#1785)

@qiluo-msft

This PR could not be cleanly cherry-picked to 202012. Please submit another PR.

vaibhavhd added a commit that referenced this pull request Sep 18, 2021
The script was added in the PR #1785 which did not add this script to the setup.py script.
Added the check_db_integrity script to setup.py.
judyjoseph pushed a commit that referenced this pull request Sep 19, 2021
vaibhavhd added a commit that referenced this pull request Sep 28, 2021
…1839)

Porting changes from master PRs- #1785, #1828. The PR on master cannot be cherrypicked cleanly, hence a separate PR for 202012:
Verify database integrity before proceeding with warm reboot or fast reboot.
This integrity check uses a JSON schema to validate DBs. To start with, only counters_db's table COUNTERS_PORT_NAME_MAP presence is verified. But, this list can advance in future.
The test logic is designed to be generic; any more databases or tables within them can be just added to schema list, and the verification logic needs no change.