Description
Containers and interfaces flap after an upgrade performed with fast-reboot (using the fast-reboot over warm-reboot infrastructure) when the upgrade involves a FW update.
This happens because the fast-reboot script writes the Fast-reboot|system entry to state_db with a 180-second timer, after which the entry is removed. The FW update takes time (about 120 seconds on MLNX/NVDA platforms); combined with the teardown and startup phases, the timer expires before syncd is started. When fast-reboot is used, the syncd_init_common script verifies that the fast-reboot state_db entry still exists; if it does not, it falls back to cold-reboot.
Because fast-reboot uses the warm-reboot infrastructure, syncd enters the performStartupLogic function in syncd.cpp with m_isWarmStart set to true while, due to the fallback to cold-reboot, m_commandLineOptions->m_startType = cold. The flow in the function changes the start type to warm, then looks for warmBootReadFile and fails, and finally falls back to cold-reboot, which breaks the startup flow. On MLNX/NVDA platforms this led to container and interface flaps.
A workaround is to manually extend the timer on the switch before running fast-reboot.
A temporary fix is to extend the timer to 210 seconds. The long-term solution is to use a fast-reboot finalizer (based on sonic-net/sonic-swss-common#691) that removes the entry from state_db instead of relying on the timer.
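The timing argument can be sketched as simple arithmetic. This is only an illustration: the 180-second timer and the roughly 120-second FW update come from this report, while the teardown and startup durations and the function name are hypothetical placeholders.

```python
def entry_survives(ttl_s, teardown_s, fw_update_s, startup_s):
    """Return True if the state_db entry would still exist when syncd starts.

    All durations are in seconds and are measured from the moment the
    entry is written with its timer.
    """
    elapsed = teardown_s + fw_update_s + startup_s
    return elapsed < ttl_s


# 120 s FW update is from the report; 30 s teardown and 40 s startup are
# made-up values chosen only to illustrate the race.
print(entry_survives(ttl_s=180, teardown_s=30, fw_update_s=120, startup_s=40))  # False
print(entry_survives(ttl_s=210, teardown_s=30, fw_update_s=120, startup_s=40))  # True
```

With these assumed numbers, 190 seconds elapse before syncd starts, so a 180-second timer has already expired while a 210-second timer has not.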
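The sequence of start types described above can be modeled roughly as follows. This is a simplified Python model of the behavior as reported here, not the actual syncd.cpp code; the function name, parameters, and string values are assumptions made for illustration.

```python
import os


def resolve_start_type(is_warm_start, cmdline_start_type, warm_boot_read_file):
    """Model the start-type transitions described in this report.

    Returns the list of start types the flow passes through, in order.
    """
    trace = [cmdline_start_type]
    start_type = cmdline_start_type

    # The warm-start flag overrides a non-warm start type from the command line.
    if is_warm_start and start_type != "warm":
        start_type = "warm"
        trace.append(start_type)

    # A warm start needs a readable warm-boot file; otherwise fall back to cold.
    if start_type == "warm":
        missing = warm_boot_read_file is None or not os.path.exists(warm_boot_read_file)
        if missing:
            start_type = "cold"
            trace.append(start_type)

    return trace


# Expired fast-reboot entry: the init script already fell back to cold,
# but the warm-start flag is still set and no warm-boot file exists.
print(resolve_start_type(True, "cold", None))  # ['cold', 'warm', 'cold']
```

The cold to warm to cold sequence in the output is the part that matches the broken startup flow reported above.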
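The difference between the timer-based and finalizer-based approaches can be illustrated with a small in-memory model. This is a sketch under stated assumptions: StateDbModel is a hypothetical stand-in for state_db, the key is spelled as in this report, and the 190-second startup time is invented for the example. It does not reflect the real finalizer implementation.

```python
class StateDbModel:
    """In-memory stand-in for state_db with optional per-key expiry."""

    def __init__(self):
        self._entries = {}  # key -> expiry timestamp, or None for no expiry

    def set(self, key, now, ttl_s=None):
        self._entries[key] = None if ttl_s is None else now + ttl_s

    def delete(self, key):
        self._entries.pop(key, None)

    def exists(self, key, now):
        if key not in self._entries:
            return False
        expiry = self._entries[key]
        return expiry is None or now < expiry


KEY = "Fast-reboot|system"  # spelled as in this report
SYNCD_START = 190           # hypothetical seconds until syncd starts

# Timer-based: the entry expires on its own, possibly before syncd starts.
timer_db = StateDbModel()
timer_db.set(KEY, now=0, ttl_s=180)
print(timer_db.exists(KEY, now=SYNCD_START))  # False

# Finalizer-based: the entry stays until something removes it explicitly.
finalizer_db = StateDbModel()
finalizer_db.set(KEY, now=0)
print(finalizer_db.exists(KEY, now=SYNCD_START))  # True
finalizer_db.delete(KEY)  # the finalizer runs once the reboot has completed
print(finalizer_db.exists(KEY, now=SYNCD_START + 60))  # False
```

In the finalizer variant the lifetime of the entry no longer depends on how long the FW update takes.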
Steps to reproduce the issue:
- Install a SONiC version containing the fast-reboot over warm-reboot enhancement.
- Install a new SONiC image that requires a FW update.
- Run fast-reboot after installing.
Describe the results you received:
Containers and interfaces flapping after upgrade.
Describe the results you expected:
No flaps after upgrade with fast-reboot.
Output of show version:
Upgrade base version:
SONiC Software Version: SONiC.202205.42-ea51d9514_Internal
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: ea51d9514
Build date: Fri Oct 7 05:45:56 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-243
Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08595
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 16:07:04 up 2 min, 1 user, load average: 1.77, 1.06, 0.43
Date: Mon 09 Jan 2023 16:07:04
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx 202205.42-ea51d9514_Internal 97dd12cebe1e 859MB
docker-syncd-mlnx latest 97dd12cebe1e 859MB
docker-orchagent 202205.42-ea51d9514_Internal 347f0cdc723f 478MB
docker-orchagent latest 347f0cdc723f 478MB
docker-fpm-frr 202205.42-ea51d9514_Internal ddadceae2d69 488MB
docker-fpm-frr latest ddadceae2d69 488MB
docker-teamd 202205.42-ea51d9514_Internal 28f79f968d3c 459MB
docker-teamd latest 28f79f968d3c 459MB
docker-platform-monitor 202205.42-ea51d9514_Internal 629c9ea03cf2 861MB
docker-platform-monitor latest 629c9ea03cf2 861MB
docker-macsec latest a7ea8b95281f 461MB
docker-snmp 202205.42-ea51d9514_Internal 0e96a62d07ee 488MB
docker-snmp latest 0e96a62d07ee 488MB
docker-dhcp-relay latest 8cef09a39edf 452MB
docker-lldp 202205.42-ea51d9514_Internal 337146c6b971 485MB
docker-lldp latest 337146c6b971 485MB
docker-mux 202205.42-ea51d9514_Internal 464339799d55 492MB
docker-mux latest 464339799d55 492MB
docker-sonic-telemetry 202205.42-ea51d9514_Internal 7fc604d28c7c 523MB
docker-sonic-telemetry latest 7fc604d28c7c 523MB
docker-database 202205.42-ea51d9514_Internal 98a7bdcfd7e8 443MB
docker-database latest 98a7bdcfd7e8 443MB
docker-router-advertiser 202205.42-ea51d9514_Internal f05c810acb38 443MB
docker-router-advertiser latest f05c810acb38 443MB
docker-nat 202205.42-ea51d9514_Internal 272fda2cdf1a 430MB
docker-nat latest 272fda2cdf1a 430MB
docker-sflow 202205.42-ea51d9514_Internal 5723c8d63918 428MB
docker-sflow latest 5723c8d63918 428MB
docker-sonic-mgmt-framework 202205.42-ea51d9514_Internal 0fd3a3d91b98 557MB
docker-sonic-mgmt-framework latest 0fd3a3d91b98 557MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.3.1-202205 4e8b9199b984 643MB
Target version (example; this also happens when upgrading to other versions that require a FW update):
SONiC Software Version: SONiC.202205.88-0693662da_Internal
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: 0693662da
Build date: Thu Jan 5 22:23:07 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-242
Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08595
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 18:28:09 up 5 min, 1 user, load average: 0.78, 0.66, 0.35
Date: Mon 09 Jan 2023 18:28:09
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent 202205.88-0693662da_Internal 4254b06254c9 518MB
docker-orchagent latest 4254b06254c9 518MB
docker-fpm-frr 202205.88-0693662da_Internal 2f49c95b6049 529MB
docker-fpm-frr latest 2f49c95b6049 529MB
docker-teamd 202205.88-0693662da_Internal dc1e2cad03eb 500MB
docker-teamd latest dc1e2cad03eb 500MB
docker-macsec 202205.88-0693662da_Internal 24025905077e 502MB
docker-syncd-mlnx 202205.88-0693662da_Internal ffedd2906306 902MB
docker-syncd-mlnx latest ffedd2906306 902MB
docker-platform-monitor 202205.88-0693662da_Internal 9424506bab73 907MB
docker-platform-monitor latest 9424506bab73 907MB
docker-snmp 202205.88-0693662da_Internal 91a76c339e2b 528MB
docker-snmp latest 91a76c339e2b 528MB
docker-dhcp-relay 202205.88-0693662da_Internal a673c28f0b41 492MB
docker-sonic-telemetry 202205.88-0693662da_Internal 5c4eaee05c76 563MB
docker-sonic-telemetry latest 5c4eaee05c76 563MB
docker-lldp 202205.88-0693662da_Internal 3cb482e329d0 525MB
docker-lldp latest 3cb482e329d0 525MB
docker-database 202205.88-0693662da_Internal c220331fd92b 483MB
docker-database latest c220331fd92b 483MB
docker-mux 202205.88-0693662da_Internal 031e5242ddeb 531MB
docker-mux latest 031e5242ddeb 531MB
docker-router-advertiser 202205.88-0693662da_Internal a9989dd54119 483MB
docker-router-advertiser latest a9989dd54119 483MB
docker-sonic-mgmt-framework 202205.88-0693662da_Internal c21cbbdff887 598MB
docker-sonic-mgmt-framework latest c21cbbdff887 598MB
docker-nat 202205.88-0693662da_Internal 44ef7aa226b0 471MB
docker-nat latest 44ef7aa226b0 471MB
docker-sflow 202205.88-0693662da_Internal 94414c247b96 469MB
docker-sflow latest 94414c247b96 469MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doroce 1.0.1-202205 111468a02d75 200MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.3.2-202205 f466330be644 310MB
Output of show techsupport:
(paste your output here or download and attach the file here )
Additional information you deem important (e.g. issue happens only occasionally):
The relevant flow in syncd.cpp:
The timer check in syncd_init_common:
Setting the timer in fast-reboot:
The PR that added the timer: #3741
The issue that required adding the timer: #3697