Containers and interfaces flapping after upgrade with fast-reboot #13251

Open
@arfeigin

Description

Containers and interfaces flap after an upgrade with fast-reboot (using the fast-reboot over warm-reboot infrastructure) when the upgrade involves a FW update.
This happens because the fast-reboot script writes the Fast-reboot|system entry to state_db together with a 180-second timer that removes the entry. The FW update takes time (about 120 seconds on MLNX/NVDA platforms), and together with the teardown and startup phases this means the timer expires before syncd is started. When fast-reboot is used, the syncd_init_common script verifies that the fast-reboot state_db entry still exists; if it does not, the script falls back to cold-reboot.
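The timing problem can be sketched as a simple budget check. This is only an illustrative model, not the actual script logic: the class and function names are hypothetical, and the teardown and startup durations are assumed values chosen for the example. Only the 180-second timer and the roughly 120-second FW update come from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebootPhases:
    """Phase durations in seconds. Only fw_update_s is taken from the report."""
    teardown_s: float   # assumed value, not measured
    fw_update_s: float  # about 120 s on MLNX/NVDA platforms per the report
    startup_s: float    # assumed value, not measured

    def time_until_syncd_start(self) -> float:
        return self.teardown_s + self.fw_update_s + self.startup_s


def resolve_boot_type(requested: str, entry_ttl_s: float, phases: RebootPhases) -> str:
    """Model of the described check: fast-reboot is honoured only while the
    state_db entry still exists, otherwise the boot type falls back to cold."""
    if requested != "fast":
        return requested
    entry_still_exists = phases.time_until_syncd_start() < entry_ttl_s
    return "fast" if entry_still_exists else "cold"


# Hypothetical teardown/startup figures, with and without a FW update.
with_fw_update = RebootPhases(teardown_s=30, fw_update_s=120, startup_s=40)
without_fw_update = RebootPhases(teardown_s=30, fw_update_s=0, startup_s=40)

print(resolve_boot_type("fast", 180, with_fw_update))
print(resolve_boot_type("fast", 180, without_fw_update))
```

With these assumed figures the FW update pushes the elapsed time past the 180-second timer, so the first call resolves to a cold boot while the second keeps the fast boot type.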
Since fast-reboot uses the warm-reboot infrastructure, syncd enters the performStartupLogic function in syncd.cpp with m_isWarmStart set to true, while the fallback to cold-reboot has already set m_commandLineOptions->m_startType = cold. The flow in the function first changes the start type to warm, then looks for warmBootReadFile and fails to find it, and finally falls back to cold-reboot, which leaves the startup flow broken. On MLNX/NVDA platforms this led to container and interface flaps.
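The sequence of start type changes described above can be modelled roughly as follows. This is a simplified Python sketch of the flow as described in this report, not the actual C++ code in syncd.cpp: the function signature, the StartType enum, and the path argument are hypothetical, and only the cold case from the report is modelled.

```python
import os
from enum import Enum
from typing import Optional


class StartType(Enum):
    COLD = "cold"
    WARM = "warm"
    FAST = "fast"


def perform_startup_logic(
    is_warm_start: bool,
    start_type: StartType,
    warm_boot_read_file: Optional[str],
) -> StartType:
    """Simplified model of the start type handling described in the report."""
    # Fast-reboot over warm-reboot infrastructure arrives with the warm flag set,
    # so a cold start type is overridden to warm. (Assumption: other start types
    # are left untouched; the report only describes the cold case.)
    if is_warm_start and start_type == StartType.COLD:
        start_type = StartType.WARM

    # A warm start needs the warm boot read file. If it is missing,
    # the start type falls back to cold.
    if start_type == StartType.WARM:
        if warm_boot_read_file is None or not os.path.exists(warm_boot_read_file):
            start_type = StartType.COLD

    return start_type


# The failing case: warm flag set, cold start type, no warm boot file present.
result = perform_startup_logic(True, StartType.COLD, "/nonexistent/warm_boot_read_file")
print(result.value)
```

The final value is cold again, but only after the intermediate switch to warm, which is the cold to warm to cold sequence that the report ties to the broken startup flow.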
As a workaround, the timer can be extended manually on the switch before running fast-reboot.
A temporary fix is to extend the timer to 210 seconds. The long-term solution is to use the fast-reboot finalizer (based on sonic-net/sonic-swss-common#691), which removes the entry from state_db explicitly instead of relying on the timer.
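The difference between the timer-based removal and a finalizer-style removal can be illustrated with a small in-memory model. Everything here is a hypothetical stand-in: FakeStateDb is not a real state_db client, the elapsed times are assumed values, and the finalizer is reduced to an explicit delete once the boot is considered done.

```python
from typing import Callable, Dict, Optional


class FakeStateDb:
    """In-memory stand-in for state_db keys with optional expiry (not a real client)."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._deadlines: Dict[str, Optional[float]] = {}

    def set(self, key: str, ttl_s: Optional[float] = None) -> None:
        # ttl_s=None means the key never expires on its own.
        self._deadlines[key] = None if ttl_s is None else self._clock() + ttl_s

    def delete(self, key: str) -> None:
        self._deadlines.pop(key, None)

    def exists(self, key: str) -> bool:
        if key not in self._deadlines:
            return False
        deadline = self._deadlines[key]
        return deadline is None or self._clock() < deadline


KEY = "Fast-reboot|system"  # entry name as written in this report


def entry_seen_by_syncd(ttl_s: Optional[float], time_to_syncd_s: float) -> bool:
    """Return whether the entry still exists when the pre-syncd check runs."""
    now = [0.0]
    db = FakeStateDb(clock=lambda: now[0])
    db.set(KEY, ttl_s)          # written by the fast-reboot script
    now[0] += time_to_syncd_s   # teardown + FW update + startup
    seen = db.exists(KEY)       # the check made before syncd starts
    db.delete(KEY)              # finalizer-style explicit removal after boot
    return seen


# 190 s is an assumed elapsed time for a reboot that includes a FW update.
print(entry_seen_by_syncd(180, 190))   # current timer
print(entry_seen_by_syncd(210, 190))   # temporary fix
print(entry_seen_by_syncd(None, 600))  # no timer, removal left to the finalizer
```

Extending the timer only moves the threshold, so a slower FW update could still outlast it. Removing the entry explicitly makes the check independent of how long the reboot takes.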

Steps to reproduce the issue:

  1. Install a SONiC version containing the fast-reboot over warm-reboot enhancement.
  2. Install a new SONiC image that requires a FW update.
  3. Run fast-reboot after installing.

Describe the results you received:

Containers and interfaces flapping after upgrade.

Describe the results you expected:

No flaps after upgrade with fast-reboot.

Output of show version:

Upgrade base version:

SONiC Software Version: SONiC.202205.42-ea51d9514_Internal
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: ea51d9514
Build date: Fri Oct  7 05:45:56 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08595
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 16:07:04 up 2 min,  1 user,  load average: 1.77, 1.06, 0.43
Date: Mon 09 Jan 2023 16:07:04

Docker images:
REPOSITORY                                         TAG                            IMAGE ID       SIZE
docker-syncd-mlnx                                  202205.42-ea51d9514_Internal   97dd12cebe1e   859MB
docker-syncd-mlnx                                  latest                         97dd12cebe1e   859MB
docker-orchagent                                   202205.42-ea51d9514_Internal   347f0cdc723f   478MB
docker-orchagent                                   latest                         347f0cdc723f   478MB
docker-fpm-frr                                     202205.42-ea51d9514_Internal   ddadceae2d69   488MB
docker-fpm-frr                                     latest                         ddadceae2d69   488MB
docker-teamd                                       202205.42-ea51d9514_Internal   28f79f968d3c   459MB
docker-teamd                                       latest                         28f79f968d3c   459MB
docker-platform-monitor                            202205.42-ea51d9514_Internal   629c9ea03cf2   861MB
docker-platform-monitor                            latest                         629c9ea03cf2   861MB
docker-macsec                                      latest                         a7ea8b95281f   461MB
docker-snmp                                        202205.42-ea51d9514_Internal   0e96a62d07ee   488MB
docker-snmp                                        latest                         0e96a62d07ee   488MB
docker-dhcp-relay                                  latest                         8cef09a39edf   452MB
docker-lldp                                        202205.42-ea51d9514_Internal   337146c6b971   485MB
docker-lldp                                        latest                         337146c6b971   485MB
docker-mux                                         202205.42-ea51d9514_Internal   464339799d55   492MB
docker-mux                                         latest                         464339799d55   492MB
docker-sonic-telemetry                             202205.42-ea51d9514_Internal   7fc604d28c7c   523MB
docker-sonic-telemetry                             latest                         7fc604d28c7c   523MB
docker-database                                    202205.42-ea51d9514_Internal   98a7bdcfd7e8   443MB
docker-database                                    latest                         98a7bdcfd7e8   443MB
docker-router-advertiser                           202205.42-ea51d9514_Internal   f05c810acb38   443MB
docker-router-advertiser                           latest                         f05c810acb38   443MB
docker-nat                                         202205.42-ea51d9514_Internal   272fda2cdf1a   430MB
docker-nat                                         latest                         272fda2cdf1a   430MB
docker-sflow                                       202205.42-ea51d9514_Internal   5723c8d63918   428MB
docker-sflow                                       latest                         5723c8d63918   428MB
docker-sonic-mgmt-framework                        202205.42-ea51d9514_Internal   0fd3a3d91b98   557MB
docker-sonic-mgmt-framework                        latest                         0fd3a3d91b98   557MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.3.1-202205                   4e8b9199b984   643MB

Upgrade target version (example; this also happens when upgrading to other versions that require a FW update):

SONiC Software Version: SONiC.202205.88-0693662da_Internal
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: 0693662da
Build date: Thu Jan  5 22:23:07 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-242

Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08595
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 18:28:09 up 5 min,  1 user,  load average: 0.78, 0.66, 0.35
Date: Mon 09 Jan 2023 18:28:09

Docker images:
REPOSITORY                                         TAG                            IMAGE ID       SIZE
docker-orchagent                                   202205.88-0693662da_Internal   4254b06254c9   518MB
docker-orchagent                                   latest                         4254b06254c9   518MB
docker-fpm-frr                                     202205.88-0693662da_Internal   2f49c95b6049   529MB
docker-fpm-frr                                     latest                         2f49c95b6049   529MB
docker-teamd                                       202205.88-0693662da_Internal   dc1e2cad03eb   500MB
docker-teamd                                       latest                         dc1e2cad03eb   500MB
docker-macsec                                      202205.88-0693662da_Internal   24025905077e   502MB
docker-syncd-mlnx                                  202205.88-0693662da_Internal   ffedd2906306   902MB
docker-syncd-mlnx                                  latest                         ffedd2906306   902MB
docker-platform-monitor                            202205.88-0693662da_Internal   9424506bab73   907MB
docker-platform-monitor                            latest                         9424506bab73   907MB
docker-snmp                                        202205.88-0693662da_Internal   91a76c339e2b   528MB
docker-snmp                                        latest                         91a76c339e2b   528MB
docker-dhcp-relay                                  202205.88-0693662da_Internal   a673c28f0b41   492MB
docker-sonic-telemetry                             202205.88-0693662da_Internal   5c4eaee05c76   563MB
docker-sonic-telemetry                             latest                         5c4eaee05c76   563MB
docker-lldp                                        202205.88-0693662da_Internal   3cb482e329d0   525MB
docker-lldp                                        latest                         3cb482e329d0   525MB
docker-database                                    202205.88-0693662da_Internal   c220331fd92b   483MB
docker-database                                    latest                         c220331fd92b   483MB
docker-mux                                         202205.88-0693662da_Internal   031e5242ddeb   531MB
docker-mux                                         latest                         031e5242ddeb   531MB
docker-router-advertiser                           202205.88-0693662da_Internal   a9989dd54119   483MB
docker-router-advertiser                           latest                         a9989dd54119   483MB
docker-sonic-mgmt-framework                        202205.88-0693662da_Internal   c21cbbdff887   598MB
docker-sonic-mgmt-framework                        latest                         c21cbbdff887   598MB
docker-nat                                         202205.88-0693662da_Internal   44ef7aa226b0   471MB
docker-nat                                         latest                         44ef7aa226b0   471MB
docker-sflow                                       202205.88-0693662da_Internal   94414c247b96   469MB
docker-sflow                                       latest                         94414c247b96   469MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doroce      1.0.1-202205                   111468a02d75   200MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.3.2-202205                   f466330be644   310MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

The relevant flow in syncd.cpp:
image

The timer check in syncd_init_common:
image

Setting the timer in fast-reboot:
image

The PR that added the timer: #3741
The issue that required adding the timer: #3697
