System recovery when syncd crashes #3517

Closed
wants to merge 1 commit into from

Prasanth-KV

When syncd terminates unexpectedly, the system enters an unstable state and stays in that state until the user triggers a reboot.
It does not recover by itself. This change addresses that problem.

- What I did

- How I did it
When the syncd process crashes, syncd cannot be restarted alone because it depends on orch-agent.
This change therefore reuses the mechanism that already handles a SAI API call failure. The syncd process state is monitored, and when syncd crashes, a shutdown notification is sent to orch-agent, which eventually results in SWSS and syncd being restarted to recover the system.
- How to verify it
Kill syncd daemon and check if the system recovers.
- Description for the changelog
When the syncd process crashes, syncd cannot be restarted alone because it depends on orch-agent.
The syncd process state is monitored, and when syncd crashes, a shutdown notification is sent to orch-agent, which eventually results in SWSS and syncd being restarted to recover the system.

- A picture of a cute animal (not mandatory but encouraged)
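
The monitoring approach described in this PR can be illustrated with a minimal Python sketch. It is not the code from this change: `watch_process` and `request_shutdown` are hypothetical names, and the callback only stands in for the shutdown notification sent to orch-agent. The sketch assumes the monitor holds a handle to the process it watches and treats any non-zero exit status as a crash.

```python
import subprocess
import sys
import time
from typing import Callable


def watch_process(proc: subprocess.Popen,
                  on_crash: Callable[[int], None],
                  poll_interval: float = 0.1) -> int:
    """Block until `proc` exits; call `on_crash` if the exit was abnormal.

    Hypothetical helper: a non-zero exit status (including death by
    signal, which Popen reports as a negative value) counts as a crash.
    """
    while proc.poll() is None:
        time.sleep(poll_interval)
    code = proc.returncode
    if code != 0:
        on_crash(code)
    return code


def request_shutdown(exit_code: int) -> None:
    # Hypothetical stand-in for the shutdown notification to orch-agent.
    print(f"monitored process exited with status {exit_code}; "
          "requesting shutdown")
```

A caller would start the monitored process with `subprocess.Popen(...)` and pass the handle to `watch_process(proc, request_shutdown)`. Note that this design relies on the notified component acting on the notification.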

@jleveque
Contributor

@Prasanth-KV: I am currently working on a more general solution for this exact problem which will work with all syncd containers, regardless of ASIC type.

Also, instead of notifying orchagent to shut down, I am configuring the swss service to exit if either the swss OR syncd container exits. My solution will kill swss even if orchagent is, for some reason, unresponsive. Do you see any advantages to your approach over mine?
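
The "exit if either one exits" idea can be sketched in Python as follows. This is an illustration only, not the implementation in the referenced PR: it uses plain child processes as stand-ins for the swss and syncd containers, and `wait_for_first_exit` and `stop_all` are hypothetical names. The supervisor only observes exit status, so it does not need any cooperation from the supervised processes.

```python
import subprocess
import sys
import time
from typing import Dict


def wait_for_first_exit(procs: Dict[str, subprocess.Popen],
                        poll_interval: float = 0.1) -> str:
    """Return the name of the first process observed to have exited."""
    while True:
        for name, proc in procs.items():
            if proc.poll() is not None:
                return name
        time.sleep(poll_interval)


def stop_all(procs: Dict[str, subprocess.Popen]) -> None:
    """Terminate every process that is still running, then reap them all."""
    for proc in procs.values():
        if proc.poll() is None:
            proc.terminate()
    for proc in procs.values():
        proc.wait()
```

A supervisor built this way would call `wait_for_first_exit` on both handles and then `stop_all`, so that a service manager can restart the whole group together.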

@jleveque
Contributor

@Prasanth-KV: I have raised my PR here: #3534

Please take a look, as I suggest closing this PR in favor of that one.

@lguohan
Collaborator

lguohan commented Nov 9, 2019

closing this pr in favor of #3534

@lguohan lguohan closed this Nov 9, 2019
mssonicbld added a commit that referenced this pull request Jun 6, 2025
…lly (#22790)

#### Why I did it
src/sonic-swss
```
* 5e07127a - (HEAD -> master, origin/master, origin/HEAD) [dashhaorch]: Fix error: stack protector not protecting local variables: variable length buffer (#3643) (4 hours ago) [Nazarii Hnydyn]
* d589d8d9 - [swss]: IcmpOrch to support ICMP session offload to ASIC (#3535) (6 hours ago) [manamand2020]
* f05e8e9e - [SRv6] add MySID counters support (#3601) (6 hours ago) [Yakiv Huryk]
* a0bd39e5 - Skip "port doesn't exist" SWSS_LOG_INFO messages for local ports (#3553) (31 hours ago) [HP]
* 74b2cc61 - [ci]: Skip publishing of asan vstest summary (#3669) (32 hours ago) [prabhataravind]
* 398161b4 - [Dynamic Buffer][Mellanox] Fix an issue when handling 2-digit queue ID in the Lua plugin (#3588) (2 days ago) [Stephen Sun]
* 7106cc0a - Fixing macsecmgrd memory corruption (#3611) (2 days ago) [sivanuka-arista]
* e830a491 - [fpmsyncd]Fixing blackhole route to publish protocol field to APPL_DB (#3655) (2 days ago) [Sudharsan Dhamal Gopalarathnam]
* de5b8e51 - Setting default nexthop weight to 1 in `fpmsyncd` (#3636) (3 days ago) [mramezani95]
* 176bcea9 - Change Log Level for BFD Offload Capability Implementation (#3641) (3 days ago) [Sai Rama Mohan Reddy S]
* f9f7ff0e - Fix NextHopGroupEntry class data member not initialized bug (#3644) (3 days ago) [Hua Liu]
* c8c597cf - Install symlink to Python 3 to work around AzP diff coverage issue (#3670) (6 days ago) [Saikrishna Arcot]
* 3a5efa38 - [tests]: Fix `test_MirrorDestMoveLag` test failure (#3639) (6 days ago) [Carmine Scarpitta]
* 13d559d4 - Revert "Set Port UPDATE_DSCP attribute when TC_TO_DSCP map is attached (#3517)" (#3666) (6 days ago) [Kumaresh Perumal]
* 1c601cb8 - Changes to unblock swss pipeline tests (#3664) (7 days ago) [prabhataravind]
* b31500b2 - [build] Support optionally using other container registries instead of DockerHub (#3668) (7 days ago) [Saikrishna Arcot]
```
#### How I did it
#### How to verify it
#### Description for the changelog