[Dualtor] Overlapping config load_minigraphs leaves the linkmgrd's packet sockets in invalidated state #19855

Closed
lolyu opened this issue Aug 8, 2024 · 1 comment
@lolyu
Contributor

lolyu commented Aug 8, 2024

Description

If two config load_minigraph runs overlap, the second run might fail to restart the mux container when the first run has already started it and it is still in the activating state:
[image]
The packet sockets opened by linkmgrd are then invalidated (Iface is -1 in /proc/net/packet), and any I/O on them fails with ENXIO:

admin@str2-7050cx3-acs-06:~$ cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000e387b4b8 2      3    0003   -1    0 0      0      3905243
000000004de5cd30 2      3    0003   -1    0 0      0      3905245
0000000025b13c8e 2      3    0003   -1    0 0      0      3906908
00000000eb4e03d3 2      3    0003   -1    0 0      0      3905247
0000000040e9cf12 2      3    0003   -1    0 0      0      3906910
00000000b50da572 2      3    0003   -1    0 0      0      3905250
00000000cdd99133 2      3    0003   -1    0 0      0      3905251
00000000610eddd1 2      3    0003   -1    0 0      0      3906953
00000000bcca7a88 2      3    0003   -1    0 0      0      3905256
000000006cdb1e05 2      3    0003   -1    0 0      0      3905325
000000008cfcecea 2      3    0003   -1    0 0      0      3905326
000000007440716d 2      3    0003   -1    0 0      0      3905811
00000000656aa72e 2      3    0003   -1    0 0      0      3906984
0000000001fd475e 2      3    0003   -1    0 0      0      3905339
0000000013268c02 2      3    0003   -1    0 0      0      3905341
00000000aad1e9f6 2      3    0003   -1    0 0      0      3906986
00000000af21576f 2      3    0003   -1    0 0      0      3906988
00000000634b790f 2      3    0003   -1    0 0      0      3907626
00000000467c795f 2      3    0003   -1    0 0      0      3905360
000000005dd925d8 2      3    0003   -1    0 0      0      3907009
00000000a543c7b6 2      3    0003   -1    0 0      0      3907011
0000000061394386 2      3    0003   -1    0 0      0      3905945
0000000083b64f36 2      3    0003   -1    0 0      0      3907041
00000000a98e1f3f 2      3    0003   -1    0 0      0      3907071
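
The check above can also be scripted. The following is a minimal sketch, assuming the nine-column /proc/net/packet layout shown in the output; the names PacketSocket, parse_proc_net_packet, and unbound_sockets are hypothetical helpers and not part of linkmgrd or SONiC.

```python
from typing import List, NamedTuple


class PacketSocket(NamedTuple):
    """One data row of /proc/net/packet (only the fields used here)."""
    sk: str
    iface: int
    inode: int


def parse_proc_net_packet(text: str) -> List[PacketSocket]:
    """Parse the text of /proc/net/packet into PacketSocket rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Columns: sk RefCnt Type Proto Iface R Rmem User Inode.
        # Skip the header row and anything that does not have 9 columns.
        if len(fields) != 9 or fields[0] == "sk":
            continue
        rows.append(
            PacketSocket(sk=fields[0], iface=int(fields[4]), inode=int(fields[8]))
        )
    return rows


def unbound_sockets(rows: List[PacketSocket]) -> List[PacketSocket]:
    """Return the sockets whose Iface column is -1 (no bound interface)."""
    return [row for row in rows if row.iface == -1]
```

To use it on a device, read /proc/net/packet into a string, pass it to parse_proc_net_packet, and list the inodes returned by unbound_sockets. Note that this reports every packet socket with Iface -1, not only those owned by linkmgrd.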

Steps to reproduce the issue:

This can be reproduced by:

  1. Make the mux service sleep for 30s during its ExecStartPre phase by adding a sleep to the unit:
admin@lab-dev:~$ sudo systemctl cat mux.service
# /etc/systemd/system/mux.service
[Unit]
Description=MUX Cable Container
Requires=database.service updategraph.service swss.service
After=swss.service interfaces-config.service
BindsTo=sonic.target
After=sonic.target

[Service]
ExecStartPre=/usr/local/bin/write_standby.py -r
ExecStartPre=/usr/local/bin/mark_dhcp_packet.py
ExecStartPre=/usr/bin/mux.sh start
ExecStartPre=/usr/bin/sleep 30
ExecStart=/usr/bin/mux.sh wait
ExecStop=/usr/bin/mux.sh stop
ExecStopPost=/usr/local/bin/write_standby.py --shutdown mux
  2. Restart mux.service and confirm that it is stuck in the “activating” state because of the sleep:
admin@lab-dev:~$ sudo systemctl restart mux.service --no-block
admin@lab-dev:~$ systemctl status mux.service
● mux.service - MUX Cable Container
     Loaded: loaded (/etc/systemd/system/mux.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/mux.service.d
             └─auto_restart.conf
     Active: activating (start-pre) since Tue 2024-08-06 03:10:30 UTC; 6s ago
    Process: 586122 ExecStartPre=/usr/local/bin/write_standby.py -r (code=exited, status=0/SUCCESS)
    Process: 586216 ExecStartPre=/usr/local/bin/mark_dhcp_packet.py (code=exited, status=0/SUCCESS)
    Process: 586386 ExecStartPre=/usr/bin/mux.sh start (code=exited, status=0/SUCCESS)
Cntrl PID: 586439 (sleep)
      Tasks: 1 (limit: 19126)
     Memory: 152.0K
     CGroup: /system.slice/mux.service
             └─586439 /usr/bin/sleep 30

  3. Restart sonic.target and verify that the linkmgrd packet sockets become invalidated (Iface is -1):
admin@str2-7050cx3-acs-06:~$ sudo systemctl restart sonic.target
admin@str2-7050cx3-acs-06:~$ cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000e387b4b8 2      3    0003   -1    0 0      0      3905243
000000004de5cd30 2      3    0003   -1    0 0      0      3905245
0000000025b13c8e 2      3    0003   -1    0 0      0      3906908
00000000eb4e03d3 2      3    0003   -1    0 0      0      3905247
0000000040e9cf12 2      3    0003   -1    0 0      0      3906910
00000000b50da572 2      3    0003   -1    0 0      0      3905250
00000000cdd99133 2      3    0003   -1    0 0      0      3905251
00000000610eddd1 2      3    0003   -1    0 0      0      3906953
00000000bcca7a88 2      3    0003   -1    0 0      0      3905256
000000006cdb1e05 2      3    0003   -1    0 0      0      3905325
000000008cfcecea 2      3    0003   -1    0 0      0      3905326
000000007440716d 2      3    0003   -1    0 0      0      3905811
00000000656aa72e 2      3    0003   -1    0 0      0      3906984
0000000001fd475e 2      3    0003   -1    0 0      0      3905339
0000000013268c02 2      3    0003   -1    0 0      0      3905341
00000000aad1e9f6 2      3    0003   -1    0 0      0      3906986
00000000af21576f 2      3    0003   -1    0 0      0      3906988
00000000634b790f 2      3    0003   -1    0 0      0      3907626
00000000467c795f 2      3    0003   -1    0 0      0      3905360
000000005dd925d8 2      3    0003   -1    0 0      0      3907009
00000000a543c7b6 2      3    0003   -1    0 0      0      3907011
0000000061394386 2      3    0003   -1    0 0      0      3905945
0000000083b64f36 2      3    0003   -1    0 0      0      3907041
00000000a98e1f3f 2      3    0003   -1    0 0      0      3907071
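
The reproduction depends on restarting sonic.target while mux.service is still activating. A minimal sketch of how that precondition could be checked from a script follows, assuming the KEY=VALUE output of systemctl show with the ActiveState and SubState properties; the function names are hypothetical and this is not the logic of the eventual fix.

```python
import subprocess
from typing import Dict


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def is_activating(props: Dict[str, str]) -> bool:
    """True while the unit is starting, e.g. during an ExecStartPre command."""
    return props.get("ActiveState") == "activating"


def unit_props(unit: str) -> Dict[str, str]:
    """Query ActiveState and SubState for a unit (requires systemd)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_show_output(result.stdout)
```

Calling is_activating(unit_props("mux.service")) right after step 2 should return True for as long as the 30s sleep is running, which is the window in which step 3 needs to be issued.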

@lolyu
Contributor Author

lolyu commented Oct 22, 2024

Fixed by sonic-net/sonic-utilities#3475.

@lolyu lolyu closed this as completed Oct 22, 2024