[Dualtor] Overlapping config load_minigraphs leaves the linkmgrd's packet sockets in invalidated state #19855

Closed
@lolyu

Description

If two config load_minigraph runs overlap, the second one might fail to restart the mux container if the container was started by the first run and is still in the activating state.
The packet sockets opened by linkmgrd are then invalidated, and any I/O on them fails with ENXIO:

admin@str2-7050cx3-acs-06:~$ cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000e387b4b8 2      3    0003   -1    0 0      0      3905243
000000004de5cd30 2      3    0003   -1    0 0      0      3905245
0000000025b13c8e 2      3    0003   -1    0 0      0      3906908
00000000eb4e03d3 2      3    0003   -1    0 0      0      3905247
0000000040e9cf12 2      3    0003   -1    0 0      0      3906910
00000000b50da572 2      3    0003   -1    0 0      0      3905250
00000000cdd99133 2      3    0003   -1    0 0      0      3905251
00000000610eddd1 2      3    0003   -1    0 0      0      3906953
00000000bcca7a88 2      3    0003   -1    0 0      0      3905256
000000006cdb1e05 2      3    0003   -1    0 0      0      3905325
000000008cfcecea 2      3    0003   -1    0 0      0      3905326
000000007440716d 2      3    0003   -1    0 0      0      3905811
00000000656aa72e 2      3    0003   -1    0 0      0      3906984
0000000001fd475e 2      3    0003   -1    0 0      0      3905339
0000000013268c02 2      3    0003   -1    0 0      0      3905341
00000000aad1e9f6 2      3    0003   -1    0 0      0      3906986
00000000af21576f 2      3    0003   -1    0 0      0      3906988
00000000634b790f 2      3    0003   -1    0 0      0      3907626
00000000467c795f 2      3    0003   -1    0 0      0      3905360
000000005dd925d8 2      3    0003   -1    0 0      0      3907009
00000000a543c7b6 2      3    0003   -1    0 0      0      3907011
0000000061394386 2      3    0003   -1    0 0      0      3905945
0000000083b64f36 2      3    0003   -1    0 0      0      3907041
00000000a98e1f3f 2      3    0003   -1    0 0      0      3907071
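
To spot this state without reading the table by eye, the output above can be filtered for entries whose Iface column is -1. The following is only a rough sketch: the function names are made up for illustration, the column layout is assumed to match the header shown above, and the third sample row (Iface 5) is a hypothetical healthy entry, not taken from the device.

```python
from typing import List, NamedTuple


class PacketSocket(NamedTuple):
    sk: str
    proto: str
    ifindex: int
    inode: int


def parse_proc_net_packet(text: str) -> List[PacketSocket]:
    """Parse /proc/net/packet content (9 columns, as in the header above)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 9-column data row.
        if len(fields) != 9 or fields[0] == "sk":
            continue
        entries.append(
            PacketSocket(
                sk=fields[0],
                proto=fields[3],
                ifindex=int(fields[4]),
                inode=int(fields[8]),
            )
        )
    return entries


def invalidated_sockets(entries: List[PacketSocket]) -> List[PacketSocket]:
    """Return the entries whose Iface column is -1."""
    return [e for e in entries if e.ifindex == -1]


# First two rows are copied from the output above; the last row is a
# hypothetical healthy socket added only to exercise the filter.
SAMPLE = """\
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000e387b4b8 2      3    0003   -1    0 0      0      3905243
000000004de5cd30 2      3    0003   -1    0 0      0      3905245
00000000deadbeef 2      3    0003   5     1 0      0      1234567
"""

stale = invalidated_sockets(parse_proc_net_packet(SAMPLE))
for entry in stale:
    print(f"inode {entry.inode}: Iface -1")
print(f"{len(stale)} packet socket(s) with Iface -1")
```

On a device, the same functions could be fed the content of /proc/net/packet instead of SAMPLE.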

Steps to reproduce the issue:

This can be reproduced by:

  1. Make mux.service sleep for 30s during its ExecStartPre phase.
admin@lab-dev:~$ sudo systemctl cat mux.service
# /etc/systemd/system/mux.service
[Unit]
Description=MUX Cable Container
Requires=database.service updategraph.service swss.service
After=swss.service interfaces-config.service
BindsTo=sonic.target
After=sonic.target

[Service]
ExecStartPre=/usr/local/bin/write_standby.py -r
ExecStartPre=/usr/local/bin/mark_dhcp_packet.py
ExecStartPre=/usr/bin/mux.sh start
ExecStartPre=/usr/bin/sleep 30
ExecStart=/usr/bin/mux.sh wait
ExecStop=/usr/bin/mux.sh stop
ExecStopPost=/usr/local/bin/write_standby.py --shutdown mux
  2. Restart mux.service and confirm that it is stuck in the "activating" state because of the sleep.
admin@lab-dev:~$ sudo systemctl restart mux.service --no-block
admin@lab-dev:~$ systemctl status mux.service
● mux.service - MUX Cable Container
     Loaded: loaded (/etc/systemd/system/mux.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/mux.service.d
             └─auto_restart.conf
     Active: activating (start-pre) since Tue 2024-08-06 03:10:30 UTC; 6s ago
    Process: 586122 ExecStartPre=/usr/local/bin/write_standby.py -r (code=exited, status=0/SUCCESS)
    Process: 586216 ExecStartPre=/usr/local/bin/mark_dhcp_packet.py (code=exited, status=0/SUCCESS)
    Process: 586386 ExecStartPre=/usr/bin/mux.sh start (code=exited, status=0/SUCCESS)
Cntrl PID: 586439 (sleep)
      Tasks: 1 (limit: 19126)
     Memory: 152.0K
     CGroup: /system.slice/mux.service
             └─586439 /usr/bin/sleep 30

  3. Restart sonic.target and verify that linkmgrd's packet sockets become invalidated (Iface is -1).
admin@str2-7050cx3-acs-06:~$ sudo systemctl restart sonic.target
admin@str2-7050cx3-acs-06:~$ cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000e387b4b8 2      3    0003   -1    0 0      0      3905243
000000004de5cd30 2      3    0003   -1    0 0      0      3905245
0000000025b13c8e 2      3    0003   -1    0 0      0      3906908
00000000eb4e03d3 2      3    0003   -1    0 0      0      3905247
0000000040e9cf12 2      3    0003   -1    0 0      0      3906910
00000000b50da572 2      3    0003   -1    0 0      0      3905250
00000000cdd99133 2      3    0003   -1    0 0      0      3905251
00000000610eddd1 2      3    0003   -1    0 0      0      3906953
00000000bcca7a88 2      3    0003   -1    0 0      0      3905256
000000006cdb1e05 2      3    0003   -1    0 0      0      3905325
000000008cfcecea 2      3    0003   -1    0 0      0      3905326
000000007440716d 2      3    0003   -1    0 0      0      3905811
00000000656aa72e 2      3    0003   -1    0 0      0      3906984
0000000001fd475e 2      3    0003   -1    0 0      0      3905339
0000000013268c02 2      3    0003   -1    0 0      0      3905341
00000000aad1e9f6 2      3    0003   -1    0 0      0      3906986
00000000af21576f 2      3    0003   -1    0 0      0      3906988
00000000634b790f 2      3    0003   -1    0 0      0      3907626
00000000467c795f 2      3    0003   -1    0 0      0      3905360
000000005dd925d8 2      3    0003   -1    0 0      0      3907009
00000000a543c7b6 2      3    0003   -1    0 0      0      3907011
0000000061394386 2      3    0003   -1    0 0      0      3905945
0000000083b64f36 2      3    0003   -1    0 0      0      3907041
00000000a98e1f3f 2      3    0003   -1    0 0      0      3907071
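
The window that matters in the steps above is the time mux.service spends in the "activating" state. A small check like the one below could be used to confirm that the service is still in that window before restarting sonic.target. This is a sketch only: the helper names are hypothetical, and it relies on `systemctl show` printing `Key=Value` lines for the requested properties.

```python
import subprocess
from typing import Dict


def parse_systemctl_show(text: str) -> Dict[str, str]:
    """Turn the Key=Value lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def is_activating(props: Dict[str, str]) -> bool:
    """True while the unit has not finished starting (e.g. in start-pre)."""
    return props.get("ActiveState") == "activating"


def query_unit(unit: str) -> Dict[str, str]:
    """Ask systemd for the unit's current state (not called in this sketch)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_systemctl_show(result.stdout)


# Property values matching the `systemctl status` output in step 2.
sample = "ActiveState=activating\nSubState=start-pre\n"
state = parse_systemctl_show(sample)
print(state["SubState"], is_activating(state))
```

On the device, `query_unit("mux.service")` would replace the hard-coded sample.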
