[Chassis] sonic-mgmt PC suite tests test_voq_po_update and test_po_update_io_no_loss are failing #19357

Closed
saksarav-nokia opened this issue Jun 20, 2024 · 5 comments · Fixed by sonic-net/sonic-swss#3207
@saksarav-nokia
Contributor

Description

In the latest master, the Port Channel test cases test_voq_po_update and test_po_update_io_no_loss are failing, apparently because of the changes made in sonic-net/sonic-swss#3150.

When an empty LAG is created on one IMM, the oper status of the LAG changes from unknown to down, as shown in the logs below. When this oper state change notification is processed, voqSyncIntfState is called, which writes to SYSTEM_INTERFACE in chassis_db. The other ASICs in the same IMM and in the other IMMs then receive SYSTEM_LAG_TABLE and SYSTEM_INTERFACE notifications from chassis_db, as shown below. A router interface is created on the remote ASICs and holds a reference to the LAG.
But when the LAG is deleted, the local ASIC deletes only the SYSTEM_LAG_TABLE entry in chassis_db and not the SYSTEM_INTERFACE entry. When the remote ASICs receive the SYSTEM_LAG_TABLE delete, they call removeLag, and since the router interface still references the LAG, the delete fails and the test cases fail.
Does PortsOrch::updatePortOperStatus need to call gIntfsOrch->voqSyncIntfState when the oper state changes from unknown to down?
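
The failure sequence described above can be illustrated with a minimal sketch. It assumes a simplified reference-counting model of a remote ASIC; the class `RemoteAsicModel` and its methods are hypothetical names chosen for illustration and are not the actual orchagent code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model of how a remote ASIC tracks a system LAG.
// These names are illustrative only and are not the real orchagent classes.
class RemoteAsicModel {
public:
    // Models a SYSTEM_LAG_TABLE "hset": create the LAG with no references.
    void addLag(const std::string& alias) { m_lagRefCount.emplace(alias, 0); }

    // Models a SYSTEM_INTERFACE "hset": the router interface references the LAG.
    bool addRouterIntf(const std::string& alias) {
        auto it = m_lagRefCount.find(alias);
        if (it == m_lagRefCount.end()) return false;  // unknown LAG
        ++it->second;
        return true;
    }

    // Models a SYSTEM_INTERFACE delete, which never arrives in the failing case.
    // Precondition: the LAG exists and holds at least one reference.
    void removeRouterIntf(const std::string& alias) {
        auto it = m_lagRefCount.find(alias);
        assert(it != m_lagRefCount.end() && it->second > 0);
        --it->second;
    }

    // Models a SYSTEM_LAG_TABLE "del": refuse while anything references the LAG.
    bool removeLag(const std::string& alias) {
        auto it = m_lagRefCount.find(alias);
        if (it == m_lagRefCount.end() || it->second > 0) return false;
        m_lagRefCount.erase(it);
        return true;
    }

    bool hasLag(const std::string& alias) const { return m_lagRefCount.count(alias) > 0; }

private:
    std::map<std::string, int> m_lagRefCount;  // LAG alias -> reference count
};
```

In this model, calling `addLag`, then `addRouterIntf`, then `removeLag` for the same alias leaves the LAG in place, which mirrors the reported failure; `removeLag` succeeds only once `removeRouterIntf` has released the reference.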

2024 Jun 19 16:54:19.889094 ixre-egl-board7 NOTICE swss0#orchagent: :- addLag: Create an empty LAG PortChannel999 lid:2000000000c55
2024 Jun 19 16:54:19.889589 ixre-egl-board7 NOTICE swss0#orchagent: :- updatePortOperStatus: Port PortChannel999 oper state set from unknown to down
2024 Jun 19 16:54:19.889589 ixre-egl-board7 NOTICE swss0#orchagent: :- voqSyncIntfState: Syncing system interface state down for port ixre-egl-board7|asic0|PortChannel999
2024 Jun 19 16:54:19.891401 ixre-egl-board7 NOTICE swss1#orchagent: :- addLag: Create an empty LAG ixre-egl-board7|asic0|PortChannel999 lid:102000000000c0b
2024 Jun 19 16:54:19.897058 ixre-egl-board7 INFO kernel: [ 5084.526940] PortChannel999: Mode changed to "loadbalance"
2024 Jun 19 16:54:19.898792 ixre-egl-board7 NOTICE swss1#orchagent: :- addRouterIntfs: Create router interface ixre-egl-board7|asic0|PortChannel999 MTU 1492
2024 Jun 19 16:54:19.901126 ixre-egl-board7 NOTICE teamd0#teammgrd: :- addLag: Start port channel PortChannel999 with teamd
2024 Jun 19 16:54:19.904120 ixre-egl-board7 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel999 admin:1 oper:0 addr:40:7c:7d:bb:25:9d ifindex:62 master:0 type:team
2024 Jun 19 16:54:19.904969 ixre-egl-board7 NOTICE teamd0#teammgrd: :- setLagAdminStatus: Set port channel PortChannel999 admin status to up
2024 Jun 19 16:54:19.905031 ixre-egl-board7 INFO kernel: [ 5084.534338] 8021q: adding VLAN 0 to HW filter on device PortChannel999
2024 Jun 19 16:54:19.929936 ixre-egl-board7 NOTICE teamd0#teammgrd: :- setLagMtu: Set port channel PortChannel999 MTU to 9100
2024 Jun 19 16:54:19.930240 ixre-egl-board7 NOTICE teamd0#tlm_teamd: :- try_add_lag: The LAG 'PortChannel999' has been added.
2024 Jun 19 16:54:19.945668 ixre-egl-board7 NOTICE swss0#orchagent: :- updatePortOperStatus: Port PortChannel999 oper state set from down to down
2024 Jun 19 16:54:20.204105 ixre-egl-board7 NOTICE syncd1#syncd: :- addObject: Rif Counter oid:0x15100600003015 does not has supported counters

  1) "pmessage"
  2) "__keyspace@12__:*"
  3) "__keyspace@12__:SYSTEM_LAG_TABLE|ixre-egl-board7|asic0|PortChannel999"
  4) "hset"
  1) "pmessage"
  2) "__keyspace@12__:*"
  3) "__keyspace@12__:SYSTEM_INTERFACE|ixre-egl-board7|asic0|PortChannel999"
  4) "hset"
  1) "pmessage"
  2) "__keyspace@12__:*"
  3) "__keyspace@12__:SYSTEM_LAG_ID_SET"
  4) "srem"
  1) "pmessage"
  2) "__keyspace@12__:*"
  3) "__keyspace@12__:SYSTEM_LAG_ID_TABLE"
  4) "hdel"
  1) "pmessage"
  2) "__keyspace@12__:*"
  3) "__keyspace@12__:SYSTEM_LAG_TABLE|ixre-egl-board7|asic0|PortChannel999"
  4) "del"
  ^C(1057.04s)

Steps to reproduce the issue:

  1. Run tests test_voq_po_update and test_po_update_io_no_loss in chassis

Describe the results you received:

The tests are failing

Describe the results you expected:

The tests should pass


@saksarav-nokia
Contributor Author

@arlakshm for your viz

@judyjoseph
Contributor

judyjoseph commented Jun 26, 2024

@saksarav-nokia @arlakshm There was a PR recently to update the LAG operstatus when addLag is done, will this help here?
sonic-net/sonic-swss#3195

@arlakshm arlakshm self-assigned this Jun 27, 2024
@arlakshm
Contributor

arlakshm commented Jun 27, 2024

Hi @saksarav-nokia, regarding your comment below:
"But when the LAG is deleted, the local asic does not delete the SYSTEM_INTERFACE and only deletes the SYSTEM_LAG_TABLE in chassis_db, so when the remote asics receives the SYSTEM_LAG_TABLE delete, it calls removeLag and since RouterInterface is referencing this lag, the delete fails and the test cases fail."

If the LAG is deleted without first removing the IP interface on the local card, the LAG removal is expected to fail. This is the existing design and has nothing to do with the changes in sonic-net/sonic-swss#3150. Please let me know if I am missing something.
'''
Jun 27 02:24:39.701624 str2-sonic-lc3-1 NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet12 from LAG PortChannel101 lid:2000000000bef lmid:1b000000000bf1
Jun 27 02:24:48.899892 str2-sonic-lc3-1 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel101
Jun 27 02:24:48.901268 str2-sonic-lc3-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel101 admin:0 oper:0 addr:fc:bd:67:67:e4:1c ifindex:492 master:0 type:team
Jun 27 02:24:48.907396 str2-sonic-lc3-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel101 admin:0 oper:0 addr:fc:bd:67:67:e4:1c ifindex:492 master:0 type:team
Jun 27 02:24:48.911889 str2-sonic-lc3-1 NOTICE swss#orchagent: :- updatePortOperStatus: Port PortChannel101 oper state set from down to down
Jun 27 02:24:48.912355 str2-sonic-lc3-1 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel101'. Skipping
Jun 27 02:24:48.913092 str2-sonic-lc3-1 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel101' has been removed.
Jun 27 02:24:48.914067 str2-sonic-lc3-1 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel101 MTU to 9100
Jun 27 02:24:48.916115 str2-sonic-lc3-1 ERR swss#orchagent: :- removeLag: Failed to remove ref count 4 LAG PortChannel101
'''

@saksarav-nokia
Contributor Author

@arlakshm, the test test_voq_po_update doesn't configure an IP address on PortChannel999. It creates an empty LAG, verifies it in CHASSIS_DB and ASIC_DB, and then removes the LAG. The test was passing a few weeks ago and has since started failing. The test steps are:
1. On any ASIC, add a new LAG
2. Verify the added LAG gets a unique LAG ID in chassis app db
3. Verify the added LAG exists in app db
4. Verify the LAG exists in asic db on the remote and local ASICs
5. Delete the added LAG

@saksarav-nokia
Contributor Author

@arlakshm, with the following code commented out, the tests pass:
diff --git a/orchagent/portsorch.cpp b/orchagent/portsorch.cpp
index 40f79eb1..0f7e6529 100644
--- a/orchagent/portsorch.cpp
+++ b/orchagent/portsorch.cpp
@@ -8131,7 +8131,7 @@ void PortsOrch::updatePortOperStatus(Port &port, sai_port_oper_status_t status)
             SWSS_LOG_WARN("Inform nexthop operation failed for sub interface %s", child_port.c_str());
         }
     }
 
+#if 0
     if(gMySwitchType == "voq")
     {
         if (gIntfsOrch->isLocalSystemPortIntf(port.m_alias))
@@ -8139,6 +8139,7 @@ void PortsOrch::updatePortOperStatus(Port &port, sai_port_oper_status_t status)
             gIntfsOrch->voqSyncIntfState(port.m_alias, isUp);
         }
     }
+#endif

mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this issue Aug 2, 2024
… is no rif assciated with the port (sonic-net#3207)

What I did
Fixes: sonic-net/sonic-buildimage#19357

Why I did it
In the sonic-mgmt pc test suite, when an empty LAG is created, the port channel state change is synced to the remote LC even if the port channel has no router interface created. This results in a dummy router interface being created on the remote LC.

So when the empty port channel is removed on the local card, the removal fails on the remote LC because of the dummy router interface.

Add a fix to sync the port channel interface state to the remote LC only when a router interface is created on the local LC.
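
The idea behind the fix can be sketched as a guard around the state sync. This is only an illustration under stated assumptions: `LocalCardModel`, `routerIntfs`, `chassisDbIntfState` and `onOperStatusChange` are hypothetical names, and the actual change in sonic-net/sonic-swss#3207 may be structured differently.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the local line card's bookkeeping.
// These names are illustrative and do not come from orchagent.
struct LocalCardModel {
    std::set<std::string> routerIntfs;               // ports with a local router interface
    std::map<std::string, bool> chassisDbIntfState;  // models SYSTEM_INTERFACE entries (alias -> isUp)

    bool hasRouterIntf(const std::string& alias) const {
        return routerIntfs.count(alias) > 0;
    }

    // Called on an oper status change. Returns true if the state was synced.
    bool onOperStatusChange(const std::string& alias, bool isUp) {
        assert(!alias.empty());
        // Guard: an empty LAG has no router interface, so nothing is written
        // to the shared database and no dummy router interface appears remotely.
        if (!hasRouterIntf(alias)) return false;
        chassisDbIntfState[alias] = isUp;
        return true;
    }
};
```

With this guard, an oper status change on an empty PortChannel999 writes nothing to the modelled SYSTEM_INTERFACE table, while a port channel that does have a local router interface still has its state synced.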
mssonicbld pushed a commit to sonic-net/sonic-swss that referenced this issue Aug 3, 2024
… is no rif assciated with the port (#3207)

shiraez pushed a commit to Marvell-switching/sonic-swss that referenced this issue Feb 17, 2025
… is no rif assciated with the port (sonic-net#3207)

3 participants