[SmartSwitch] Add tests for reboot of a smart switch #16566

Merged: 15 commits merged into sonic-net:master on Apr 10, 2025

Conversation

@vvolam
Contributor

commented Jan 17, 2025

Description of PR

Summary: Add sonic-mgmt tests for reboot of a smart switch and individual DPUs
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Supporting different types of reboot for a smart switch

How did you do it?

  • Extend the existing reboot() method so it can also reboot the DPUs of a smart switch.
  • Add a test case that reboots all the DPUs individually (see the sketch below).
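For illustration, a minimal sketch of the shape of such a test; the helper name (get_dpu_names) and the exact reboot CLI are assumptions, not the literal code added by this PR:

```python
# Hypothetical sketch only: helper names and CLI invocations below are
# illustrative assumptions, not the literal code added by this PR.
import logging

logger = logging.getLogger(__name__)


def get_dpu_names(duthost):
    """List DPU module names reported by the smart switch NPU."""
    lines = duthost.shell("show chassis modules status")["stdout_lines"]
    return [line.split()[0] for line in lines if line.startswith("DPU")]


def test_reboot_each_dpu(duthost, localhost):
    """Reboot every DPU, one at a time, and expect the switch to survive."""
    for dpu in get_dpu_names(duthost):
        logger.info("Rebooting %s", dpu)
        # Assumed CLI form for rebooting a single DPU module.
        duthost.shell("sudo reboot -d {}".format(dpu))
        # A real test would wait here for the DPU to come back online
        # and assert its oper status before moving to the next one.
```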

How did you verify/test it?

Verified on an NVIDIA 4280 smart switch.

Any platform specific information?

Smartswitch topology

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@nissampa
Contributor

lgtm

@oleksandrivantsiv
Contributor

@congh-nvidia, @JibinBao please review

@vvolam vvolam requested a review from nissampa April 2, 2025 03:02
@vvolam vvolam requested a review from theasianpianist April 2, 2025 03:02

@prabhataravind
Contributor

left a comment

LGTM overall. Please test on all smartswitch vendors.

@rlhui rlhui merged commit 91ddf8e into sonic-net:master Apr 10, 2025
18 checks passed
kamalsahu0001 added a commit to kamalsahu0001/sonic-mgmt that referenced this pull request Apr 22, 2025
* Update snappi_fixtures.py

updated to incorporate new snappi build changes

* Update traffic_generation.py

updated for new snappi build changes

* Update traffic_generation.py

updated capture code

* Add test to verify db_migrator with DNS_NAMESERVER (#17639)

Approach
What is the motivation for this PR?
There's a test gap: we don't have a test to verify db_migrator

How did you do it?
This test modifies CONFIG_DB, runs db_migrator, and verifies that DNS_NAMESERVER comes from minigraph or golden config.

test_migrate_dns_02: there's minigraph.xml and dns.j2, and there's no golden config. After migration, DNS_NAMESERVER is present in CONFIG_DB, because db_migrator can migrate from minigraph.
test_migrate_dns_03 is used to reproduce the SonicQosProfile issue: there's minigraph.xml and dns.j2, I added SonicQosProfile in minigraph.xml, and there's no golden config. After migration, there's no DNS_NAMESERVER in CONFIG_DB, because db_migrator can't migrate from minigraph.
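
As a rough illustration of the verification step described above (the command forms are assumptions, not quotes from the test):

```python
# Hypothetical sketch: run the migrator, then assert DNS_NAMESERVER
# landed in CONFIG_DB (command forms assumed for illustration).
duthost.shell("sudo /usr/local/bin/db_migrator.py -o migrate")
keys = duthost.shell("sonic-db-cli CONFIG_DB keys 'DNS_NAMESERVER|*'")["stdout_lines"]
assert keys, "DNS_NAMESERVER was not migrated into CONFIG_DB"
```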
How did you verify/test it?
Run end to end test

* Fix pfcwd/test_pfcwd_function.py for dualtor topologies (#17833)

What is the motivation for this PR?
pfcwd/test_pfcwd_function.py::TestPfcwdFunc::test_pfcwd_actions is flaky and fails with the following signature.

======================================================================
FAIL: pfc_wd.PfcWdTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ptftests/py3/pfc_wd.py", line 148, in runTest
    return verify_packet_any_port(self, masked_exp_pkt, dst_port_list)
  File "/root/env-python3/lib/python3.7/site-packages/ptf/testutils.py", line 3437, in verify_packet_any_port
    % (result.port, device_number, ports, result.format())
AssertionError: Received expected packet on port 1 for device 0, but it should have arrived on one of these ports: [23].
========== RECEIVED ==========
0000  82 FD E1 7F 90 01 00 AA BB CC DD EE 08 00 45 0D  ..............E.
0010  00 56 00 01 00 00 3F 06 1B DF 64 5B 3A B0 C0 A8  .V....?...d[:...
0020  00 02 EA F5 27 6F 00 00 00 00 00 00 00 00 50 02  ....'o........P.
0030  20 00 21 87 00 00 00 01 02 03 04 05 06 07 08 09   .!.............
0040  0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19  ................
0050  1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29  ...... !"#$%&'()
0060  2A 2B 2C 2D                                      *+,-
==============================
How did you do it?
The test randomly selects a dst_port but always assigns the IP 192.168.0.2 to it. In dualtor topologies there is a notion of static/fixed IP addresses on the ToR's side:

admin@ld301:~$ show mux config
SWITCH_NAME    PEER_TOR
-------------  ----------
ld302          10.1.0.33
port        state    ipv4             ipv6
----------  -------  ---------------  -----------------
Ethernet4   auto     192.168.0.2/32   fc02:1000::2/128
Ethernet8   auto     192.168.0.3/32   fc02:1000::3/128
Ethernet12  auto     192.168.0.4/32   fc02:1000::4/128
Ethernet16  auto     192.168.0.5/32   fc02:1000::5/128
Ethernet20  auto     192.168.0.6/32   fc02:1000::6/128
Ethernet24  auto     192.168.0.7/32   fc02:1000::7/128
Ethernet28  auto     192.168.0.8/32   fc02:1000::8/128
Ethernet32  auto     192.168.0.9/32   fc02:1000::9/128
Ethernet36  auto     192.168.0.10/32  fc02:1000::a/128
Ethernet40  auto     192.168.0.11/32  fc02:1000::b/128
Ethernet44  auto     192.168.0.12/32  fc02:1000::c/128
Ethernet48  auto     192.168.0.13/32  fc02:1000::d/128
Ethernet52  auto     192.168.0.14/32  fc02:1000::e/128
Ethernet56  auto     192.168.0.15/32  fc02:1000::f/128
Ethernet60  auto     192.168.0.16/32  fc02:1000::10/128
Ethernet64  auto     192.168.0.17/32  fc02:1000::11/128
Ethernet68  auto     192.168.0.18/32  fc02:1000::12/128
Ethernet72  auto     192.168.0.19/32  fc02:1000::13/128
Ethernet76  auto     192.168.0.20/32  fc02:1000::14/128
Ethernet80  auto     192.168.0.21/32  fc02:1000::15/128
Ethernet84  auto     192.168.0.22/32  fc02:1000::16/128
Ethernet88  auto     192.168.0.23/32  fc02:1000::17/128
Ethernet92  auto     192.168.0.24/32  fc02:1000::18/128
Ethernet96  auto     192.168.0.25/32  fc02:1000::19/128
Due to this, the packet sometimes ends up being forwarded to Ethernet4 (port 1) instead of the port expected by the test.

The proposed fix is that, for dualtor alone, the destination IP is chosen according to the mux config for the interface selected as the dst_port.
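
A minimal sketch of that selection, assuming the mux entries come from the DUT's MUX_CABLE config (per the 'show mux config' output above); not the literal diff:

```python
# Hypothetical sketch: on dualtor, pick the destination IP from the mux
# config of the chosen dst_port instead of hardcoding 192.168.0.2.
def select_dst_ip(duthost, dst_port_iface, is_dualtor, default_ip="192.168.0.2"):
    if not is_dualtor:
        return default_ip
    mux_cfg = duthost.get_running_config_facts().get("MUX_CABLE", {})
    entry = mux_cfg.get(dst_port_iface)
    # Entries carry the fixed server IP, e.g. {"server_ipv4": "192.168.0.2/32"}.
    return entry["server_ipv4"].split("/")[0] if entry else default_ip
```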

How did you verify/test it?
Ran all pfcwd tests on Arista-7260CX3 with dualtor-120 topology.

* Refine baseline pipeline yml and fix error (#17499)

What is the motivation for this PR?
Baseline testplan names are different from those of PR testing, but it's better to let them share the same name, which will make Kusto queries easier.
t0-sonic test didn't pass VM_TYPE to the elastictest template, which caused t0-sonic deploy failures.
t0-sonic and dpu tests lost specific params.

How did you do it?
Refine the baseline pipeline yml so that the testplan name has the same build reason as PR tests.
Pass VM_TYPE to the elastictest template.
Add the specific params for the t0-sonic and dpu tests.

* Choose correct vlan ip for 2vlan config in advance_reboot (#17831)

What is the motivation for this PR?
There are 2 Vlans on the t0-118 topology. We observe that the ptftest launched from upgrade_path tests defaults to using the 192.169.0.0/22 IP for Vlan1000, and the test fails with "DUT is not ready" because packets sent by the PTF get no response from the DUT.

However, by switching to 192.168.0.0/25 for Vlan2000, upgrade_path no longer fails on "DUT is not ready" and is able to pass a normal warm upgrade.

How did you do it?
Call the common helper functions get_vlan_interface_list and get_vlan_interface_info to get the vlan interface and its ipv4 address.

How did you verify/test it?
Run platform_tests.test_advanced_reboot on T0 testbeds.

Any platform specific information?
T0 platforms

* skip dynamic_acl on platform x86_64-8101_32fh_o_c01-r0 (#17848)

* refactor: optimize mgmt ipv6 only test (#17851)

Description of PR
Optimize the ip/test_mgmt_ipv6_only.py test module with Python multithreading.

Summary:
Fixes # (issue) Microsoft ADO 30056122

Approach
What is the motivation for this PR?
The ip/test_mgmt_ipv6_only.py test takes a long time to finish on a multi-DUT device, for example ~100 min on a T2 device, so we wanted to optimize it with Python multithreading to reduce the running time (see the sketch below).
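
A rough sketch of what the multithreading can look like (stdlib concurrent.futures shown; whether the test uses this or sonic-mgmt's own parallel helpers is an assumption):

```python
# Hypothetical sketch: run the per-DUT verification concurrently instead
# of sequentially across a multi-DUT testbed.
from concurrent.futures import ThreadPoolExecutor


def run_on_all_duts(duthosts, check_fn, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion so exceptions raised in workers propagate.
        return list(pool.map(check_fn, duthosts))
```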

How did you do it?
How did you verify/test it?
I ran the updated code on a multi-DUT device and verified that the running time was reduced to ~50 min: Elastictest link

Besides, I also verified the change on T0 and dualtor:

T0: https://elastictest.org/scheduler/testplan/67f05c6787ffab7db692a20b?testcase=ip%2Ftest_mgmt_ipv6_only.py&type=console&leftSideViewMode=detail
dualtor: https://elastictest.org/scheduler/testplan/67f05c8d40a6f1f300f5363e?leftSideViewMode=detail&testcase=ip%2Ftest_mgmt_ipv6_only.py&type=console

co-authorized by: [email protected]

* feat: support trimming lab inv file (#17348)

Description of PR
Support trimming the inventory files such as ansible/lab, ansible/t2_lab etc when passing --trim_inv option.

Summary:
Fixes # (issue) Microsoft ADO 30056122

Approach
What is the motivation for this PR?
When we enable inventory trimming by passing the --trim_inv option, the current logic only trims the ansible/veos file. We noticed that the other inventory files (such as ansible/lab) should also be trimmed: they contain the configs of all the devices in that lab, while we only need the configs related to the current test run. Therefore, we decided to support trimming these inventory files as well.

Please note that the PDU & Fanout hosts trimming is not supported in this PR as it's currently blocked by #17347

How did you do it?
How did you verify/test it?
I ran the new trimming logic on various lab files and can confirm it's working well:

https://elastictest.org/scheduler/testplan/67c7ad505048655bf9cf8a58
https://elastictest.org/scheduler/testplan/67c78be48dcac0cdc64a3998
https://elastictest.org/scheduler/testplan/67c78cc7f60a7a79ff1ae585
https://elastictest.org/scheduler/testplan/67c78c9c8dcac0cdc64a399c
https://elastictest.org/scheduler/testplan/67c7b419d0bae94c81d8a9d6
https://elastictest.org/scheduler/testplan/67ca846a5048655bf9cf8f7b
Any platform specific information?

co-authorized by: [email protected]

* Add multi-asic support for test-intf-fec (#17814)

Description of PR
Summary: Add multi-ASIC support for test-intf-fec. This is possible with the utility command update in sonic-net/sonic-utilities#3819
Fixes # (issue) 28838870

Approach
What is the motivation for this PR?
Described

How did you do it?
Update the command from sonic-net/sonic-utilities#3819 and update the code base so that it works with T2, for 202405.

Please note that for a release branch to work internally, the following PRs need to be included:

#17183
#14661
#16424
#15481

How did you verify/test it?
T2 platform verified

Signed-off-by: Austin Pham <[email protected]>

---------

Signed-off-by: Austin Pham <[email protected]>

* warm boot to config save before reboot (#17849)

* [KubeSonic] Add gnmi to container_upgrade (#17796)

Approach
What is the motivation for this PR?
We need to verify gnmi feature after container upgrade

How did you do it?
Add gnmi and gnmi_watchdog to container upgrade

How did you verify/test it?
Run container upgrade pipeline

* Update pfcwd_multi_node_helper.py

updated to support new snappi model

* [performance_meter] add swss create time criteria (#17740)

What is the motivation for this PR?
Need a check for the time spent in swss create switch

How did you do it?
Add a new success criterion checking for the occurrence of swss create switch start and end

How did you verify/test it?
Run test on 7215 devices

* [mcx] fix bug with mcx deployment script (#17841)

What is the motivation for this PR?
Fix a non-working mcx deployment script.

How did you do it?
Fix iteritems
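
(For context, dict.iteritems() exists only in Python 2; a toy sketch of the standard Python 3 replacement:)

```python
mapping = {"port": "Ethernet0"}
# Python 2 spelled this: for k, v in mapping.iteritems():
for k, v in mapping.items():  # Python 3 replacement
    print(k, v)
```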

How did you verify/test it?
Deploy mcx with new script

* [port_util] Add port alias-to-name mapping for Arista-7050CX3-32S-S128 (#17877)

What is the motivation for this PR?
Add port alias-to-name mapping for Arista-7050CX3-32S-S128

How did you do it?
Update port_utils.py.

How did you verify/test it?
Verified by deploying a testbed.

* Update pfcwd_runtime_traffic_helper.py

updated file to accommodate new snappi changes.

* Update pfcwd_burst_storm_helper.py

updated file to accommodate new snappi changes

* Update pfcwd_basic_helper.py

updated files to accommodate snappi changes

* [dualtor] update template to latest (#17879)

What is the motivation for this PR?
The old template is not up to date and does not match the changes in vm_topo results. Update it so the generated minigraph works.

How did you do it?
Copy the section from minigraph_dpg.j2

How did you verify/test it?
Run yang validation on generated minigraph.

* Fixed swss feature name for test_lldp_neighbor_post_orchagent_reboot (#15715)

What is the motivation for this PR?
The test test_lldp_neighbor_post_orchagent_reboot fails on multi-asic systems. The test tries to disable the autorestart feature for swss by using the namespace container name, e.g., swss0, swss1, etc.

For 'config feature autorestart disable', it needs to use 'swss' as the global feature name.

How did you do it?
Changed code to use 'swss' as feature name without using namespace id

How did you verify/test it?
Ran sonic-mgmt test_lldp.py

---------

Signed-off-by: Anand Mehra [email protected]

* Add a fixture to enable nat for dpus (#17753)

1. Enable nat for dpus on smartswitch

* Ignore subnet decap test when no portchannels found (#17810)

What is the motivation for this PR?
Solve IndexError: list index out of range in dut_port = list(mg_facts['minigraph_portchannels'].keys())[0] because minigraph_portchannels is empty.

How did you do it?
This checks if any portchannels exist before attempting to access them, preventing the IndexError.
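
A minimal sketch of that guard, assuming pytest.skip is the mechanism (consistent with the SKIPPED summary below):

```python
import pytest

# mg_facts comes from the test's minigraph facts, as in the quoted line.
portchannels = list(mg_facts["minigraph_portchannels"].keys())
if not portchannels:
    pytest.skip("No portchannels found in minigraph")
dut_port = portchannels[0]
```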

How did you verify/test it?
========== short test summary info ==========
SKIPPED [4] decap/test_subnet_decap.py:207: No portchannels found in minigraph
========== 4 skipped, 1 warning in 797.40s (0:13:17) ==========
Any platform specific information?
str4-sn5600-1

* [sonic-mgmt][dualtor-aa] Fix flakiness of fdb/test_fdb_mac_learning.py (#17873)

What is the motivation for this PR?
After link bringup, it takes some time for the mux status to become consistent in the dualtor-aa topology (i.e. SERVER_STATUS is 'unknown'). It's not a test-specific issue; similar behaviour is seen on DUTs where dualtor-aa is deployed.

How did you do it?
Increase the timeout to 300 secs (currently it's 150 secs) to fix the flakiness.

* Increase timeout to 5 in verify_packet_any_port for background traffic (#17904)

What is the motivation for this PR?
The test is giving us a false negative

msg        = 'Did not receive expected packet on any of ports [7, 13, 17, 30, 27, 25, 5, 34, 21, 16, 24, 1, 33, 12, 4, 20, 2, 0, 11... 01  .............0..\n0050  00 AA BB CC DD EE                                ......\n==============================\n'
self       = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>

/usr/lib/python3.8/unittest/case.py:753: AssertionError
Although on a closer look we found that the DUT is forwarding the packet within a reasonable duration of time, for some reason testutils.verify_packet_any_port takes longer to detect it.

There is also another issue which doesn't cause any failure but defeats the purpose of testing. In case of active-active dualtor we call setup_standby_ports_on_rand_unselected_tor_unconditionally to put the system in active-standby mode. If this is called after background_traffic, then the background traffic flows through the unselected ToR, which is not desired.

How did you do it?
Increase the timeout to 5s from system default for testutils.verify_packet_any_port
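
For reference, ptf's testutils.verify_packet_any_port takes a timeout argument, so the first change presumably has this shape (variable names taken from the traceback above; ptfadapter is the sonic-mgmt test adapter fixture):

```python
from ptf import testutils

# Explicit 5s timeout instead of ptf's system default.
result = testutils.verify_packet_any_port(
    ptfadapter, masked_exp_pkt, dst_port_list, timeout=5
)
```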

Make the order of fixture execution deterministic so that setup_standby_ports_on_rand_unselected_tor_unconditionally is called before background_traffic

How did you verify/test it?
Verified on Arista-7050CX3 with dualtor-aa topology.

* Disable all bmp table after test to avoid potential impact to other test cases. (#17910)

Disable all bmp table after test to avoid potential impact to other test cases

Description of PR
Work item tracking
Microsoft ADO (number only):32206168

Approach
What is the motivation for this PR?
Disable all bmp table after test to avoid potential impact to other test cases

How did you do it?
Disable all relevant bmp tables via config cli after each test.

How did you verify/test it?
kvm test verified.

Any platform specific information?

* Make lossyqueuevoq check platform/hwskus. (#17726)

* Configure macsec rekey period on EOS hosts (#17811)

What is the motivation for this PR?
Macsec::TestControlPlane::test_rekey_by_period tests fail when EOS is selected as the key-server.
How did you do it?
If the rekey-period is non-zero, we configure the rekey period on the EOS host.
How did you verify/test it?
Sonic-mgmt Macsec::TestControlPlane::test_rekey_by_period tests are passing with the above change.

* [M1] Add doc for M1 topology announce routes (#17905)

Summary:
Add doc for M1 topology announce routes.

* [SmartSwitch] Add tests for reboot of a smart switch (#16566)

Add sonic-mgmt tests for reboot of a smart switch and individual DPUs

* Rewrite platform_tests/broadcom/test_ser.py (#17381)

* Rewrite ser test

Rewrite the SER injection test to use the internal broadcom command
instead of doing the SER injection manually.

Skipping for TH5 skus as they do not have this functionality at the moment.

* Rewrite ser test: PR edits

Use "stdout_lines" instead of "stdout" for ser output parsing and
adjust Arista-7060X6 conditions to include Github issue

* [dhcp_relay] Optimize log for test_dhcp_relay (#17906)

What is the motivation for this PR?
Add log for test_dhcp_relay for triaging issue

How did you do it?
Add log for test_dhcp_relay for triaging issue

How did you verify/test it?
Run the test and find the log files below

* Revert "[dhcp_relay] Remove test_dhcp_relay test in t0-2vlans (#17208)" (#17676)

This reverts commit 1762bc28f8ccdbde3cedd83ceb2f76204b2f2e17.

* Skip test_reload_configuration_checks on Cisco platform (#17868)

* Skip test_reload_configuration_checks on Cisco platform

* Revise

* update d18u8s4 PT0 ASN to 4 bytes (#17888)

What is the motivation for this PR?
Fix topo error.

How did you do it?
How did you verify/test it?
admin@sonic:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 3, using 672 bytes of memory
Peers 12, using 8903712 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4       64600          0          0         0      0       0  never      Active          ARISTA01T1
10.0.0.59      4       64600          0          0         0      0       0  never      Active          ARISTA02T1
10.0.0.61      4       64600          0          0         0      0       0  never      Active          ARISTA03T1
10.0.0.63      4       64600          0          0         0      0       0  never      Active          ARISTA04T1
10.0.0.65      4       64600          0          0         0      0       0  never      Active          ARISTA05T1
10.0.0.67      4       64600          0          0         0      0       0  never      Active          ARISTA06T1
10.0.0.69      4       64600          0          0         0      0       0  never      Active          ARISTA07T1
10.0.0.71      4       64600          0          0         0      0       0  never      Active          ARISTA08T1
10.0.0.157     4  4200000000          0          0         0      0       0  never      Active          ARISTA01PT0
10.0.0.159     4  4200000001          0          0         0      0       0  never      Active          ARISTA02PT0
10.0.0.161     4  4200000002          0          0         0      0       0  never      Active          ARISTA03PT0
10.0.0.163     4  4200000003          0          0         0      0       0  never      Active          ARISTA04PT0

Total number of neighbors 12
admin@sonic:~$ show ipv6 bgp summary

IPv6 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 3, using 672 bytes of memory
Peers 12, using 8903712 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00::7a       4       64600          0          0         0      0       0  never      Active          ARISTA03T1
fc00::7e       4       64600          0          0         0      0       0  never      Active          ARISTA04T1
fc00::8a       4       64600          0          0         0      0       0  never      Active          ARISTA07T1
fc00::8e       4       64600          0          0         0      0       0  never      Active          ARISTA08T1
fc00::17a      4  4200000002          0          0         0      0       0  never      Active          ARISTA03PT0
fc00::17e      4  4200000003          0          0         0      0       0  never      Active          ARISTA04PT0
fc00::72       4       64600          0          0         0      0       0  never      Active          ARISTA01T1
fc00::76       4       64600          0          0         0      0       0  never      Active          ARISTA02T1
fc00::82       4       64600          0          0         0      0       0  never      Active          ARISTA05T1
fc00::86       4       64600          0          0         0      0       0  never      Active          ARISTA06T1
fc00::172      4  4200000000          0          0         0      0       0  never      Active          ARISTA01PT0
fc00::176      4  4200000001          0          0         0      0       0  never      Active          ARISTA02PT0

Total number of neighbors 12
admin@sonic:~$

* [dualtor_io] Add test_tor_switchover_impact test (#15262)

* [dualtor_io] Add test_tor_switchover_impact test

Test will send traffic from T1 -> server and perform switchover. It will then
collect the logs and process the results to test_tor_switchover_impact.json

Any disruptions that break the threshold will cause test failure.

Signed-off-by: Nikola Dancejic <[email protected]>

* [test_switchover_impact] Moved to new file and refactored

Steps:
1. set up ipv4 and ipv6 neighbors. default 10 ipv4 and 64 ipv6.
2. set dut to active.
3. start traffic test.
4. switch interface to standby.
5. record and validate results.

By default the test runs 100 iterations, taking around 3 hours. The test
will fail if one of the following conditions occurs:
- Traffic drop exceeds the threshold (100ms for planned, 400ms for unplanned).
- Switchover metrics on at least one of the duts do not match the measured traffic impact within the threshold (100ms for planned, 400ms for unplanned).
- Metrics on either device are not present.
- There are multiple disruptions during a single switchover.

Signed-off-by: Nikola Dancejic <[email protected]>

* Update tests_mark_conditions.yaml

switchover impact test takes hours to complete, skip until we set up a way to make it run weekly

* Update tests_mark_conditions.yaml

fixing order of conditions for switchover_impact

---------

Signed-off-by: Nikola Dancejic <[email protected]>

* Fix srv6/test_srv6_dataplane.py (#17896)

Fix srv6/test_srv6_dataplane.py

* Fix pl test to handle outbound_direction_lookup (#17764)

* Fix pl test to handle outbound_direction_lookup #17764
* Default mac for direction lookup is src_mac, so outbound_direction_lookup needs to be explicitly set to "dst_mac"

* Only print the matched syslog in loganalyzer teardown check, no traceback info printed (#17926)

What is the motivation for this PR?
To make the failed summary of the loganalyzer teardown check shorter and clearer. It makes the summary easy to understand, and downstream failure analyzers can do analysis based on clean summaries.

The summary when a case fails in the loganalyzer teardown phase:
Before change:

E               Failed: Processes "['analyze_logs--<MultiAsicSonicHost str-msn4700-02>']" failed with exit code "1"
E               Exception:
E               match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
E               2025 Apr  9 02:42:13.609855 str-msn4700-02 ERR kernel: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98).
        
E               Traceback:
E               Traceback (most recent call last):
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/helpers/parallel.py", line 35, in run
E                   Process.run(self)
E                 File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
E                   self._target(*self._args, **self._kwargs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/helpers/parallel.py", line 245, in wrapper
E                   target(*args, **kwargs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/__init__.py", line 45, in analyze_logs
E                   dut_analyzer.analyze(markers[node.hostname], fail_test, store_la_logs=store_la_logs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/loganalyzer.py", line 409, in analyze
E                   self._verify_log(analyzer_summary)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/loganalyzer.py", line 140, in _verify_log
E                   raise LogAnalyzerError(result_str)
E               tests.common.plugins.loganalyzer.loganalyzer.LogAnalyzerError: match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
DEBUG:tests.conftest:[log_custom_msg] item: <Function test_bfd_multihop[ipv6]>
INFO:root:Can not get Allure report URL. Please check logs
E               2025 Apr  9 02:42:13.609855 str-msn4700-02 ERR kernel: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98).
After change:

E               Failed: Got matched syslog in processes "analyze_logs--<MultiAsicSonicHost bjw2-can-7260-10>" exit code:"1"
E               match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
E               2025 Apr 10 08:21:06.808698 bjw2-can-7260-10 ERR admin: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98)
How did you do it?
Check if the failed process is analyze_log and whether there are Match Messages in the exception; if so, print only the exception. There's no need to print the traceback, which keeps the summary shorter and clearer.

How did you verify/test it?
Run a case failed in loganalyzer teardown phase, check the summary of the failed case
Signed-off-by: Zhaohui Sun <[email protected]>

* [dualtor_io] Allow duplications for link down downstream I/O (#17909)

What is the motivation for this PR?
The following two link failure cases are failing on Cisco/MLNX:

test_active_link_down_downstream_active
test_active_link_down_downstream_active_soc
The reason is that, after link down, between the fdb flush and tunnel route add (due to mux toggle-to-standby), the ASIC has no l2 information for server/soc neighbors, downstream traffic will flood to all vlan member ports on Cisco/MLNX platform.
Those two testcases have no tolerance for packet duplications because, on the Broadcom platform, traffic to neighbors with no L2 information is simply dropped.

Let's adapt to the Cisco/MLNX platforms by allowing packet duplications for those two testcases.

How did you verify/test it?
dualtor_io/test_link_failure.py::test_active_link_down_downstream_active[active-active] PASSED [100%]
dualtor_io/test_link_failure.py::test_active_link_down_downstream_active_soc[active-active] PASSED [100%]

Signed-off-by: Longxiang Lyu <[email protected]>

* Revert "Skip test_vnet_decap on Cisco-8000 with 202411 (#17776)" (#17941)

This reverts commit 02ff5e8cb77feb4201b003d70e472f17b93d9db5.

* Support Ubuntu 24.04 server in KVM (#17883)

What is the motivation for this PR?
When running with an Ubuntu 24.04 host, many tasks from a sonic-mgmt container may fail.
One reason is that the newer version of Python used in Ubuntu 24.04 doesn't support some of the packages that ansible modules attempt to import. For example, the imp module was removed in Python 3.12, so importlib needs to be used instead; PR #16039 did some replacement, but there are still places left to replace.
Another reason is that some ansible modules must run in a python3 venv for security reasons.
Also, there are some access issues to fix.

How did you do it?
Use importlib instead of importing imp (see the sketch after this list)
Install python3-venv in Ubuntu 24.04 server
Update libvirt qemu configuration in Ubuntu 24.04 server
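
(A sketch of the standard imp-to-importlib migration for loading a module from a file path, as referenced in the first item above:)

```python
# imp.load_source("my_module", path) was removed with imp in Python 3.12;
# this importlib recipe is the documented replacement.
import importlib.util

spec = importlib.util.spec_from_file_location("my_module", "/path/to/my_module.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```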
* fix

* remove

---------

Co-authored-by: xwjiang2021 <[email protected]>

* add hwsku V64 (#17897)

* Default the inner dscp to outer dscp map to be 1-1. (#17860)

* Add dualtor fixtures to no_traffic test. (#17916)

* add template t0-isolated-d96u32s2-leaf.j2 and Arista-7060X6-16PE-384C-O128S2 into th5 hwsku  (#17898)

add moby hwsku into sonic-mgmt

Description of PR
Summary:
Fix the template for the d96u32s2 topo. Also add the missing hwsku to the list.

port_utils.py change is not included as the final port layout may still change, pending on sonic-net/sonic-buildimage#22086

* Restore config after vxlan_crm from vxlan_ecmp. (#17767)

What is the motivation for this PR?
test_vxlan_crm.py was recently enabled for smartswitch; it uses the same setup fixture as test_vxlan_ecmp.py. However, the tests in test_vxlan_ecmp.py have additional cleanup code to handle these BFD entries, and that cleanup is missing in test_vxlan_crm.py.

How did you do it?
Added a config restore fixture, same as in #17046, to test_vxlan_ecmp.py which is re-used by test_vxlan_crm.py

How did you verify/test it?
Ran test_cacl_application.py after running test_vxlan_crm.py and it's now passing with the fix in place. Also checked that iptable is cleaned up after running test_vxlan_crm.py.

* Fix loganalyzer regex to ignore chronyd related ERR logs (#17880)

What is the motivation for this PR?
PR #16239 ignored an error message coming from ntpd (actually written by nss_tacplus). As that PR explains, this can happen when a config reload is run, and chrony calls getpwnam when /etc/tacplus_nss.conf is still being regenerated. However, it is not a failure.

How did you do it?
With the change of NTP daemons, this loganalyzer ignore regex needs to be updated to match the right chrony log.

How did you verify/test it?
By running sonic-mgmt tests that perform config reload and verifying that these errors do not cause loganalyzer to mark the tests as failed.

Signed-off-by: Prabhat Aravind <[email protected]>

* AddCluster Test (#17744)

Approach
What is the motivation for this PR?
Validate Add Cluster Incremental Config Update.

How did you do it?
Remove one of downstream T1 from minigraph
Load updated minigraph
Apply AddCluster json patch
Check Port Status, BGP etc.
How did you verify/test it?
Ran on a Nokia-IXR7250E-36x400G chassis linecard.

Any platform specific information?
This test is only verified on Nokia-IXR7250E-36x400G; for other SKUs, we have not verified it yet.

Supported testbed topology if it's a new test case?
T2 only, since the add-cluster scenario is only applicable to T2.

* Add test_srv6_uN_no_vlan_flooding test case (#17861)

Summary: This is to add a test case to verify that after enabling proxy_arp, there is no L2 flooding in downstream VLAN even though the switch does not have a FDB entry to forward a downstream SRv6 packet.

* Revert "Revert "[dhcp_relay] Remove test_dhcp_relay test in t0-2vlans (#17208)" (#17676)" (#17946)

This reverts commit 53e58628642db1dacc0c5374f63e3a4246c7c86c.

* Revert "Disable all bmp table after test to avoid potential impact to other t…" (#17969)

Reverts #17910

* [M1/M2/M3] Skip test_null_route_helper on M1/M2/M3 topo (#17962)

* [pretest] update collect_dut_lossless_prio method (#17907)

Fix method collect_dut_lossless_prio.
The pfc_enable parameter can be empty.
Without this change we can get error:

>   result = [int(x) for x in port_qos_map[intf]['pfc_enable'].split(',')]
E   ValueError: invalid literal for int() with base 10: ''
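
A minimal sketch of the guard this implies (the exact diff isn't quoted in the message):

```python
# Filter out empty strings so int('') can't raise ValueError when
# pfc_enable is empty.
pfc_enable = port_qos_map[intf].get('pfc_enable', '')
result = [int(x) for x in pfc_enable.split(',') if x]
```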
PRs with related changes:
sonic-net/sonic-buildimage#22252
sonic-net/sonic-buildimage#22067

Signed-off-by: AntonHryshchuk <[email protected]>

* [Mellanox] Update qos sai test for SN56xx buffer configuration alignment (#17788)

Update QoS sai test according to the PR: sonic-net/sonic-buildimage#22067

* Increase timeout for gathering sonic hosts facts (#17945)

What is the motivation for this PR?
Gathering facts runs commands on sonic hosts and gets the platform information from the results. But in dualtor, hosts may need more time to return the results, which caused pretest failures.

How did you do it?
Increase the timeout in gathering facts and check whether the success rate increases over a week.

* Enable bmp table dump before all test cases (#17963)

What is the motivation for this PR?
Enable bmp table dump before all test cases as pre-condition

How did you do it?
Update ansible/config_sonic_basedon_testbed.yml and add one block.

How did you verify/test it?
validation will cover it.

* [GCU] Fixing argument passed to format_json function for multi-asic case (#17096)

* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic
* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic - ip bgp suite
* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic - portchannel suite

* Fix BFD status check and ipv6 PTF intermittent issue. (#17819)

* Fix BFD status check and ipv6 PTF intermittent issue.

* Add RFC comment.

* Skip nasa proc check because the usage of nasa is always high and that is expected (#17681)

* Fix test_radv_ipv6_ra (#17275)

Signed-off-by: Ze Gan <[email protected]>

* Improve arp_responder.py performance (#17280)

* Optimize performance of arp_responder by reusing sockets

Instead of having scapy.sendp create a socket, bind it, send the packet,
and then close the socket every time, just reuse the listening socket
that's been created.
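
A sketch of the socket-reuse idea using scapy's conf.L2socket (the actual arp_responder code isn't quoted here):

```python
# Hypothetical sketch: open one L2 socket per interface at startup and
# reuse it for every reply, instead of scapy.sendp() opening, binding,
# and closing a fresh socket per packet.
from scapy.all import conf

send_sock = conf.L2socket(iface="eth0")  # created once


def send_reply(pkt):
    send_sock.send(pkt)  # reuses the already-bound socket
```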

Signed-off-by: Saikrishna Arcot <[email protected]>

* Use libpcap backend for scapy in arp_responder

Use the libpcap backend in arp_responder, which sets up a ring buffer
for incoming packets. This is expected to be more efficient
than making many recvmsg syscalls, especially during ARP/NDP floods.

Signed-off-by: Saikrishna Arcot <[email protected]>

* Enhance BPF filter for arp_responder

Enhance the BPF filter for arp_responder to only get ARP/ICMP packets
for the IPv4/IPv6 addresses that it is interested in. This is especially
useful in large VLANs with many hosts.

Signed-off-by: Saikrishna Arcot <[email protected]>

---------

Signed-off-by: Saikrishna Arcot <[email protected]>

* Switch from tcpdump to dumpcap (#17276)

* Switch from tcpdump to dumpcap

This is (hopefully) more efficient on the system by being a single
process (but multiple threads for each interface), and also stores
information about which interface each packet was seen on. This also
eliminates the need to run mergecap afterwards.

In addition, specify the snapshot length and the buffer size to optimize
for small packets, but maybe a large burst of them at times.

For now, capture_filtered.pcap will still be a regular pcap file;
if/when tshark is added to the PTF container, tshark can then be used
for creating this file and still preserve interface information.

Signed-off-by: Saikrishna Arcot <[email protected]>

* Actually catch and ignore TimeoutExpired for the dumpcap process

Signed-off-by: Saikrishna Arcot <[email protected]>

* Wait up to 15 seconds for the pcap file to be created

Signed-off-by: Saikrishna Arcot <[email protected]>

* Add a check afterwards to see if the pcap file was actually created; if not, bail out

Signed-off-by: Saikrishna Arcot <[email protected]>

---------

Signed-off-by: Saikrishna Arcot <[email protected]>

* Pfcwd multi port: Fix for Multi-asic Scenario and test_multi_port Requirement of 3 routed ports (#16778)

Summary:

Updated the test case logic to include active interfaces from all available ASICs, preventing unnecessary skips.
Ensured that the multi-port test case selects an appropriate RX port for the second port, avoiding the use of an already stormed port.
Fixes # (issue) #16777

* Disable PFC-WD during PCBB and some wmk test improvements (#17889)

* Improve validation so multiple failures can be reported.

* Disable PFC-WD during PCBB tests.

* Wait for PFC-WD to stop.

* Remove packet aging code from pfcwd fixture.

* [ondatra] Add ThinKit-on-Ondatra support/tests. (#17720)

* [ondatra] Add ThinKit-on-Ondatra interop layer.

* [ondatra] Add ThinKit-On-Ondatra fixtures.

* [ondatra] Add ThinKit-on-Ondatra support/tests.

---------

Co-authored-by: kishanps <[email protected]>

* Reduce flakiness of test_l2_configure.py. (#17577)

What is the motivation for this PR?
Address flakes such as https://elastictest.org/scheduler/testplan/67d6bf6e607a6896f60ddd2a?testcase=l2%2ftest_l2_configure.py&type=console

        if callback.unreachable:
>           raise AnsibleConnectionFailure(
                "Host unreachable in the inventory",
                dark=callback.unreachable,
                contacted=callback.contacted,
            )
E           pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory
During config reload, the ipv4 connection can get broken and ansible will throw an exception here.
This doesn't affect the assertion of this particular test.

How did you do it?
Catch the exception.
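
Based on the exception path in the log above, the guard presumably has this shape (a sketch, not the literal diff):

```python
from pytest_ansible.errors import AnsibleConnectionFailure

try:
    duthost.shell("config reload -y")
except AnsibleConnectionFailure:
    # The IPv4 management connection can break during config reload;
    # this doesn't affect what the test asserts, so ignore it.
    pass
```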

How did you verify/test it?
10 runs of test_l2_configure.py on kvm.

* Fix dependency issue in test_vxlan_crm.py (#17996)

* [watchdog] Add the x86_64-8101_32fh_o_c01-r0 config to the platform_tests/api/watchdog.yml (#17878)

Signed-off-by: vhlushko <[email protected]>

* Fix vlan vs router mac issue with test_qos_dscp_mapping.py (#17846)

* Enable test_qos_dscp_mapping.py to check for VLAN macs, especially re dualtor topos that use an explicit VLAN mac.

* Rewrite utility to be more generic.

* Fix loopback search to do a double-break.

* Add logging details and comments.

* add hwsku Cisco-8101-V64 in cisco-8000_gb_hwskus list (#17950)

* refactor: optimize snmp intf test (#17975)

Description of PR
Optimize the snmp/test_snmp_interfaces.py test to reduce the running time on multi-asic devices.

Summary:
Fixes # (issue) Microsoft ADO 32181200

Approach
What is the motivation for this PR?
The running time of the snmp/test_snmp_interfaces.py test is too long on multi-asic devices (~130 min on a T2 device). The reason is that it will call get_snmp_facts() to get snmp_facts multiple times if it's a multi-asic device due to the enum_asic_index fixture. This is unnecessary because we only care about snmp_facts["snmp_interfaces"] and its value won't change in the entire test. Therefore, we can simply call it once and use it for all ASICs.
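
A minimal sketch of the idea, fetching the facts once at module scope and reusing them for every ASIC (the fixture name and the get_snmp_facts call shape here are assumptions):

```python
# Hypothetical sketch: query SNMP once per module instead of once per
# enum_asic_index, since snmp_interfaces doesn't change across ASICs.
import pytest


@pytest.fixture(scope="module")
def snmp_interfaces(duthost, localhost, creds_all_duts):
    creds = creds_all_duts[duthost.hostname]
    facts = get_snmp_facts(localhost, host=duthost.mgmt_ip, version="v2c",
                           community=creds["snmp_rocommunity"])
    return facts["ansible_facts"]["snmp_interfaces"]
```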

How did you do it?
How did you verify/test it?
I ran the updated code on T2 topo and can confirm it's working well: https://elastictest.org/scheduler/testplan/67fc8b281b14df63e35b863a. The running time has decreased from ~130 min to ~30 min with parallel run enabled.

Single-asic device (T1) regression test also passed: https://elastictest.org/scheduler/testplan/67fc93ed73da730645330ba4

co-authorized by: [email protected]

* fix the issue when the timezone on the DUT is not UTC (#17863)

In the ntp test case, if the timezone is behind UTC, the test will fail.
The issue was introduced by PR #17554.

* Align the sensors data for SN4280 with sonic-buildimage PR#21845 (#17747)

* Add retry when checking fec mode restore status (#17679)

Add retry when checking fec mode restore status in test_intf_fec.py

Change-Id: I4058e2a5d7472375cbfefd23e952065e45e212eb

* Enhance arp update test to support port toggle at dualtor active active testbed (#17253)

Enhance arp update test to support port toggle at dualtor active active testbed

Change-Id: I884d383b8472f64eb283f12c92f0f0b7cc4e7607

* limit parallel_run cct tasks number from 24 to 8 for fixture setup_bgp_graceful_restart (#16989)

* Update CPU threshold for telemetry test_events case (#16611)

* Add SmartSwitch HA feature test plan  (#13043)

Adding test plan for smart switch HA feature.

* Update generic hash test to support dualtor active active topology (#16217)

Update generic hash test to support dualtor active active topology

* [M1] Add M1 up/down stream neighbor type (#17978)

What is the motivation for this PR?
Add M1 up/down stream neighbor type.

How did you do it?
Add M1 up/down stream neighbor type.

How did you verify/test it?
Verified by testcase route/test_default_route.py.

* [acl/test_acl.py] Ensure frontend dut is used in ACL testing (#17582)

What is the motivation for this PR?
Previously, the ACL test picked up the loopback IP from rand_selected_dut; on T2 devices the supervisor card doesn't have a loopback IP, causing errors.

How did you do it?
Use rand_selected_front_end_dut instead of rand_selected_dut for T2

How did you verify/test it?
Ran 61 iterations of basic ACL test to make sure it passes on T2 device
Additionally:
T0: Test plan 67daa46be946313de56c8bf2: ACL changes regression test T0 - Elastictest (elastictest.org)
T1: Test plan 67daa4a42f30cd54926394de: ACL changes regression test T1 - Elastictest (elastictest.org)
T2: Test plan 67daa4ebe946313de56c8bf4: ACL changes regression test T2 - Elastictest (elastictest.org)
Signed-off-by: Javier Tan [email protected]

* [iface_namingmode/test_iface_namingmode] Ensure LLDP neighbor comes back after link flap (#17603)

What is the motivation for this PR?
test_show_lldp_table sometimes fails because of a missing lldp entry from the interface flapped in test_config_interface_state

How did you do it?
Add an assert wait_until for the lldp neighbor coming back before proceeding in test_config_interface_state

How did you verify/test it?
T0: Test plan 67da932be946313de56c8bd5: ifacenamingmode changes regression test T0 - Elastictest (elastictest.org)
T1: Test plan 67db4c7f2d7dcfaa9d43f859: ifacenamingmode changes regression test T1 - Elastictest (elastictest.org)
T2: Test plan 67da9530e3ea980065c43a91: ifacenamingmode changes regression test T2 - Elastictest (elastictest.org)

Any platform specific information?

Signed-off-by: Javier Tan [email protected]

* Fix route/test_static_route.py (#17998)

What is the motivation for this PR?
The test is flaky.

self = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>
msg = 'Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].\n========== R...65 2E  ute tests.route.\n0060  74 65 73 74                                      test\n==============================\n'

    def fail(self, msg=None):
        """Fail immediately, with the given message."""
>       raise self.failureException(msg)
E       AssertionError: Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].
E       ========== RECEIVED ==========
E       0000  EE 0E C9 56 A0 05 00 AA BB CC DD EE 08 00 45 00  ...V..........E.
E       0010  00 56 00 01 00 00 3F 06 77 9E 01 01 01 01 01 01  .V....?.w.......
E       0020  01 01 04 D2 10 E1 00 00 00 00 00 00 00 00 50 02  ..............P.
E       0030  20 00 5D 73 00 00 74 65 73 74 73 2E 72 6F 75 74   .]s..tests.rout
E       0040  65 2E 74 65 73 74 5F 73 74 61 74 69 63 5F 72 6F  e.test_static_ro
E       0050  75 74 65 20 74 65 73 74 73 2E 72 6F 75 74 65 2E  ute tests.route.
E       0060  74 65 73 74                                      test
E       ==============================

msg        = 'Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].\n========== R...65 2E  ute tests.route.\n0060  74 65 73 74                                      test\n==============================\n'
self       = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>
pfcwd/test_pfcwd_warm_reboot.py configures an IP on a PTF interface but doesn't clean it up.
route/test_static_route.py doesn't configure an IP but rather only sets up arp_responder for the IP.

If both tests pick up the same IP but map it to different MACs (either by assigning it or by setting up the ARP responder),
then for the same ARP request the DUT can receive a reply from either the kernel or arp_responder. If the kernel responds, the mismatch in the ARP table can cause misforwarding.

How did you do it?
The proposed fix is to request remove_ip_addresses to clean up the IP addresses inside the PTF (see the sketch below).
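
A sketch of what requesting that cleanup can look like (fixture shape assumed; remove_ip_addresses is the helper named above):

```python
import pytest


@pytest.fixture(scope="module", autouse=True)
def cleanup_ptf_ip_addresses(ptfhost):
    # Strip leftover IPs (e.g. from pfcwd/test_pfcwd_warm_reboot.py) off
    # the PTF interfaces before this test runs.
    ptfhost.remove_ip_addresses()
```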

How did you verify/test it?
Verified on Arista-7050CX3 with dualtor topology

* Fix telemetry/test_events.py for dualtor (#17448)

What is the motivation for this PR?
telemetry/test_events.py is flaky

The verification of DHCP-related events via telemetry involves sending traffic. In that case it is intended that:

For active-standby dualtor the randomly selected ToR is active.
For active-active dualtor the traffic doesn't go to the unselected ToR
How did you do it?
The proposed fix is to introduce the following new fixtures:

toggle_all_simulator_ports_to_enum_rand_one_per_hwsku_host_m which will make the randomly selected ToR as active in active-standby dualtor
setup_standby_ports_on_non_enum_rand_one_per_hwsku_host_m will make the unselected ToR as standby in active-active dualtor
These fixtures rely on enum_rand_one_per_hwsku_hostname for the ToR being selected.

How did you verify/test it?
Ran on Arista-7050CX3 platform with 202411 image and dualtor/dualtor-aa topologies.

* fix incompatible import of scapy and skip when no lldp neigh (#18020)

Fix incompatible import of scapy and skip when no lldp neigh in test_srv6_dataplane.py

* [CI]Add trigger type for test plan creation, add source repo and branch to test plan name (#17783)

What is the motivation for this PR?
Due to historical reasons, the test plan type is designed to be either nightly or pr. But there are other types of test plan: baseline, recover and custom test.
From the az pipeline, there are only 3 types: nightly, pr or baseline. This PR fills the trigger type based on the test plan type and build reason.
This will help to categorize the test plans.

How did you do it?
Add trigger type.
Add repo name and branch name to PR test plan.

How did you verify/test it?
The new field is already ready but always null, no impact.

The new pipeline will fill the field.

Manually set build reason to simulate a baseline test.
https://dev.azure.com/mssonic/build/_build/results?buildId=813692&view=logs&j=c4781836-bf04-5f60-aed4-f5b1830934f2&t=0a02131c-cb1d-58ae-2efd-7b654448dcb1

Any platform specific information?

Signed-off-by: Chun'ang Li <[email protected]>

* [M1] Update everflow testcase to support M1 topo (#18027)

What is the motivation for this PR?
Update everflow testcase to support M1 topo

How did you do it?
Update common functions

How did you verify/test it?
Verified by run everflow testcases on Arista-7050CX3 M1-48 testbed.

* [Snappi] - Infra change for dynamic port selection from the setup replacing variables.py file. (#15069)

Description of PR
The purpose of the pull request is dynamic port selection from the available setup rather than relying on variables.py.

The pull request adds a function snappi_port_selection in the snappi_fixtures.py file.

Summary:
Fixes # (issue)

Type of change
#13769

 Bug fix
 Testbed and Framework(new/improvement)
 Test case(new/improvement)
Back port request
 202012
 202205
 202305
 202311
 202405
Approach
What is the motivation for this PR?
Existing variables.py had following drawbacks:

Various line-cards and ports had to be manually added in this file, making it dependent on that particular setup. For a different setup, the user had to re-configure this file. This is not scalable, and it also hindered selecting setups at run-time.
variables.py did not have any provision for interface-speed selection. The user had no way to specify the speeds of the selected interfaces. For example, if the setup had both 100 and 400Gbps ports, the user would have to define two different files or create additional dictionaries to accommodate the 100 and 400Gbps interfaces separately.
If a line-card is added or removed, variables.py requires manual modification.
To counter the above drawbacks, the function snappi_port_selection is added in snappi_fixtures.py.

How did you do it?
Following are the changes and reasoning behind the changes:

Each testbed has to re-run test_pretest.py to generate a .JSON file in the tests/metadata/snappi_tests/ folder. Metadata file generation will be in the metadata/snappi_tests/ folder. This avoids modification to the current metadata folder, therefore addressing our concern of conflicting with the current code base.
Syntax:

./run_tests.sh -n TESTBED_NAME -c test_pretest.py::test_update_snappi_testbed_metadata -i ../ansible/INVENTORY,../ansible/veos -e "--topology=multidut-tgen,any --skip_sanity --trim_inv --disable_loganalyzer" -u
If the topology is not 'multi-tgen' or 'tgen', then a skip message for non-tgen topology has been added.

Function 'generate_skeleton_port_info' parses the above JSON file and creates a template to fetch port-data from the output of 'snappi_port_selection'. The skeleton parameterization format will be <speed>-<category>, for example: 400.0-single_linecard_single_asic. The reason for this change is to follow the Pytest standard of using the delimiter "-" for parameterization.

This also skips a speed-category combination if it is not available from the 'snappi_port_selection' fixture.

The conditions for skip are:

Speed or category is not in snappi_port_selection
Or snappi_port_selection return None for the combination
Function snappi_port_selection parses through all the available ports used in the testbed and generates a dictionary keyed by speed and line-card category.

The line-card combination has three available modes - single line-card single asic, single line-card multiple asic and multiple linecard.

The set of ports is determined by the fixture number_of_tx_rx ports with scope "module" defined in each test.

We no longer need setup_ports_and_dut; we can simply call snappi_testbed_config in the test itself and iterate through the available ports.

Tagging relevant reviewers:
@sdszhang , @vmittal-msft , @rawal01 , @selldinesh, @developfast

How did you verify/test it?
Snapshot of the log:

AzDevOps@68684a43ec9e:/data/tests$ python3 -m pytest --inventory ../ansible/ixia-sonic --host-pattern board71,board72,board73,board74 --testbed ixre-chassis117-t2 --testbed_file ../ansible/testbed.csv --log-cli-level info --log-file-level info --kube_master unset --showlocals -ra --show-capture stdout --junit-xml=/tmp/f.xml --skip_sanity --log-file=/tmp/f.log  --disable_loganalyzer --topology multidut-tgen,any --cache-clear snappi_tests/pfc/test_lossless_response_to_external_pause_storms.py --pdb
====================================================================================================================== test session starts =======================================================================================================================
platform linux -- Python 3.8.10, pytest-7.4.0, pluggy-1.4.0
ansible: 2.13.13
rootdir: /data/tests
configfile: pytest.ini
------------ curtailing irrelevant output ----------------
20:06:33 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:06:33 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:06:33 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:06:33 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '1', 'location': '100.117.59.187/1', 'peer_port': 'Ethernet0', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '2', 'location': '100.117.59.187/2', 'peer_port': 'Ethernet8', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '4', 'location': '100.117.59.187/4', 'peer_port': 'Ethernet0', 'peer_device': 'board74', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board74>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}]
20:06:38 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet0 with IP 20.10.1.0/31
20:06:39 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet8 with IP 20.10.1.2/31
20:06:41 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board74 with port Ethernet0 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:11:02 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet0 with ip :20.10.1.0/31
20:11:03 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet8 with ip :20.10.1.2/31
20:11:04 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board74 with port Ethernet0 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 16%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:11:04 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:11:04 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:11:04 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:11:04 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:11:04 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:11:04 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:11:04 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:11:04 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:11:04 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '1', 'location': '100.117.59.187/1', 'peer_port': 'Ethernet0', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '2', 'location': '100.117.59.187/2', 'peer_port': 'Ethernet8', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '5', 'location': '100.117.59.187/5', 'peer_port': 'Ethernet16', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}]
20:11:10 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet0 with IP 20.10.1.0/31
20:11:12 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet8 with IP 20.10.1.2/31
20:11:13 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet16 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:14:48 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet0 with ip :20.10.1.0/31
20:14:49 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet8 with ip :20.10.1.2/31
20:14:50 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet16 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 33%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:14:51 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:14:51 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:14:51 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:14:51 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:14:51 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:14:51 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:14:51 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:14:51 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:14:51 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '9.1', 'location': '100.117.59.187/9.1', 'peer_port': 'Ethernet0', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '9.2', 'location': '100.117.59.187/9.2', 'peer_port': 'Ethernet8', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '9.3', 'location': '100.117.59.187/9.3', 'peer_port': 'Ethernet144', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic1'}]
20:14:57 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet0 with IP 20.10.1.0/31
20:14:58 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet8 with IP 20.10.1.2/31
20:14:59 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet144 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:18:20 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet0 with ip :20.10.1.0/31
20:18:21 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet8 with ip :20.10.1.2/31
20:18:22 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet144 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 50%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:18:23 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:18:23 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:18:23 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:18:23 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:18:23 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:18:23 test_lossless_res…