[SmartSwitch] Add tests for reboot of a smart switch #16566

Merged: 15 commits merged into sonic-net:master on Apr 10, 2025

Conversation

@vvolam
Contributor

commented Jan 17, 2025

Description of PR

Summary: Add sonic-mgmt tests for reboot of a smart switch and individual DPUs
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Supporting different types of reboot for a smart switch

How did you do it?

  • Extend the existing reboot() method so it can also reboot the DPUs of a smart switch.
  • Add a test case that reboots all the DPUs individually (see the sketch below).
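For illustration, a minimal sketch of the shape of such a test; the helper name (get_dpu_names) and the exact reboot CLI are assumptions, not the literal code added by this PR:

```python
# Hypothetical sketch only: helper names and CLI invocations below are
# illustrative assumptions, not the literal code added by this PR.
import logging

logger = logging.getLogger(__name__)


def get_dpu_names(duthost):
    """List DPU module names reported by the smart switch NPU."""
    lines = duthost.shell("show chassis modules status")["stdout_lines"]
    return [line.split()[0] for line in lines if line.startswith("DPU")]


def test_reboot_each_dpu(duthost, localhost):
    """Reboot every DPU, one at a time, and expect the switch to survive."""
    for dpu in get_dpu_names(duthost):
        logger.info("Rebooting %s", dpu)
        # Assumed CLI form for rebooting a single DPU module.
        duthost.shell("sudo reboot -d {}".format(dpu))
        # A real test would wait here for the DPU to come back online
        # and assert its oper status before moving to the next one.
```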

How did you verify/test it?

Verified on an NVIDIA 4280 smart switch.

Any platform specific information?

Smartswitch topology

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@nissampa
Contributor

lgtm

@oleksandrivantsiv
Contributor

@congh-nvidia, @JibinBao please review

@vvolam vvolam requested a review from nissampa April 2, 2025 03:02
@vvolam vvolam requested a review from theasianpianist April 2, 2025 03:02

@prabhataravind
Contributor

left a comment

LGTM overall. Please test on all smartswitch vendors.

@rlhui rlhui merged commit 91ddf8e into sonic-net:master Apr 10, 2025
18 checks passed
kamalsahu0001 added a commit to kamalsahu0001/sonic-mgmt that referenced this pull request Apr 22, 2025
* Update snappi_fixtures.py

updated to incorporate new snappi build changes

* Update traffic_generation.py

updated for new snappi build changes

* Update traffic_generation.py

updated capture code

* Add test to verify db_migrator with DNS_NAMESERVER (#17639)

Approach
What is the motivation for this PR?
There's a test gap: we don't have a test to verify db_migrator

How did you do it?
This test modifies CONFIG_DB, runs db_migrator, and verifies that DNS_NAMESERVER comes from minigraph or golden config.

test_migrate_dns_02: there's minigraph.xml and dns.j2, and there's no golden config. After migration, DNS_NAMESERVER is present in CONFIG_DB, because db_migrator can migrate from minigraph.
test_migrate_dns_03 is used to reproduce the SonicQosProfile issue: there's minigraph.xml and dns.j2, I added SonicQosProfile in minigraph.xml, and there's no golden config. After migration, there's no DNS_NAMESERVER in CONFIG_DB, because db_migrator can't migrate from minigraph.
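
As a rough illustration of the verification step described above (the command forms are assumptions, not quotes from the test):

```python
# Hypothetical sketch: run the migrator, then assert DNS_NAMESERVER
# landed in CONFIG_DB (command forms assumed for illustration).
duthost.shell("sudo /usr/local/bin/db_migrator.py -o migrate")
keys = duthost.shell("sonic-db-cli CONFIG_DB keys 'DNS_NAMESERVER|*'")["stdout_lines"]
assert keys, "DNS_NAMESERVER was not migrated into CONFIG_DB"
```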
How did you verify/test it?
Run end to end test

* Fix pfcwd/test_pfcwd_function.py for dualtor topologies (#17833)

What is the motivation for this PR?
pfcwd/test_pfcwd_function.py::TestPfcwdFunc::test_pfcwd_actions is flaky and fails with the following signature.

======================================================================
FAIL: pfc_wd.PfcWdTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ptftests/py3/pfc_wd.py", line 148, in runTest
    return verify_packet_any_port(self, masked_exp_pkt, dst_port_list)
  File "/root/env-python3/lib/python3.7/site-packages/ptf/testutils.py", line 3437, in verify_packet_any_port
    % (result.port, device_number, ports, result.format())
AssertionError: Received expected packet on port 1 for device 0, but it should have arrived on one of these ports: [23].
========== RECEIVED ==========
0000  82 FD E1 7F 90 01 00 AA BB CC DD EE 08 00 45 0D  ..............E.
0010  00 56 00 01 00 00 3F 06 1B DF 64 5B 3A B0 C0 A8  .V....?...d[:...
0020  00 02 EA F5 27 6F 00 00 00 00 00 00 00 00 50 02  ....'o........P.
0030  20 00 21 87 00 00 00 01 02 03 04 05 06 07 08 09   .!.............
0040  0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19  ................
0050  1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29  ...... !"#$%&'()
0060  2A 2B 2C 2D                                      *+,-
==============================
How did you do it?
The test randomly selects a dst_port but always assigns the IP 192.168.0.2 to it. In dualtor topologies there is a notion of static/fixed IP addresses on the ToR's side:

admin@ld301:~$ show mux config
SWITCH_NAME    PEER_TOR
-------------  ----------
ld302          10.1.0.33
port        state    ipv4             ipv6
----------  -------  ---------------  -----------------
Ethernet4   auto     192.168.0.2/32   fc02:1000::2/128
Ethernet8   auto     192.168.0.3/32   fc02:1000::3/128
Ethernet12  auto     192.168.0.4/32   fc02:1000::4/128
Ethernet16  auto     192.168.0.5/32   fc02:1000::5/128
Ethernet20  auto     192.168.0.6/32   fc02:1000::6/128
Ethernet24  auto     192.168.0.7/32   fc02:1000::7/128
Ethernet28  auto     192.168.0.8/32   fc02:1000::8/128
Ethernet32  auto     192.168.0.9/32   fc02:1000::9/128
Ethernet36  auto     192.168.0.10/32  fc02:1000::a/128
Ethernet40  auto     192.168.0.11/32  fc02:1000::b/128
Ethernet44  auto     192.168.0.12/32  fc02:1000::c/128
Ethernet48  auto     192.168.0.13/32  fc02:1000::d/128
Ethernet52  auto     192.168.0.14/32  fc02:1000::e/128
Ethernet56  auto     192.168.0.15/32  fc02:1000::f/128
Ethernet60  auto     192.168.0.16/32  fc02:1000::10/128
Ethernet64  auto     192.168.0.17/32  fc02:1000::11/128
Ethernet68  auto     192.168.0.18/32  fc02:1000::12/128
Ethernet72  auto     192.168.0.19/32  fc02:1000::13/128
Ethernet76  auto     192.168.0.20/32  fc02:1000::14/128
Ethernet80  auto     192.168.0.21/32  fc02:1000::15/128
Ethernet84  auto     192.168.0.22/32  fc02:1000::16/128
Ethernet88  auto     192.168.0.23/32  fc02:1000::17/128
Ethernet92  auto     192.168.0.24/32  fc02:1000::18/128
Ethernet96  auto     192.168.0.25/32  fc02:1000::19/128
Due to this, the packet sometimes ends up being forwarded to Ethernet4 (port 1) instead of the port expected by the test.

The proposed fix is that, for dualtor alone, the destination IP is chosen according to the mux config for the interface selected as the dst_port.
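
A minimal sketch of that selection, assuming the mux entries come from the DUT's MUX_CABLE config (per the 'show mux config' output above); not the literal diff:

```python
# Hypothetical sketch: on dualtor, pick the destination IP from the mux
# config of the chosen dst_port instead of hardcoding 192.168.0.2.
def select_dst_ip(duthost, dst_port_iface, is_dualtor, default_ip="192.168.0.2"):
    if not is_dualtor:
        return default_ip
    mux_cfg = duthost.get_running_config_facts().get("MUX_CABLE", {})
    entry = mux_cfg.get(dst_port_iface)
    # Entries carry the fixed server IP, e.g. {"server_ipv4": "192.168.0.2/32"}.
    return entry["server_ipv4"].split("/")[0] if entry else default_ip
```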

How did you verify/test it?
Ran all pfcwd tests on Arista-7260CX3 with dualtor-120 topology.

* Refine baseline pipeline yml and fix error (#17499)

What is the motivation for this PR?
Baseline testplan names are different from those of PR testing, but it's better to let them share the same name, which will make Kusto queries easier.
t0-sonic test didn't pass VM_TYPE to the elastictest template, which caused t0-sonic deploy failures.
t0-sonic and dpu tests lost specific params.

How did you do it?
Refine the baseline pipeline yml so that the testplan name has the same build reason as PR tests.
Pass VM_TYPE to the elastictest template.
Add the specific params for the t0-sonic and dpu tests.

* Choose correct vlan ip for 2vlan config in advance_reboot (#17831)

What is the motivation for this PR?
There are 2 Vlans on the t0-118 topology. We observe that the ptftest launched from upgrade_path tests defaults to using the 192.169.0.0/22 IP for Vlan1000, and the test fails with "DUT is not ready" because packets sent by the PTF get no response from the DUT.

However, by switching to 192.168.0.0/25 for Vlan2000, upgrade_path no longer fails on "DUT is not ready" and is able to pass a normal warm upgrade.

How did you do it?
Call the common helper functions get_vlan_interface_list and get_vlan_interface_info to get the vlan interface and its ipv4 address.

How did you verify/test it?
Run platform_tests.test_advanced_reboot on T0 testbeds.

Any platform specific information?
T0 platforms

* skip dynamic_acl on platform x86_64-8101_32fh_o_c01-r0 (#17848)

* refactor: optimize mgmt ipv6 only test (#17851)

Description of PR
Optimize the ip/test_mgmt_ipv6_only.py test module with Python multithreading.

Summary:
Fixes # (issue) Microsoft ADO 30056122

Approach
What is the motivation for this PR?
The ip/test_mgmt_ipv6_only.py test takes a long time to finish on a multi-DUT device, for example ~100 min on a T2 device, so we wanted to optimize it with Python multithreading to reduce the running time (see the sketch below).
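
A rough sketch of what the multithreading can look like (stdlib concurrent.futures shown; whether the test uses this or sonic-mgmt's own parallel helpers is an assumption):

```python
# Hypothetical sketch: run the per-DUT verification concurrently instead
# of sequentially across a multi-DUT testbed.
from concurrent.futures import ThreadPoolExecutor


def run_on_all_duts(duthosts, check_fn, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion so exceptions raised in workers propagate.
        return list(pool.map(check_fn, duthosts))
```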

How did you do it?
How did you verify/test it?
I ran the updated code on a multi-DUT device and verified that the running time was reduced to ~50 min: Elastictest link

Besides, I also verified the change on T0 and dualtor:

T0: https://elastictest.org/scheduler/testplan/67f05c6787ffab7db692a20b?testcase=ip%2Ftest_mgmt_ipv6_only.py&type=console&leftSideViewMode=detail
dualtor: https://elastictest.org/scheduler/testplan/67f05c8d40a6f1f300f5363e?leftSideViewMode=detail&testcase=ip%2Ftest_mgmt_ipv6_only.py&type=console

co-authorized by: [email protected]

* feat: support trimming lab inv file (#17348)

Description of PR
Support trimming the inventory files such as ansible/lab, ansible/t2_lab etc when passing --trim_inv option.

Summary:
Fixes # (issue) Microsoft ADO 30056122

Approach
What is the motivation for this PR?
When we enable inventory trimming by passing the --trim_inv option, the current logic only trims the ansible/veos file. We noticed that the other inventory files (such as ansible/lab) should also be trimmed: they contain the configs of all the devices in that lab, while we only need the configs related to the current test run. Therefore, we decided to support trimming these inventory files as well.

Please note that the PDU & Fanout hosts trimming is not supported in this PR as it's currently blocked by #17347

How did you do it?
How did you verify/test it?
I ran the new trimming logic on various lab files and can confirm it's working well:

https://elastictest.org/scheduler/testplan/67c7ad505048655bf9cf8a58
https://elastictest.org/scheduler/testplan/67c78be48dcac0cdc64a3998
https://elastictest.org/scheduler/testplan/67c78cc7f60a7a79ff1ae585
https://elastictest.org/scheduler/testplan/67c78c9c8dcac0cdc64a399c
https://elastictest.org/scheduler/testplan/67c7b419d0bae94c81d8a9d6
https://elastictest.org/scheduler/testplan/67ca846a5048655bf9cf8f7b
Any platform specific information?

co-authorized by: [email protected]

* Add multi-asic support for test-intf-fec (#17814)

Description of PR
Summary: Add multi-ASIC support for test-intf-fec. This is possible with the utility command update in sonic-net/sonic-utilities#3819
Fixes # (issue) 28838870

Approach
What is the motivation for this PR?
Described

How did you do it?
Update the command from sonic-net/sonic-utilities#3819 and update the code base so that it works with T2, for 202405.

Please note that for a release branch to work internally, the following PRs need to be included:

#17183
#14661
#16424
#15481

How did you verify/test it?
T2 platform verified

Signed-off-by: Austin Pham <[email protected]>

---------

Signed-off-by: Austin Pham <[email protected]>

* warm boot to config save before reboot (#17849)

* [KubeSonic] Add gnmi to container_upgrade (#17796)

Approach
What is the motivation for this PR?
We need to verify gnmi feature after container upgrade

How did you do it?
Add gnmi and gnmi_watchdog to container upgrade

How did you verify/test it?
Run container upgrade pipeline

* Update pfcwd_multi_node_helper.py

updated to support new snappi model

* [performance_meter] add swss create time criteria (#17740)

What is the motivation for this PR?
Need a check for the time spent in swss create switch

How did you do it?
Add a new success criterion checking for the occurrence of swss create switch start and end

How did you verify/test it?
Run test on 7215 devices

* [mcx] fix bug with mcx deployment script (#17841)

What is the motivation for this PR?
Fix a non-working mcx deployment script.

How did you do it?
Fix iteritems
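
(For context, dict.iteritems() exists only in Python 2; a toy sketch of the standard Python 3 replacement:)

```python
mapping = {"port": "Ethernet0"}
# Python 2 spelled this: for k, v in mapping.iteritems():
for k, v in mapping.items():  # Python 3 replacement
    print(k, v)
```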

How did you verify/test it?
Deploy mcx with new script

* [port_util] Add port alias-to-name mapping for Arista-7050CX3-32S-S128 (#17877)

What is the motivation for this PR?
Add port alias-to-name mapping for Arista-7050CX3-32S-S128

How did you do it?
Update port_utils.py.

How did you verify/test it?
Verified by deploying a testbed.

* Update pfcwd_runtime_traffic_helper.py

updated file to accommodate new snappi changes.

* Update pfcwd_burst_storm_helper.py

updated file to accommodate new snappi changes

* Update pfcwd_basic_helper.py

updated files to accommodate snappi changes

* [dualtor] update template to latest (#17879)

What is the motivation for this PR?
The old template is not up to date and does not match the changes in vm_topo results. Update it so the generated minigraph works.

How did you do it?
Copy the section from minigraph_dpg.j2

How did you verify/test it?
Run yang validation on generated minigraph.

* Fixed swss feature name for test_lldp_neighbor_post_orchagent_reboot (#15715)

What is the motivation for this PR?
The test test_lldp_neighbor_post_orchagent_reboot fails on multi-asic systems. The test tries to disable the autorestart feature for swss by using the namespace container name, e.g., swss0, swss1, etc.

For 'config feature autorestart disable', it needs to use 'swss' as the global feature name.

How did you do it?
Changed code to use 'swss' as feature name without using namespace id

How did you verify/test it?
Ran sonic-mgmt test_lldp.py

---------

Signed-off-by: Anand Mehra [email protected]

* Add a fixture to enable nat for dpus (#17753)

1. Enable nat for dpus on smartswitch

* Ignore subnet decap test when no portchannels found (#17810)

What is the motivation for this PR?
Solve IndexError: list index out of range in dut_port = list(mg_facts['minigraph_portchannels'].keys())[0] because minigraph_portchannels is empty.

How did you do it?
This checks if any portchannels exist before attempting to access them, preventing the IndexError.
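
A minimal sketch of that guard, assuming pytest.skip is the mechanism (consistent with the SKIPPED summary below):

```python
import pytest

# mg_facts comes from the test's minigraph facts, as in the quoted line.
portchannels = list(mg_facts["minigraph_portchannels"].keys())
if not portchannels:
    pytest.skip("No portchannels found in minigraph")
dut_port = portchannels[0]
```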

How did you verify/test it?
========== short test summary info ==========
SKIPPED [4] decap/test_subnet_decap.py:207: No portchannels found in minigraph
========== 4 skipped, 1 warning in 797.40s (0:13:17) ==========
Any platform specific information?
str4-sn5600-1

* [sonic-mgmt][dualtor-aa] Fix flakiness of fdb/test_fdb_mac_learning.py (#17873)

What is the motivation for this PR?
After link bringup, it takes some time for the mux status to become consistent in the dualtor-aa topology (i.e. SERVER_STATUS is 'unknown'). It's not a test-specific issue; similar behaviour is seen on DUTs where dualtor-aa is deployed.

How did you do it?
Increase the timeout to 300 secs (currently it's 150 secs) to fix the flakiness.

* Increase timeout to 5 in verify_packet_any_port for background traffic (#17904)

What is the motivation for this PR?
The test is giving us a false negative

msg        = 'Did not receive expected packet on any of ports [7, 13, 17, 30, 27, 25, 5, 34, 21, 16, 24, 1, 33, 12, 4, 20, 2, 0, 11... 01  .............0..\n0050  00 AA BB CC DD EE                                ......\n==============================\n'
self       = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>

/usr/lib/python3.8/unittest/case.py:753: AssertionError
Although on a closer look we found that the DUT is forwarding the packet within a reasonable duration of time, for some reason testutils.verify_packet_any_port takes longer to detect it.

There is also another issue which doesn't cause any failure but defeats the purpose of testing. In case of active-active dualtor we call setup_standby_ports_on_rand_unselected_tor_unconditionally to put the system in active-standby mode. If this is called after background_traffic, then the background traffic flows through the unselected ToR, which is not desired.

How did you do it?
Increase the timeout to 5s from system default for testutils.verify_packet_any_port
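
For reference, ptf's testutils.verify_packet_any_port takes a timeout argument, so the first change presumably has this shape (variable names taken from the traceback above; ptfadapter is the sonic-mgmt test adapter fixture):

```python
from ptf import testutils

# Explicit 5s timeout instead of ptf's system default.
result = testutils.verify_packet_any_port(
    ptfadapter, masked_exp_pkt, dst_port_list, timeout=5
)
```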

Make the order of fixture execution deterministic so that setup_standby_ports_on_rand_unselected_tor_unconditionally is called before background_traffic

How did you verify/test it?
Verified on Arista-7050CX3 with dualtor-aa topology.

* Disable all bmp table after test to avoid potential impact to other test cases. (#17910)

Disable all bmp table after test to avoid potential impact to other test cases

Description of PR
Work item tracking
Microsoft ADO (number only):32206168

Approach
What is the motivation for this PR?
Disable all bmp table after test to avoid potential impact to other test cases

How did you do it?
Disable all relevant bmp tables via config cli after each test.

How did you verify/test it?
kvm test verified.

Any platform specific information?

* Make lossyqueuevoq check platform/hwskus. (#17726)

* Configure macsec rekey period on EOS hosts (#17811)

What is the motivation for this PR?
Macsec::TestControlPlane::test_rekey_by_period tests fail when EOS is selected as the key-server.
How did you do it?
If the rekey-period is non-zero, we configure the rekey period on the EOS host.
How did you verify/test it?
Sonic-mgmt Macsec::TestControlPlane::test_rekey_by_period tests are passing with the above change.

* [M1] Add doc for M1 topology announce routes (#17905)

Summary:
Add doc for M1 topology announce routes.

* [SmartSwitch] Add tests for reboot of a smart switch (#16566)

Add sonic-mgmt tests for reboot of a smart switch and individual DPUs

* Rewrite platform_tests/broadcom/test_ser.py (#17381)

* Rewrite ser test

Rewrite the SER injection test to use the internal broadcom command
instead of doing the SER injection manually.

Skipping for TH5 skus as they do not have this functionality at the moment.

* Rewrite ser test: PR edits

Use "stdout_lines" instead of "stdout" for ser output parsing and
adjust Arista-7060X6 conditions to include Github issue

* [dhcp_relay] Optimize log for test_dhcp_relay (#17906)

What is the motivation for this PR?
Add log for test_dhcp_relay for triaging issue

How did you do it?
Add log for test_dhcp_relay for triaging issue

How did you verify/test it?
Run the test and find the log files below

* Revert "[dhcp_relay] Remove test_dhcp_relay test in t0-2vlans (#17208)" (#17676)

This reverts commit 1762bc28f8ccdbde3cedd83ceb2f76204b2f2e17.

* Skip test_reload_configuration_checks on Cisco platform (#17868)

* Skip test_reload_configuration_checks on Cisco platform

* Revise

* update d18u8s4 PT0 ASN to 4 bytes (#17888)

What is the motivation for this PR?
Fix topo error.

How did you do it?
How did you verify/test it?
admin@sonic:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 3, using 672 bytes of memory
Peers 12, using 8903712 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4       64600          0          0         0      0       0  never      Active          ARISTA01T1
10.0.0.59      4       64600          0          0         0      0       0  never      Active          ARISTA02T1
10.0.0.61      4       64600          0          0         0      0       0  never      Active          ARISTA03T1
10.0.0.63      4       64600          0          0         0      0       0  never      Active          ARISTA04T1
10.0.0.65      4       64600          0          0         0      0       0  never      Active          ARISTA05T1
10.0.0.67      4       64600          0          0         0      0       0  never      Active          ARISTA06T1
10.0.0.69      4       64600          0          0         0      0       0  never      Active          ARISTA07T1
10.0.0.71      4       64600          0          0         0      0       0  never      Active          ARISTA08T1
10.0.0.157     4  4200000000          0          0         0      0       0  never      Active          ARISTA01PT0
10.0.0.159     4  4200000001          0          0         0      0       0  never      Active          ARISTA02PT0
10.0.0.161     4  4200000002          0          0         0      0       0  never      Active          ARISTA03PT0
10.0.0.163     4  4200000003          0          0         0      0       0  never      Active          ARISTA04PT0

Total number of neighbors 12
admin@sonic:~$ show ipv6 bgp summary

IPv6 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 3, using 672 bytes of memory
Peers 12, using 8903712 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00::7a       4       64600          0          0         0      0       0  never      Active          ARISTA03T1
fc00::7e       4       64600          0          0         0      0       0  never      Active          ARISTA04T1
fc00::8a       4       64600          0          0         0      0       0  never      Active          ARISTA07T1
fc00::8e       4       64600          0          0         0      0       0  never      Active          ARISTA08T1
fc00::17a      4  4200000002          0          0         0      0       0  never      Active          ARISTA03PT0
fc00::17e      4  4200000003          0          0         0      0       0  never      Active          ARISTA04PT0
fc00::72       4       64600          0          0         0      0       0  never      Active          ARISTA01T1
fc00::76       4       64600          0          0         0      0       0  never      Active          ARISTA02T1
fc00::82       4       64600          0          0         0      0       0  never      Active          ARISTA05T1
fc00::86       4       64600          0          0         0      0       0  never      Active          ARISTA06T1
fc00::172      4  4200000000          0          0         0      0       0  never      Active          ARISTA01PT0
fc00::176      4  4200000001          0          0         0      0       0  never      Active          ARISTA02PT0

Total number of neighbors 12
admin@sonic:~$

* [dualtor_io] Add test_tor_switchover_impact test (#15262)

* [dualtor_io] Add test_tor_switchover_impact test

Test will send traffic from T1 -> server and perform switchover. It will then
collect the logs and process the results to test_tor_switchover_impact.json

Any disruptions that break the threshold will cause test failure.

Signed-off-by: Nikola Dancejic <[email protected]>

* [test_switchover_impact] Moved to new file and refactored

Steps:
1. set up ipv4 and ipv6 neighbors. default 10 ipv4 and 64 ipv6.
2. set dut to active.
3. start traffic test.
4. switch interface to standby.
5. record and validate results.

By default the test runs 100 iterations, taking around 3 hours. The test
will fail if one of the following conditions occurs:
- Traffic drop exceeds the threshold (100ms for planned, 400ms for unplanned).
- Switchover metrics on at least one of the duts do not match the measured traffic impact within the threshold (100ms for planned, 400ms for unplanned).
- Metrics on either device are not present.
- There are multiple disruptions during a single switchover.

Signed-off-by: Nikola Dancejic <[email protected]>

* Update tests_mark_conditions.yaml

switchover impact test takes hours to complete, skip until we set up a way to make it run weekly

* Update tests_mark_conditions.yaml

fixing order of conditions for switchover_impact

---------

Signed-off-by: Nikola Dancejic <[email protected]>

* Fix srv6/test_srv6_dataplane.py (#17896)

Fix srv6/test_srv6_dataplane.py

* Fix pl test to handle outbound_direction_lookup (#17764)

* Fix pl test to handle outbound_direction_lookup #17764
* Default mac for direction lookup is src_mac, so outbound_direction_lookup needs to be explicitly set to "dst_mac"

* Only print the matched syslog in loganalyzer teardown check, no traceback info printed (#17926)

What is the motivation for this PR?
To make the failed summary of the loganalyzer teardown check shorter and clearer. It makes the summary easy to understand, and downstream failure analyzers can do analysis based on clean summaries.

The summary when a case fails in the loganalyzer teardown phase:
Before change:

E               Failed: Processes "['analyze_logs--<MultiAsicSonicHost str-msn4700-02>']" failed with exit code "1"
E               Exception:
E               match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
E               2025 Apr  9 02:42:13.609855 str-msn4700-02 ERR kernel: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98).
        
E               Traceback:
E               Traceback (most recent call last):
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/helpers/parallel.py", line 35, in run
E                   Process.run(self)
E                 File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
E                   self._target(*self._args, **self._kwargs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/helpers/parallel.py", line 245, in wrapper
E                   target(*args, **kwargs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/__init__.py", line 45, in analyze_logs
E                   dut_analyzer.analyze(markers[node.hostname], fail_test, store_la_logs=store_la_logs)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/loganalyzer.py", line 409, in analyze
E                   self._verify_log(analyzer_summary)
E                 File "/var/src/sonic-mgmt_vms11-t1-4700-6_6670b1f1e72b94cbabbdfe65/tests/common/plugins/loganalyzer/loganalyzer.py", line 140, in _verify_log
E                   raise LogAnalyzerError(result_str)
E               tests.common.plugins.loganalyzer.loganalyzer.LogAnalyzerError: match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
DEBUG:tests.conftest:[log_custom_msg] item: <Function test_bfd_multihop[ipv6]>
INFO:root:Can not get Allure report URL. Please check logs
E               2025 Apr  9 02:42:13.609855 str-msn4700-02 ERR kernel: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98).
After change:

E               Failed: Got matched syslog in processes "analyze_logs--<MultiAsicSonicHost bjw2-can-7260-10>" exit code:"1"
E               match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
E               2025 Apr 10 08:21:06.808698 bjw2-can-7260-10 ERR admin: [ 1820.284908] sxd_kernel: [error] Failed to bind BFD socket to local_addr (ip:104.0.0.74 ,port:49282) (err:-98)
How did you do it?
Check if the failed process is analyze_log and whether there are Match Messages in the exception; if so, print only the exception. There's no need to print the traceback, which keeps the summary shorter and clearer.

How did you verify/test it?
Run a case failed in loganalyzer teardown phase, check the summary of the failed case
Signed-off-by: Zhaohui Sun <[email protected]>

* [dualtor_io] Allow duplications for link down downstream I/O (#17909)

What is the motivation for this PR?
The following two link failure cases are failing on Cisco/MLNX:

test_active_link_down_downstream_active
test_active_link_down_downstream_active_soc
The reason is that, after link down, between the fdb flush and tunnel route add (due to mux toggle-to-standby), the ASIC has no l2 information for server/soc neighbors, downstream traffic will flood to all vlan member ports on Cisco/MLNX platform.
Those two testcases have no tolerance for packet duplications because, on the Broadcom platform, traffic to neighbors with no L2 information is simply dropped.

Let's adapt to the Cisco/MLNX platforms by allowing packet duplications for those two testcases.

How did you verify/test it?
dualtor_io/test_link_failure.py::test_active_link_down_downstream_active[active-active] PASSED [100%]
dualtor_io/test_link_failure.py::test_active_link_down_downstream_active_soc[active-active] PASSED [100%]

Signed-off-by: Longxiang Lyu <[email protected]>

* Revert "Skip test_vnet_decap on Cisco-8000 with 202411 (#17776)" (#17941)

This reverts commit 02ff5e8cb77feb4201b003d70e472f17b93d9db5.

* Support Ubuntu 24.04 server in KVM (#17883)

What is the motivation for this PR?
When running with an Ubuntu 24.04 host, many tasks from a sonic-mgmt container may fail.
One reason is that the newer version of Python used in Ubuntu 24.04 doesn't support some of the packages that ansible modules attempt to import. For example, the imp module was removed in Python 3.12, so importlib needs to be used instead; PR #16039 did some replacement, but there are still places left to replace.
Another reason is that some ansible modules must run in a python3 venv for security reasons.
Also, there are some access issues to fix.

How did you do it?
Use importlib instead of importing imp (see the sketch after this list)
Install python3-venv in Ubuntu 24.04 server
Update libvirt qemu configuration in Ubuntu 24.04 server
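
(A sketch of the standard imp-to-importlib migration for loading a module from a file path, as referenced in the first item above:)

```python
# imp.load_source("my_module", path) was removed with imp in Python 3.12;
# this importlib recipe is the documented replacement.
import importlib.util

spec = importlib.util.spec_from_file_location("my_module", "/path/to/my_module.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```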
* fix

* remove

---------

Co-authored-by: xwjiang2021 <[email protected]>

* add hwsku V64 (#17897)

* Default the inner dscp to outer dscp map to be 1-1. (#17860)

* Add dualtor fixtures to no_traffic test. (#17916)

* add template t0-isolated-d96u32s2-leaf.j2 and Arista-7060X6-16PE-384C-O128S2 into th5 hwsku  (#17898)

add moby hwsku into sonic-mgmt

Description of PR
Summary:
Fix the template for the d96u32s2 topo. Also add the missing hwsku to the list.

port_utils.py change is not included as the final port layout may still change, pending on sonic-net/sonic-buildimage#22086

* Restore config after vxlan_crm from vxlan_ecmp. (#17767)

What is the motivation for this PR?
test_vxlan_crm.py was recently enabled for smartswitch; it uses the same setup fixture as test_vxlan_ecmp.py. However, the tests in test_vxlan_ecmp.py have additional cleanup code to handle these BFD entries, and that cleanup is missing in test_vxlan_crm.py.

How did you do it?
Added a config restore fixture, same as in #17046, to test_vxlan_ecmp.py which is re-used by test_vxlan_crm.py

How did you verify/test it?
Ran test_cacl_application.py after running test_vxlan_crm.py and it's now passing with the fix in place. Also checked that iptable is cleaned up after running test_vxlan_crm.py.

* Fix loganalyzer regex to ignore chronyd related ERR logs (#17880)

What is the motivation for this PR?
PR #16239 ignored an error message coming from ntpd (actually written by nss_tacplus). As that PR explains, this can happen when a config reload is run, and chrony calls getpwnam when /etc/tacplus_nss.conf is still being regenerated. However, it is not a failure.

How did you do it?
With the change of NTP daemons, this loganalyzer ignore regex needs to be updated to match the right chrony log.

How did you verify/test it?
By running sonic-mgmt tests that perform config reload and verifying that these errors do not cause loganalyzer to mark the tests as failed.

Signed-off-by: Prabhat Aravind <[email protected]>

* AddCluster Test (#17744)

Approach
What is the motivation for this PR?
Validate Add Cluster Incremental Config Update.

How did you do it?
Remove one of downstream T1 from minigraph
Load updated minigraph
Apply AddCluster json patch
Check Port Status, BGP etc.
How did you verify/test it?
Ran on a Nokia-IXR7250E-36x400G chassis linecard.

Any platform specific information?
This test is only verified on Nokia-IXR7250E-36x400G; for other SKUs, we have not verified it yet.

Supported testbed topology if it's a new test case?
T2 only, since the add-cluster scenario is only applicable to T2.

* Add test_srv6_uN_no_vlan_flooding test case (#17861)

Summary: This is to add a test case to verify that after enabling proxy_arp, there is no L2 flooding in downstream VLAN even though the switch does not have a FDB entry to forward a downstream SRv6 packet.

* Revert "Revert "[dhcp_relay] Remove test_dhcp_relay test in t0-2vlans (#17208)" (#17676)" (#17946)

This reverts commit 53e58628642db1dacc0c5374f63e3a4246c7c86c.

* Revert "Disable all bmp table after test to avoid potential impact to other t…" (#17969)

Reverts #17910

* [M1/M2/M3] Skip test_null_route_helper on M1/M2/M3 topo (#17962)

* [pretest] update collect_dut_lossless_prio method (#17907)

Fix method collect_dut_lossless_prio.
The pfc_enable parameter can be empty.
Without this change we can get error:

>   result = [int(x) for x in port_qos_map[intf]['pfc_enable'].split(',')]
E   ValueError: invalid literal for int() with base 10: ''
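
A minimal sketch of the guard this implies (the exact diff isn't quoted in the message):

```python
# Filter out empty strings so int('') can't raise ValueError when
# pfc_enable is empty.
pfc_enable = port_qos_map[intf].get('pfc_enable', '')
result = [int(x) for x in pfc_enable.split(',') if x]
```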
PRs with related changes:
sonic-net/sonic-buildimage#22252
sonic-net/sonic-buildimage#22067

Signed-off-by: AntonHryshchuk <[email protected]>

* [Mellanox] Update qos sai test for SN56xx buffer configuration alignment (#17788)

Update QoS sai test according to the PR: sonic-net/sonic-buildimage#22067

* Increase timeout for gathering sonic hosts facts (#17945)

What is the motivation for this PR?
Gathering facts runs commands on sonic hosts and gets the platform information from the results. But in dualtor, hosts may need more time to return the results, which caused pretest failures.

How did you do it?
Increase the timeout in gathering facts and check whether the success rate increases over a week.

* Enable bmp table dump before all test cases (#17963)

What is the motivation for this PR?
Enable bmp table dump before all test cases as pre-condition

How did you do it?
Update ansible/config_sonic_basedon_testbed.yml and add one block.

How did you verify/test it?
validation will cover it.

* [GCU] Fixing argument passed to format_json function for multi-asic case (#17096)

* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic
* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic - ip bgp suite
* Adding fix for the namespace that is sent as argument in format_json_patch_for_multiasic - portchannel suite

* Fix BFD status check and ipv6 PTF intermittent issue. (#17819)

* Fix BFD status check and ipv6 PTF intermittent issue.

* Add RFC comment.

* Skip nasa proc check because the usage of nasa is always high and that is expected (#17681)

* Fix test_radv_ipv6_ra (#17275)

Signed-off-by: Ze Gan <[email protected]>

* Improve arp_responder.py performance (#17280)

* Optimize performance of arp_responder by reusing sockets

Instead of having scapy.sendp create a socket, bind it, send the packet,
and then close the socket every time, just reuse the listening socket
that's been created.
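
A sketch of the socket-reuse idea using scapy's conf.L2socket (the actual arp_responder code isn't quoted here):

```python
# Hypothetical sketch: open one L2 socket per interface at startup and
# reuse it for every reply, instead of scapy.sendp() opening, binding,
# and closing a fresh socket per packet.
from scapy.all import conf

send_sock = conf.L2socket(iface="eth0")  # created once


def send_reply(pkt):
    send_sock.send(pkt)  # reuses the already-bound socket
```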

Signed-off-by: Saikrishna Arcot <[email protected]>

* Use libpcap backend for scapy in arp_responder

Use the libpcap backend in arp_responder, which sets up a ring buffer
for incoming packets. This is expected to be more efficient
than making many recvmsg syscalls, especially during ARP/NDP floods.

Signed-off-by: Saikrishna Arcot <[email protected]>

* Enhance BPF filter for arp_responder

Enhance the BPF filter for arp_responder to only get ARP/ICMP packets
for the IPv4/IPv6 addresses that it is interested in. This is especially
useful in large VLANs with many hosts.

Signed-off-by: Saikrishna Arcot <[email protected]>

---------

Signed-off-by: Saikrishna Arcot <[email protected]>

* Switch from tcpdump to dumpcap (#17276)

* Switch from tcpdump to dumpcap

This is (hopefully) more efficient on the system by being a single
process (but multiple threads for each interface), and also stores
information about which interface each packet was seen on. This also
eliminates the need to run mergecap afterwards.

In addition, specify the snapshot length and the buffer size to optimize
for small packets, but maybe a large burst of them at times.

For now, capture_filtered.pcap will still be a regular pcap file;
if/when tshark is added to the PTF container, tshark can then be used
for creating this file and still preserve interface information.

Signed-off-by: Saikrishna Arcot <[email protected]>

* Actually catch and ignore TimeoutExpired for the dumpcap process

Signed-off-by: Saikrishna Arcot <[email protected]>

* Wait up to 15 seconds for the pcap file to be created

Signed-off-by: Saikrishna Arcot <[email protected]>

* Add a check afterwards to see if the pcap file was actually created; if not, bail out

Signed-off-by: Saikrishna Arcot <[email protected]>

---------

Signed-off-by: Saikrishna Arcot <[email protected]>

* Pfcwd multi port: Fix for Multi-asic Scenario and test_multi_port Requirement of 3 routed ports (#16778)

Summary:

Updated the test case logic to include active interfaces from all available ASICs, preventing unnecessary skips.
Ensured that the multi-port test case selects an appropriate RX port for the second port, avoiding the use of an already stormed port.
Fixes # (issue) #16777

* Disable PFC-WD during PCBB and some wmk test improvements (#17889)

* Improve validation so multiple failures can be reported.

* Disable PFC-WD during PCBB tests.

* Wait for PFC-WD to stop.

* Remove packet aging code from pfcwd fixture.

* [ondatra] Add ThinKit-on-Ondatra support/tests. (#17720)

* [ondatra] Add ThinKit-on-Ondatra interop layer.

* [ondatra] Add ThinKit-On-Ondatra fixtures.

* [ondatra] Add ThinKit-on-Ondatra support/tests.

---------

Co-authored-by: kishanps <[email protected]>

* Reduce flakiness of test_l2_configure.py. (#17577)

What is the motivation for this PR?
Address flakes such as https://elastictest.org/scheduler/testplan/67d6bf6e607a6896f60ddd2a?testcase=l2%2ftest_l2_configure.py&type=console

        if callback.unreachable:
>           raise AnsibleConnectionFailure(
                "Host unreachable in the inventory",
                dark=callback.unreachable,
                contacted=callback.contacted,
            )
E           pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory
During config reload, the ipv4 connection can get broken and ansible will throw an exception here.
This doesn't affect the assertion of this particular test.

How did you do it?
Catch the exception.
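
Based on the exception path in the log above, the guard presumably has this shape (a sketch, not the literal diff):

```python
from pytest_ansible.errors import AnsibleConnectionFailure

try:
    duthost.shell("config reload -y")
except AnsibleConnectionFailure:
    # The IPv4 management connection can break during config reload;
    # this doesn't affect what the test asserts, so ignore it.
    pass
```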

How did you verify/test it?
10 runs of test_l2_configure.py on kvm.

* Fix dependency issue in test_vxlan_crm.py (#17996)

* [watchdog] Add the x86_64-8101_32fh_o_c01-r0 config to the platform_tests/api/watchdog.yml (#17878)

Signed-off-by: vhlushko <[email protected]>

* Fix vlan vs router mac issue with test_qos_dscp_mapping.py (#17846)

* Enable test_qos_dscp_mapping.py to check for VLAN macs, especially re dualtor topos that use an explicit VLAN mac.

* Rewrite utility to be more generic.

* Fix loopback search to do a double-break.

* Add logging details and comments.

* add hwsku Cisco-8101-V64 in cisco-8000_gb_hwskus list (#17950)

* refactor: optimize snmp intf test (#17975)

Description of PR
Optimize the snmp/test_snmp_interfaces.py test to reduce the running time on multi-asic devices.

Summary:
Fixes # (issue) Microsoft ADO 32181200

Approach
What is the motivation for this PR?
The running time of the snmp/test_snmp_interfaces.py test is too long on multi-asic devices (~130 min on a T2 device). The reason is that it will call get_snmp_facts() to get snmp_facts multiple times if it's a multi-asic device due to the enum_asic_index fixture. This is unnecessary because we only care about snmp_facts["snmp_interfaces"] and its value won't change in the entire test. Therefore, we can simply call it once and use it for all ASICs.
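
A minimal sketch of the idea, fetching the facts once at module scope and reusing them for every ASIC (the fixture name and the get_snmp_facts call shape here are assumptions):

```python
# Hypothetical sketch: query SNMP once per module instead of once per
# enum_asic_index, since snmp_interfaces doesn't change across ASICs.
import pytest


@pytest.fixture(scope="module")
def snmp_interfaces(duthost, localhost, creds_all_duts):
    creds = creds_all_duts[duthost.hostname]
    facts = get_snmp_facts(localhost, host=duthost.mgmt_ip, version="v2c",
                           community=creds["snmp_rocommunity"])
    return facts["ansible_facts"]["snmp_interfaces"]
```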

How did you do it?
How did you verify/test it?
I ran the updated code on T2 topo and can confirm it's working well: https://elastictest.org/scheduler/testplan/67fc8b281b14df63e35b863a. The running time has decreased from ~130 min to ~30 min with parallel run enabled.

Single-asic device (T1) regression test also passed: https://elastictest.org/scheduler/testplan/67fc93ed73da730645330ba4

co-authorized by: [email protected]

* fix the issue when the timezone on the DUT is not UTC (#17863)

In the ntp test case, if the timezone is behind UTC, the test will fail.
The issue was introduced by PR #17554.

* Align the sensors data for SN4280 with sonic-buildimage PR#21845 (#17747)

* Add retry when checking fec mode restore status (#17679)

Add retry when checking fec mode restore status in test_intf_fec.py

Change-Id: I4058e2a5d7472375cbfefd23e952065e45e212eb

* Enhance arp update test to support port toggle at dualtor active active testbed (#17253)

Enhance arp update test to support port toggle at dualtor active active testbed

Change-Id: I884d383b8472f64eb283f12c92f0f0b7cc4e7607

* limit parallel_run cct tasks number from 24 to 8 for fixture setup_bgp_graceful_restart (#16989)

* Update CPU threshold for telemetry test_events case (#16611)

* Add SmartSwitch HA feature test plan  (#13043)

Adding test plan for smart switch HA feature.

* Update generic hash test to support dualtor active active topology (#16217)

Update generic hash test to support dualtor active active topology

* [M1] Add M1 up/down stream neighbor type (#17978)

What is the motivation for this PR?
Add M1 up/down stream neighbor type.

How did you do it?
Add M1 up/down stream neighbor type.

How did you verify/test it?
Verified by testcase route/test_default_route.py.

* [acl/test_acl.py] Ensure frontend dut is used in ACL testing (#17582)

What is the motivation for this PR?
Previously, the ACL test picked up the loopback IP from rand_selected_dut; on T2 devices the supervisor card doesn't have a loopback IP, causing errors.

How did you do it?
Use rand_selected_front_end_dut instead of rand_selected_dut for T2

How did you verify/test it?
Ran 61 iterations of basic ACL test to make sure it passes on T2 device
Additionally:
T0: Test plan 67daa46be946313de56c8bf2: ACL changes regression test T0 - Elastictest (elastictest.org)
T1: Test plan 67daa4a42f30cd54926394de: ACL changes regression test T1 - Elastictest (elastictest.org)
T2: Test plan 67daa4ebe946313de56c8bf4: ACL changes regression test T2 - Elastictest (elastictest.org)
Signed-off-by: Javier Tan [email protected]

* [iface_namingmode/test_iface_namingmode] Ensure LLDP neighbor comes back after link flap (#17603)

What is the motivation for this PR?
test_show_lldp_table sometimes fails because of a missing lldp entry from the interface flapped in test_config_interface_state

How did you do it?
Add an assert wait_until for the lldp neighbor coming back before proceeding in test_config_interface_state

How did you verify/test it?
T0: Test plan 67da932be946313de56c8bd5: ifacenamingmode changes regression test T0 - Elastictest (elastictest.org)
T1: Test plan 67db4c7f2d7dcfaa9d43f859: ifacenamingmode changes regression test T1 - Elastictest (elastictest.org)
T2: Test plan 67da9530e3ea980065c43a91: ifacenamingmode changes regression test T2 - Elastictest (elastictest.org)

Any platform specific information?

Signed-off-by: Javier Tan [email protected]

* Fix route/test_static_route.py (#17998)

What is the motivation for this PR?
The test is flaky.

self = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>
msg = 'Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].\n========== R...65 2E  ute tests.route.\n0060  74 65 73 74                                      test\n==============================\n'

    def fail(self, msg=None):
        """Fail immediately, with the given message."""
>       raise self.failureException(msg)
E       AssertionError: Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].
E       ========== RECEIVED ==========
E       0000  EE 0E C9 56 A0 05 00 AA BB CC DD EE 08 00 45 00  ...V..........E.
E       0010  00 56 00 01 00 00 3F 06 77 9E 01 01 01 01 01 01  .V....?.w.......
E       0020  01 01 04 D2 10 E1 00 00 00 00 00 00 00 00 50 02  ..............P.
E       0030  20 00 5D 73 00 00 74 65 73 74 73 2E 72 6F 75 74   .]s..tests.rout
E       0040  65 2E 74 65 73 74 5F 73 74 61 74 69 63 5F 72 6F  e.test_static_ro
E       0050  75 74 65 20 74 65 73 74 73 2E 72 6F 75 74 65 2E  ute tests.route.
E       0060  74 65 73 74                                      test
E       ==============================

msg        = 'Received expected packet on port 5 for device 0, but it should have arrived on one of these ports: [1].\n========== R...65 2E  ute tests.route.\n0060  74 65 73 74                                      test\n==============================\n'
self       = <tests.common.plugins.ptfadapter.ptfadapter.PtfTestAdapter testMethod=runTest>
pfcwd/test_pfcwd_warm_reboot.py configures an IP on a PTF interface but doesn't clean it up.
route/test_static_route.py doesn't configure an IP but rather only sets up arp_responder for the IP.

If both tests pick up the same IP but map it to different MACs (either by assigning it or by setting up the ARP responder),
then for the same ARP request the DUT can receive a reply from either the kernel or arp_responder. If the kernel responds, the mismatch in the ARP table can cause misforwarding.

How did you do it?
The proposed fix is to request remove_ip_addresses to clean up the IP addresses inside the PTF (see the sketch below).
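
A sketch of what requesting that cleanup can look like (fixture shape assumed; remove_ip_addresses is the helper named above):

```python
import pytest


@pytest.fixture(scope="module", autouse=True)
def cleanup_ptf_ip_addresses(ptfhost):
    # Strip leftover IPs (e.g. from pfcwd/test_pfcwd_warm_reboot.py) off
    # the PTF interfaces before this test runs.
    ptfhost.remove_ip_addresses()
```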

How did you verify/test it?
Verified on Arista-7050CX3 with dualtor topology

* Fix telemetry/test_events.py for dualtor (#17448)

What is the motivation for this PR?
telemetry/test_events.py is flaky

The verification of DHCP-related events via telemetry involves sending traffic. In that case it is intended that:

For active-standby dualtor the randomly selected ToR is active.
For active-active dualtor the traffic doesn't go to the unselected ToR
How did you do it?
The proposed fix is to introduce the following new fixtures:

toggle_all_simulator_ports_to_enum_rand_one_per_hwsku_host_m which will make the randomly selected ToR as active in active-standby dualtor
setup_standby_ports_on_non_enum_rand_one_per_hwsku_host_m will make the unselected ToR as standby in active-active dualtor
These fixtures rely on enum_rand_one_per_hwsku_hostname for the ToR being selected.

How did you verify/test it?
Ran on Arista-7050CX3 platform with 202411 image and dualtor/dualtor-aa topologies.

* fix incompatible import of scapy and skip when no lldp neigh (#18020)

Fix incompatible import of scapy and skip when no lldp neigh in test_srv6_dataplane.py

* [CI]Add trigger type for test plan creation, add source repo and branch to test plan name (#17783)

What is the motivation for this PR?
Due to historical reasons, the test plan type is designed to be either nightly or pr. But there are other types of test plan: baseline, recover and custom test.
From the az pipeline, there are only 3 types: nightly, pr or baseline. This PR fills the trigger type based on the test plan type and build reason.
This will help to categorize the test plans.

How did you do it?
Add trigger type.
Add repo name and branch name to PR test plan.

How did you verify/test it?
The new field is already ready but always null, no impact.

The new pipeline will fill the field.

Manually set build reason to simulate a baseline test.
https://dev.azure.com/mssonic/build/_build/results?buildId=813692&view=logs&j=c4781836-bf04-5f60-aed4-f5b1830934f2&t=0a02131c-cb1d-58ae-2efd-7b654448dcb1

Any platform specific information?

Signed-off-by: Chun'ang Li <[email protected]>

* [M1] Update everflow testcase to support M1 topo (#18027)

What is the motivation for this PR?
Update everflow testcase to support M1 topo

How did you do it?
Update common functions

How did you verify/test it?
Verified by run everflow testcases on Arista-7050CX3 M1-48 testbed.

* [Snappi] - Infra change for dynamic port selection from the setup replacing variables.py file. (#15069)

Description of PR
The purpose of the pull request is dynamic port selection from the available setup rather than relying on variables.py.

The pull request adds a function snappi_port_selection in the snappi_fixtures.py file.

Summary:
Fixes # (issue)

Type of change
#13769

 Bug fix
 Testbed and Framework(new/improvement)
 Test case(new/improvement)
Back port request
 202012
 202205
 202305
 202311
 202405
Approach
What is the motivation for this PR?
Existing variables.py had following drawbacks:

Various line-cards and ports had to be manually added in this file, making it dependent on that particular setup. For a different setup, the user had to re-configure this file. This is not scalable, and it also hindered selecting setups at run-time.
variables.py did not have any provision for interface-speed selection. The user had no way to specify the speeds of the selected interfaces. For example, if the setup had both 100 and 400Gbps ports, the user would have to define two different files or create additional dictionaries to accommodate the 100 and 400Gbps interfaces separately.
If a line-card is added or removed, variables.py requires manual modification.
To counter the above drawbacks, the function snappi_port_selection is added in snappi_fixtures.py.

How did you do it?
Following are the changes and reasoning behind the changes:

Each testbed has to re-run test_pretest.py to generate a .JSON file in the tests/metadata/snappi_tests/ folder. Metadata file generation will be in the metadata/snappi_tests/ folder. This avoids modification to the current metadata folder, therefore addressing our concern of conflicting with the current code base.
Syntax:

./run_tests.sh -n TESTBED_NAME -c test_pretest.py::test_update_snappi_testbed_metadata -i ../ansible/INVENTORY,../ansible/veos -e "--topology=multidut-tgen,any --skip_sanity --trim_inv --disable_loganalyzer" -u
If the topology is not 'multi-tgen' or 'tgen', then a skip message for non-tgen topology has been added.

Function 'generate_skeleton_port_info' parses the above JSON file and creates a template to fetch port-data from the output of 'snappi_port_selection'. The skeleton parameterization format will be <speed>-<category>, for example: 400.0-single_linecard_single_asic. The reason for this change is to follow the Pytest standard of using the delimiter "-" for parameterization.

This also skips a speed-category combination if it is not available from the 'snappi_port_selection' fixture.

The conditions for skip are:

Speed or category is not in snappi_port_selection
Or snappi_port_selection return None for the combination
Function snappi_port_selection parses through all the available ports used in the testbed and generates a dictionary keyed by speed and line-card category.

The line-card combination has three available modes - single line-card single asic, single line-card multiple asic and multiple linecard.

The set of ports is determined by the fixture number_of_tx_rx ports with scope "module" defined in each test.

We no longer need setup_ports_and_dut; we can simply call snappi_testbed_config in the test itself and iterate through the available ports.

Tagging relevant reviewers:
@sdszhang , @vmittal-msft , @rawal01 , @selldinesh, @developfast

How did you verify/test it?
Snapshot of the log:

AzDevOps@68684a43ec9e:/data/tests$ python3 -m pytest --inventory ../ansible/ixia-sonic --host-pattern board71,board72,board73,board74 --testbed ixre-chassis117-t2 --testbed_file ../ansible/testbed.csv --log-cli-level info --log-file-level info --kube_master unset --showlocals -ra --show-capture stdout --junit-xml=/tmp/f.xml --skip_sanity --log-file=/tmp/f.log  --disable_loganalyzer --topology multidut-tgen,any --cache-clear snappi_tests/pfc/test_lossless_response_to_external_pause_storms.py --pdb
====================================================================================================================== test session starts =======================================================================================================================
platform linux -- Python 3.8.10, pytest-7.4.0, pluggy-1.4.0
ansible: 2.13.13
rootdir: /data/tests
configfile: pytest.ini
------------ curtailing irrelevant output ----------------
20:06:33 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:06:33 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:06:33 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:06:33 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '1', 'location': '100.117.59.187/1', 'peer_port': 'Ethernet0', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '2', 'location': '100.117.59.187/2', 'peer_port': 'Ethernet8', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '4', 'location': '100.117.59.187/4', 'peer_port': 'Ethernet0', 'peer_device': 'board74', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board74>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}]
20:06:38 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet0 with IP 20.10.1.0/31
20:06:39 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet8 with IP 20.10.1.2/31
20:06:41 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board74 with port Ethernet0 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:11:02 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet0 with ip :20.10.1.0/31
20:11:03 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet8 with ip :20.10.1.2/31
20:11:04 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board74 with port Ethernet0 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 16%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:11:04 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[400.0-multiple_linecard_multiple_asic]
20:11:04 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:11:04 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:11:04 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:11:04 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:11:04 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:11:04 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:11:04 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:11:04 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '1', 'location': '100.117.59.187/1', 'peer_port': 'Ethernet0', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '2', 'location': '100.117.59.187/2', 'peer_port': 'Ethernet8', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '5', 'location': '100.117.59.187/5', 'peer_port': 'Ethernet16', 'peer_device': 'board73', 'speed': '400000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board73>, 'snappi_speed_type': 'speed_400_gbps', 'asic_value': 'asic0'}]
20:11:10 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet0 with IP 20.10.1.0/31
20:11:12 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet8 with IP 20.10.1.2/31
20:11:13 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board73 with port Ethernet16 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:14:48 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet0 with ip :20.10.1.0/31
20:14:49 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet8 with ip :20.10.1.2/31
20:14:50 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board73 with port Ethernet16 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 33%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:14:51 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[400.0-single_linecard_single_asic]
20:14:51 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:14:51 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:14:51 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:14:51 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:14:51 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:14:51 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:14:51 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:14:51 test_lossless_response_to_external_pause L0070 INFO   | Ports:[{'ip': '100.117.59.187', 'port_id': '9.1', 'location': '100.117.59.187/9.1', 'peer_port': 'Ethernet0', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '9.2', 'location': '100.117.59.187/9.2', 'peer_port': 'Ethernet8', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic0'}, {'ip': '100.117.59.187', 'port_id': '9.3', 'location': '100.117.59.187/9.3', 'peer_port': 'Ethernet144', 'peer_device': 'board71', 'speed': '100000', 'intf_config_changed': False, 'api_server_ip': '10.251.30.110', 'asic_type': 'broadcom', 'duthost': <MultiAsicSonicHost board71>, 'snappi_speed_type': 'speed_100_gbps', 'asic_value': 'asic1'}]
20:14:57 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet0 with IP 20.10.1.0/31
20:14:58 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet8 with IP 20.10.1.2/31
20:14:59 snappi_fixtures.__intf_config_multidut   L0934 INFO   | Configuring Dut: board71 with port Ethernet144 with IP 20.10.1.4/31
--------------- curtailed irrelevant output ----------
20:18:20 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet0 with ip :20.10.1.0/31
20:18:21 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet8 with ip :20.10.1.2/31
20:18:22 snappi_fixtures.cleanup_config           L1159 INFO   | Removing Configuration on Dut: board71 with port Ethernet144 with ip :20.10.1.4/31
PASSED                                                                                                                                                                                                                                                     [ 50%]
----------------------------------------------------------------------------------------------------------------------- live log teardown ------------------------------------------------------------------------------------------------------------------------
20:18:23 __init__.pytest_runtest_teardown         L0049 INFO   | collect memory after test test_lossless_response_to_external_pause_storms_test[100.0-single_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_teardown         L0072 INFO   | After test: collected memory_values {'before_test': {}, 'after_test': {}}

snappi_tests/multidut/pfc/test_lossless_response_to_external_pause_storms.py::test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic] 
------------------------------------------------------------------------------------------------------------------------- live log setup -------------------------------------------------------------------------------------------------------------------------
20:18:23 __init__.set_default                     L0053 INFO   | Completeness level not set during test execution. Setting to default level: CompletenessLevel.basic
20:18:23 __init__.check_test_completeness         L0151 INFO   | Test has no defined levels. Continue without test completeness checks
20:18:23 __init__.loganalyzer                     L0051 INFO   | Log analyzer is disabled
20:18:23 __init__.store_fixture_values            L0017 INFO   | store memory_utilization test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_setup            L0024 INFO   | collect memory before test test_lossless_response_to_external_pause_storms_test[100.0-multiple_linecard_multiple_asic]
20:18:23 __init__.pytest_runtest_setup            L0044 INFO   | Before test: collected memory_values {'before_test': {}, 'after_test': {}}
------------------------------------------------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------------------------------------------------
20:18:23 test_lossless_res…