Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

zbud-msft · 2025-07-01T23:15:02Z

Description of PR

Summary:
MS ADO: 33256194

Type of change

Back port request

Approach

What is the motivation for this PR?

This PR does the following:

Uses netstat to ensure that there is an established TCP connection between the client and server which is a more reliable check instead of pid check. This new check will help ensure that client is connected and can receive all notifications.
Ensures that we are giving enough time for default route to populate in APPL_DB after bgp sessions have been restored. There are some situations where we grab 10 updates, but default route has not been populated yet, so we miss the update. We will let the query run for longer and check less frequently to ensure that we see the default route entries after restoring bgp sessions.

How did you do it?

Code change

How did you verify/test it?

202411 test

telemetry/test_telemetry_poll.py::test_poll_mode_default_route[xxxxxxxx-False] PASSED  

2025-07-01 23:11:48.341175 response received:
sync_response: true

2025-07-01 23:11:58.355388 response received:
update {
  timestamp: 1751411518201902684
  prefix {
    target: "APPL_DB/"
  }
  update {
    path {
      elem {
        name: "ROUTE_TABLE"
      }
      elem {
        name: "0.0.0.0/0"
      }
    }
    val {                                                                                                                                                                                                                                                                                                                                                                                       json_ietf_val: "{\"ifname\":\"PortChannel101,PortChannel102,PortChannel103,PortChannel104\",\"nexthop\":\"xxxx,xxxx,xxxx,xxxx\",\"protocol\":\"bgp\",\"weight\":\"1,1,1,1\"}"
    }
  }
}

2025-07-01 23:11:58.356885 response received:
update {
  timestamp: 1751411518206189131
  prefix {
    target: "APPL_DB/"
  }
  update {
    path {                                                                                                                                                                                                                                                                                                                                                                                      elem {                                                                                                                                                                                                                                                                                                                                                                                      name: "FAKE_APPL_DB_TABLE_0"                                                                                                                                                                                                                                                                                                                                                            }
    }
    val {                                                                                                                                                                                                                                                                                                                                                                                       json_ietf_val: "{\"fake_key0\":{\"dummy0\":\"val\"}}"
    }
  }
}

Max update count reached 10

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…ute not populating

mssonicbld · 2025-07-01T23:15:10Z

/azp run

azure-pipelines · 2025-07-01T23:15:23Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2025-07-01T23:19:56Z

/azp run

azure-pipelines · 2025-07-01T23:20:09Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2025-07-01T23:26:35Z

/azp run

azure-pipelines · 2025-07-01T23:26:49Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull Request Overview

This PR improves the reliability of poll-mode telemetry tests by verifying a real TCP connection before proceeding and by extending timeouts and intervals to wait for default route propagation.

Switched from a PID check to an established‐connection check using netstat
Updated all wait_until invocations to pass duthost so the new check can access the correct port
Increased polling intervals, timeouts, and client‐join durations to accommodate slower default‐route population

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
tests/telemetry/telemetry_utils.py	Replaced PID check with `netstat`-based check; updated function signature to accept `duthost`
tests/telemetry/test_telemetry.py	Updated `wait_until` call to pass `duthost`
tests/telemetry/test_telemetry_poll.py	Updated `wait_until` calls, increased polling interval, timeouts, and join durations

Comments suppressed due to low confidence (1)

tests/telemetry/telemetry_utils.py:58

The embedded double quotes in the grep pattern may lead to shell quoting issues. Consider using single quotes (') around the pattern or escaping properly, e.g., grep ':${env.gnmi_port} .*ESTABLISHED', to ensure the command runs as intended.

    rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]

Copilot · 2025-07-02T22:45:39Z

tests/telemetry/telemetry_utils.py

+def check_gnmi_cli_running(duthost, ptfhost):
+    env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE)
+    rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]
+    return rc == "0"


Comparing the shell return code to the string "0" may always evaluate to false because rc is likely an integer. Change to return rc == 0 or convert rc to a string before comparison.

Suggested change

return rc == "0"

return rc == 0

make1980 · 2025-07-03T16:39:12Z

tests/telemetry/telemetry_utils.py

-    return len(program_list) > 0
+def check_gnmi_cli_running(duthost, ptfhost):
+    env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE)
+    rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]


ptfhost.shell(f"netstat -tn | grep ":{env.gnmi_port} .*ESTABLISHED"")

should we check the return value of shell()? if it returns None or a dictionary that doesn't contain "rc" we'll throw exception here which doesn't seem to be what we want?

Checked the result of ptfhost.shell() and verify that its not none and that "rc" is equal to 0. I think we can safely assume that shell will return a dictionary with rc.

mssonicbld · 2025-07-03T17:59:56Z

/azp run

azure-pipelines · 2025-07-03T18:00:10Z

Azure Pipelines successfully started running 1 pipeline(s).

qiluo-msft · 2025-07-03T20:56:56Z

tests/telemetry/test_telemetry_poll.py


    # Add back default route
    duthost.shell("config bgp startup all")
    pytest_assert(wait_until(60, 5, 0, verify_route_table_status, duthost, namespace, "1"),
                  "ROUTE_TABLE default route missing")

-    client_thread.join(60)
+    # Give 60 seconds for client to connect to server and then 60 for default route to populate after bgp session start
+    client_thread.join(120)


after this statement, will you check if it timeout at 120 second or successfully join earlier?

Yes, we still do

"target="APPL_DB", max_sync_count=-1, update_count=10, timeout=120, namespace=namespace)"

This lets the query run up to a maximum of 120 seconds, but if we get 10 update count successfully before then, the ptf call will return early and we will join the thread earlier. This is a max allotted time for 120, ideally we will always join earlier than that.

Fix flakiness with poll mode such as client not running or default ro…

61198e3

…ute not populating

Wait for 10 updates with 10 polling interval

3ad59f2

Fix precommit checks

878af91

zbud-msft added the Request for 202505 branch label Jul 2, 2025

zbud-msft requested a review from Copilot July 2, 2025 22:44

Copilot AI reviewed Jul 2, 2025

View reviewed changes

zbud-msft requested review from make1980 and qiluo-msft July 2, 2025 23:08

make1980 reviewed Jul 3, 2025

View reviewed changes

Address comments

23bfff4

qiluo-msft reviewed Jul 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

Uh oh!

zbud-msft commented Jul 1, 2025 •

edited

Loading

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jul 2, 2025

Uh oh!

make1980 Jul 3, 2025

Uh oh!

zbud-msft Jul 3, 2025

Uh oh!

mssonicbld commented Jul 3, 2025

Uh oh!

azure-pipelines bot commented Jul 3, 2025

Uh oh!

qiluo-msft Jul 3, 2025

Uh oh!

zbud-msft Jul 3, 2025

Uh oh!

Uh oh!

Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

Are you sure you want to change the base?

Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

Uh oh!

Conversation

zbud-msft commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of PR

Type of change

Back port request

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

mssonicbld commented Jul 1, 2025

Uh oh!

azure-pipelines bot commented Jul 1, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

make1980 Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

zbud-msft Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

mssonicbld commented Jul 3, 2025

Uh oh!

azure-pipelines bot commented Jul 3, 2025

Uh oh!

qiluo-msft Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

zbud-msft Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zbud-msft commented Jul 1, 2025 •

edited

Loading