Skip to content

Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zbud-msft
Copy link
Contributor

@zbud-msft zbud-msft commented Jul 1, 2025

Description of PR

Summary:
MS ADO: 33256194

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

This PR does the following:

  1. Uses netstat to ensure that there is an established TCP connection between the client and server which is a more reliable check instead of pid check. This new check will help ensure that client is connected and can receive all notifications.

  2. Ensures that we are giving enough time for default route to populate in APPL_DB after bgp sessions have been restored. There are some situations where we grab 10 updates, but default route has not been populated yet, so we miss the update. We will let the query run for longer and check less frequently to ensure that we see the default route entries after restoring bgp sessions.

How did you do it?

Code change

How did you verify/test it?

202411 test

telemetry/test_telemetry_poll.py::test_poll_mode_default_route[xxxxxxxx-False] PASSED  

2025-07-01 23:11:48.341175 response received:
sync_response: true

2025-07-01 23:11:58.355388 response received:
update {
  timestamp: 1751411518201902684
  prefix {
    target: "APPL_DB/"
  }
  update {
    path {
      elem {
        name: "ROUTE_TABLE"
      }
      elem {
        name: "0.0.0.0/0"
      }
    }
    val {                                                                                                                                                                                                                                                                                                                                                                                       json_ietf_val: "{\"ifname\":\"PortChannel101,PortChannel102,PortChannel103,PortChannel104\",\"nexthop\":\"xxxx,xxxx,xxxx,xxxx\",\"protocol\":\"bgp\",\"weight\":\"1,1,1,1\"}"
    }
  }
}

2025-07-01 23:11:58.356885 response received:
update {
  timestamp: 1751411518206189131
  prefix {
    target: "APPL_DB/"
  }
  update {
    path {                                                                                                                                                                                                                                                                                                                                                                                      elem {                                                                                                                                                                                                                                                                                                                                                                                      name: "FAKE_APPL_DB_TABLE_0"                                                                                                                                                                                                                                                                                                                                                            }
    }
    val {                                                                                                                                                                                                                                                                                                                                                                                       json_ietf_val: "{\"fake_key0\":{\"dummy0\":\"val\"}}"
    }
  }
}

Max update count reached 10

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the reliability of poll-mode telemetry tests by verifying a real TCP connection before proceeding and by extending timeouts and intervals to wait for default route propagation.

  • Switched from a PID check to an established‐connection check using netstat
  • Updated all wait_until invocations to pass duthost so the new check can access the correct port
  • Increased polling intervals, timeouts, and client‐join durations to accommodate slower default‐route population

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/telemetry/telemetry_utils.py Replaced PID check with netstat-based check; updated function signature to accept duthost
tests/telemetry/test_telemetry.py Updated wait_until call to pass duthost
tests/telemetry/test_telemetry_poll.py Updated wait_until calls, increased polling interval, timeouts, and join durations
Comments suppressed due to low confidence (1)

tests/telemetry/telemetry_utils.py:58

  • The embedded double quotes in the grep pattern may lead to shell quoting issues. Consider using single quotes (') around the pattern or escaping properly, e.g., grep ':${env.gnmi_port} .*ESTABLISHED', to ensure the command runs as intended.
    rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]

def check_gnmi_cli_running(duthost, ptfhost):
env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE)
rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]
return rc == "0"
Copy link
Preview

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing the shell return code to the string "0" may always evaluate to false because rc is likely an integer. Change to return rc == 0 or convert rc to a string before comparison.

Suggested change
return rc == "0"
return rc == 0

Copilot uses AI. Check for mistakes.

@zbud-msft zbud-msft requested review from make1980 and qiluo-msft July 2, 2025 23:08
return len(program_list) > 0
def check_gnmi_cli_running(duthost, ptfhost):
env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE)
rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptfhost.shell(f"netstat -tn | grep ":{env.gnmi_port} .*ESTABLISHED"")

should we check the return value of shell()? if it returns None or a dictionary that doesn't contain "rc" we'll throw exception here which doesn't seem to be what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the result of ptfhost.shell() and verify that its not none and that "rc" is equal to 0. I think we can safely assume that shell will return a dictionary with rc.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).


# Add back default route
duthost.shell("config bgp startup all")
pytest_assert(wait_until(60, 5, 0, verify_route_table_status, duthost, namespace, "1"),
"ROUTE_TABLE default route missing")

client_thread.join(60)
# Give 60 seconds for client to connect to server and then 60 for default route to populate after bgp session start
client_thread.join(120)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this statement, will you check if it timeout at 120 second or successfully join earlier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we still do

"target="APPL_DB", max_sync_count=-1, update_count=10, timeout=120, namespace=namespace)"

This lets the query run up to a maximum of 120 seconds, but if we get 10 update count successfully before then, the ptf call will return early and we will join the thread earlier. This is a max allotted time for 120, ideally we will always join earlier than that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants