-
Notifications
You must be signed in to change notification settings - Fork 850
Fix flakiness with poll mode tests when client has not connected yet or default route has not populated yet #19316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ute not populating
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the reliability of poll-mode telemetry tests by verifying a real TCP connection before proceeding and by extending timeouts and intervals to wait for default route propagation.
- Switched from a PID check to an established‐connection check using
netstat
- Updated all
wait_until
invocations to passduthost
so the new check can access the correct port - Increased polling intervals, timeouts, and client‐join durations to accommodate slower default‐route population
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/telemetry/telemetry_utils.py | Replaced PID check with netstat -based check; updated function signature to accept duthost |
tests/telemetry/test_telemetry.py | Updated wait_until call to pass duthost |
tests/telemetry/test_telemetry_poll.py | Updated wait_until calls, increased polling interval, timeouts, and join durations |
Comments suppressed due to low confidence (1)
tests/telemetry/telemetry_utils.py:58
- The embedded double quotes in the
grep
pattern may lead to shell quoting issues. Consider using single quotes ('
) around the pattern or escaping properly, e.g.,grep ':${env.gnmi_port} .*ESTABLISHED'
, to ensure the command runs as intended.
rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"]
tests/telemetry/telemetry_utils.py
Outdated
def check_gnmi_cli_running(duthost, ptfhost): | ||
env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE) | ||
rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"] | ||
return rc == "0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing the shell return code to the string "0" may always evaluate to false because rc
is likely an integer. Change to return rc == 0
or convert rc
to a string before comparison.
return rc == "0" | |
return rc == 0 |
Copilot uses AI. Check for mistakes.
tests/telemetry/telemetry_utils.py
Outdated
return len(program_list) > 0 | ||
def check_gnmi_cli_running(duthost, ptfhost): | ||
env = GNMIEnvironment(duthost, GNMIEnvironment.TELEMETRY_MODE) | ||
rc = ptfhost.shell(f"netstat -tn | grep \":{env.gnmi_port} .*ESTABLISHED\"")["rc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked the result of ptfhost.shell() and verify that its not none and that "rc" is equal to 0. I think we can safely assume that shell will return a dictionary with rc.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
|
||
# Add back default route | ||
duthost.shell("config bgp startup all") | ||
pytest_assert(wait_until(60, 5, 0, verify_route_table_status, duthost, namespace, "1"), | ||
"ROUTE_TABLE default route missing") | ||
|
||
client_thread.join(60) | ||
# Give 60 seconds for client to connect to server and then 60 for default route to populate after bgp session start | ||
client_thread.join(120) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after this statement, will you check if it timeout at 120 second or successfully join earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we still do
"target="APPL_DB", max_sync_count=-1, update_count=10, timeout=120, namespace=namespace)"
This lets the query run up to a maximum of 120 seconds, but if we get 10 update count successfully before then, the ptf call will return early and we will join the thread earlier. This is a max allotted time for 120, ideally we will always join earlier than that.
Description of PR
Summary:
MS ADO: 33256194
Type of change
Back port request
Approach
What is the motivation for this PR?
This PR does the following:
Uses netstat to ensure that there is an established TCP connection between the client and server which is a more reliable check instead of pid check. This new check will help ensure that client is connected and can receive all notifications.
Ensures that we are giving enough time for default route to populate in APPL_DB after bgp sessions have been restored. There are some situations where we grab 10 updates, but default route has not been populated yet, so we miss the update. We will let the query run for longer and check less frequently to ensure that we see the default route entries after restoring bgp sessions.
How did you do it?
Code change
How did you verify/test it?
202411 test
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation