-
Notifications
You must be signed in to change notification settings - Fork 817
[warm-reboot]Fix warm-reboot stuck issue #5313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[warm-reboot]Fix warm-reboot stuck issue #5313
Conversation
This pull request introduces 1 alert when merging ef83d2bccb4507a17f5f663fdce5fe6b46ad681e into 9ecb0e5 - view on LGTM.com new alerts:
|
@@ -558,6 +558,9 @@ def check_last_lacppdu_time(self): | |||
.get("Ethernet1", {})\ | |||
.get("details", {})\ | |||
.get("lastRxTime") | |||
if last_lacp_pdu_time == None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is None
In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and somet Example for unexpected command output: show lacp internal detailed | json { "orphanPorts": {}, "systemId": "8000,a6-21-d5-41-4a-3a", "portChannels": { "Port-Channel1": { "interfaces": {} } }, "miscLacpSysIds": [] } Example for expected command output: show lacp internal detailed | json { "orphanPorts": {}, "systemId": "8000,46-ef-1e-ef-4e-53", "portChannels": { "Port-Channel1": { "interfaces": { "Ethernet1": { "actorPortState": { "collecting": true, "distributing": true, "synchronization": true, "defaulted": false, "timeout": false, "activity": true, "expired": false, "aggregation": true }, "partnerSystemId": "FFFF,50-6b-4b-a3-53-80", "actorPortPriority": 32768, "actorPortStatus": "bundled", "actorOperKey": "0x0001", "actorPortId": 1, "details": { "actorChurnState": "noChurn", "lastRxTime": 1646835407.7603946, "allowToWarmup": false, "actorRxSmState": "rxSmCurrent", "actorMuxReason": "muxActorCollectingDistributing", "actorAdminKey": "0x0001", "actorMuxSmState": "muxSmCollectingDistributing", "actorTimeoutMultiplier": 3, "aggSelectionState": "selectedStateSelected" } } } } }, "miscLacpSysIds": [] }
This reverts commit 1d270a399e0e651430275a5708c57d44e96bf35e.
1d270a3
to
5fb4a57
Compare
@@ -558,6 +558,9 @@ def check_last_lacppdu_time(self): | |||
.get("Ethernet1", {})\ | |||
.get("details", {})\ | |||
.get("lastRxTime") | |||
if last_lacp_pdu_time is None: | |||
last_lacp_pdu_time = 0 | |||
self.log("Warn: the lastRxTime does not exist!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nhe-NV, thank you for finding and fixing this. I agree we should handle the case where lastRxTime
is not returned in the JSON. But setting the PDU time to 0 can lead to
- Test reporting incorrect values in the report (timestamp can be set to 0, which is actually not true).
- In the unlikely event when
last_lacppdu_time_before_reboot
gets set to 0 (due to this change), the collection forlast_lacppdu_time_after_reboot
is impacted, as now the code will append any value greater than 0 to the list, and this may give false values for both PDU before and after reboot.
I suggest that instead of storing the value as 0
, do no operation for the case when lastRxTime
is not returned. In that case 1) we never report timing as 0 and 2) the calculation of lacp-session time will remain based on true values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vaibhavhd I just changed the code as your suggestion, can you help to check if it is OK to fix it in this way.
if (cmd == 'cpu_going_up' or cmd == 'cpu_up') and self.collect_lacppdu_time: | ||
# control plane is back up, start polling for new lacp-pdu | ||
last_lacppdu_time_after_reboot = self.check_last_lacppdu_time() | ||
if int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot): | ||
if last_lacppdu_time_before_reboot is not None and last_lacppdu_time_after_reboot is not None and \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the change in collecting last_lacppdu_time_before_reboot
, we are not interested in comparing last_lacppdu_time_before_reboot
anymore.
Rather, the check here should be for the last acceptable (not None) value which is in `self.lacp_pdu_time_on_down[-1].
The latest change can cause problem when for ay reason last_lacppdu_time_before_reboot
is None, however we still have a non-None value stopred in list self.lacp_pdu_time_on_down
. The comparison should be done against value stored inside list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment pending response. Otherwise, LGTM. Thanks.
@liat-grozovik I re-added the 202012 label. Is there a specific reason you removed the label to pick this change on 202012? |
* Fix warm-reboot stuck issue In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck. Now when RXtime for LACPDU is not available, no operations are done, and next available timestamp is collected. This prevents exception in the thread.
* Fix warm-reboot stuck issue In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck. Now when RXtime for LACPDU is not available, no operations are done, and next available timestamp is collected. This prevents exception in the thread.
In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck.
Example for unexpected command output:
show lacp internal detailed | json
{
"orphanPorts": {},
"systemId": "8000,a6-21-d5-41-4a-3a",
"portChannels": {
"Port-Channel1": {
"interfaces": {}
}
},
"miscLacpSysIds": []
}
Example for expected command output:
show lacp internal detailed | json
{
"orphanPorts": {},
"systemId": "8000,46-ef-1e-ef-4e-53",
"portChannels": {
"Port-Channel1": {
"interfaces": {
"Ethernet1": {
"actorPortState": {
"collecting": true,
"distributing": true,
"synchronization": true,
"defaulted": false,
"timeout": false,
"activity": true,
"expired": false,
"aggregation": true
},
"partnerSystemId": "FFFF,50-6b-4b-a3-53-80",
"actorPortPriority": 32768,
"actorPortStatus": "bundled",
"actorOperKey": "0x0001",
"actorPortId": 1,
"details": {
"actorChurnState": "noChurn",
"lastRxTime": 1646835407.7603946,
"allowToWarmup": false,
"actorRxSmState": "rxSmCurrent",
"actorMuxReason": "muxActorCollectingDistributing",
"actorAdminKey": "0x0001",
"actorMuxSmState": "muxSmCollectingDistributing",
"actorTimeoutMultiplier": 3,
"aggSelectionState": "selectedStateSelected"
}
}
}
}
},
"miscLacpSysIds": []
}
Description of PR
Summary: Fix warm-reboot stuck issue
Fixes # (issue)
In sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), it will check the "lastRxTime" with "show lacp internal detailed | json", in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::check_last_lacppdu_time(), but after the reboot, sometimes, it output of "show lacp internal detailed | json" will not include "lastRxTime", so it means that the return of check_last_lacppdu_time() will be None, and in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), it will compare the return value of check_last_lacppdu_time() with coverting the value to int:
if the value is None, then the expection will happen in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), since this run function is executed in one thread, in this time, the thread will stuck, and will not repsond to anything. And it will cause the warm-reboot can not be finished. in the following code is: due to the exception, the code will stuck when execute the q.put('quit')
Type of change
Back port request
Approach
What is the motivation for this PR?
Fix warm-reboot stuck issue
How did you do it?
How did you verify/test it?
Run the warm-reboot in advanced-reboot.py, and the test will not stuch even there is no expected info in output of "show lacp internal detailed | json"
Any platform specific information?
NO
Supported testbed topology if it's a new test case?
Documentation