Skip to content

[warm-reboot]Fix warm-reboot stuck issue #5313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 25, 2022

Conversation

nhe-NV
Copy link
Contributor

@nhe-NV nhe-NV commented Mar 11, 2022

In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck.

Example for unexpected command output:
show lacp internal detailed | json
{
"orphanPorts": {},
"systemId": "8000,a6-21-d5-41-4a-3a",
"portChannels": {
"Port-Channel1": {
"interfaces": {}
}
},
"miscLacpSysIds": []
}

Example for expected command output:
show lacp internal detailed | json
{
"orphanPorts": {},
"systemId": "8000,46-ef-1e-ef-4e-53",
"portChannels": {
"Port-Channel1": {
"interfaces": {
"Ethernet1": {
"actorPortState": {
"collecting": true,
"distributing": true,
"synchronization": true,
"defaulted": false,
"timeout": false,
"activity": true,
"expired": false,
"aggregation": true
},
"partnerSystemId": "FFFF,50-6b-4b-a3-53-80",
"actorPortPriority": 32768,
"actorPortStatus": "bundled",
"actorOperKey": "0x0001",
"actorPortId": 1,
"details": {
"actorChurnState": "noChurn",
"lastRxTime": 1646835407.7603946,
"allowToWarmup": false,
"actorRxSmState": "rxSmCurrent",
"actorMuxReason": "muxActorCollectingDistributing",
"actorAdminKey": "0x0001",
"actorMuxSmState": "muxSmCollectingDistributing",
"actorTimeoutMultiplier": 3,
"aggSelectionState": "selectedStateSelected"
}
}
}
}
},
"miscLacpSysIds": []
}

Description of PR

Summary: Fix warm-reboot stuck issue
Fixes # (issue)
In sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), it will check the "lastRxTime" with "show lacp internal detailed | json", in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::check_last_lacppdu_time(), but after the reboot, sometimes, it output of "show lacp internal detailed | json" will not include "lastRxTime", so it means that the return of check_last_lacppdu_time() will be None, and in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), it will compare the return value of check_last_lacppdu_time() with coverting the value to int:

                if cmd == 'cpu_down':
                    last_lacppdu_time_before_reboot = self.check_last_lacppdu_time()
                    self.lacp_pdu_time_on_down.append(last_lacppdu_time_before_reboot)
                if (cmd == 'cpu_going_up' or cmd == 'cpu_up') and self.collect_lacppdu_time:
                    # control plane is back up, start polling for new lacp-pdu
                    last_lacppdu_time_after_reboot = self.check_last_lacppdu_time()
                    if **int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)**:
                        self.lacp_pdu_time_on_up.append(last_lacppdu_time_after_reboot)
                        self.collect_lacppdu_time = False # post-reboot lacp-pdu is received, stop the polling

if the value is None, then the expection will happen in sonic-mgmt/ansible/roles/test/files/ptftests/arista.py::run(), since this run function is executed in one thread, in this time, the thread will stuck, and will not repsond to anything. And it will cause the warm-reboot can not be finished. in the following code is: due to the exception, the code will stuck when execute the q.put('quit')

    def handle_post_reboot_health_check(self):
        # wait until all bgp session are established
        self.log("Wait until bgp routing is up on all devices")
        for _, q in self.ssh_jobs:
            **q.put('quit')**

        def wait_for_ssh_threads(signal):
            while any(thr.is_alive() for thr, _ in self.ssh_jobs) and not signal.is_set():
                self.log('Waiting till SSH threads stop')
                time.sleep(self.TIMEOUT)

            for thr, _ in self.ssh_jobs:
                thr.join()

        self.timeout(wait_for_ssh_threads, self.task_timeout, "SSH threads haven't finished for %d seconds" % self.task_timeout)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012

Approach

What is the motivation for this PR?

Fix warm-reboot stuck issue

How did you do it?

How did you verify/test it?

Run the warm-reboot in advanced-reboot.py, and the test will not stuch even there is no expected info in output of "show lacp internal detailed | json"

Any platform specific information?

NO

Supported testbed topology if it's a new test case?

Documentation

@nhe-NV nhe-NV requested a review from a team as a code owner March 11, 2022 03:02
@lgtm-com
Copy link

lgtm-com bot commented Mar 11, 2022

This pull request introduces 1 alert when merging ef83d2bccb4507a17f5f663fdce5fe6b46ad681e into 9ecb0e5 - view on LGTM.com

new alerts:

  • 1 for Testing equality to None

@liat-grozovik
Copy link
Collaborator

@nhe-NV LGTM but please fix new alerts from LFTM so we can signoff the PR.
@yxieca FYI

@@ -558,6 +558,9 @@ def check_last_lacppdu_time(self):
.get("Ethernet1", {})\
.get("details", {})\
.get("lastRxTime")
if last_lacp_pdu_time == None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is None

@yxieca yxieca requested a review from vaibhavhd March 15, 2022 23:26
nhe-NV added 4 commits March 19, 2022 09:47
In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and somet

Example for unexpected command output:
    show lacp internal detailed | json
    {
        "orphanPorts": {},
        "systemId": "8000,a6-21-d5-41-4a-3a",
        "portChannels": {
            "Port-Channel1": {
                "interfaces": {}
            }
        },
        "miscLacpSysIds": []
    }

Example for expected command output:
    show lacp internal detailed | json
    {
        "orphanPorts": {},
        "systemId": "8000,46-ef-1e-ef-4e-53",
        "portChannels": {
            "Port-Channel1": {
                "interfaces": {
                    "Ethernet1": {
                        "actorPortState": {
                            "collecting": true,
                            "distributing": true,
                            "synchronization": true,
                            "defaulted": false,
                            "timeout": false,
                            "activity": true,
                            "expired": false,
                            "aggregation": true
                        },
                        "partnerSystemId": "FFFF,50-6b-4b-a3-53-80",
                        "actorPortPriority": 32768,
                        "actorPortStatus": "bundled",
                        "actorOperKey": "0x0001",
                        "actorPortId": 1,
                        "details": {
                            "actorChurnState": "noChurn",
                            "lastRxTime": 1646835407.7603946,
                            "allowToWarmup": false,
                            "actorRxSmState": "rxSmCurrent",
                            "actorMuxReason": "muxActorCollectingDistributing",
                            "actorAdminKey": "0x0001",
                            "actorMuxSmState": "muxSmCollectingDistributing",
                            "actorTimeoutMultiplier": 3,
                            "aggSelectionState": "selectedStateSelected"
                        }
                    }
                }
            }
        },
        "miscLacpSysIds": []
    }
This reverts commit 1d270a399e0e651430275a5708c57d44e96bf35e.
@nhe-NV nhe-NV force-pushed the fix_warm_reboot_stuck_issue branch from 1d270a3 to 5fb4a57 Compare March 19, 2022 01:56
@@ -558,6 +558,9 @@ def check_last_lacppdu_time(self):
.get("Ethernet1", {})\
.get("details", {})\
.get("lastRxTime")
if last_lacp_pdu_time is None:
last_lacp_pdu_time = 0
self.log("Warn: the lastRxTime does not exist!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nhe-NV, thank you for finding and fixing this. I agree we should handle the case where lastRxTime is not returned in the JSON. But setting the PDU time to 0 can lead to

  1. Test reporting incorrect values in the report (timestamp can be set to 0, which is actually not true).
  2. In the unlikely event when last_lacppdu_time_before_reboot gets set to 0 (due to this change), the collection for last_lacppdu_time_after_reboot is impacted, as now the code will append any value greater than 0 to the list, and this may give false values for both PDU before and after reboot.

I suggest that instead of storing the value as 0, do no operation for the case when lastRxTime is not returned. In that case 1) we never report timing as 0 and 2) the calculation of lacp-session time will remain based on true values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vaibhavhd I just changed the code as your suggestion, can you help to check if it is OK to fix it in this way.

if (cmd == 'cpu_going_up' or cmd == 'cpu_up') and self.collect_lacppdu_time:
# control plane is back up, start polling for new lacp-pdu
last_lacppdu_time_after_reboot = self.check_last_lacppdu_time()
if int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot):
if last_lacppdu_time_before_reboot is not None and last_lacppdu_time_after_reboot is not None and \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the change in collecting last_lacppdu_time_before_reboot, we are not interested in comparing last_lacppdu_time_before_reboot anymore.
Rather, the check here should be for the last acceptable (not None) value which is in `self.lacp_pdu_time_on_down[-1].

The latest change can cause problem when for ay reason last_lacppdu_time_before_reboot is None, however we still have a non-None value stopred in list self.lacp_pdu_time_on_down. The comparison should be done against value stored inside list.

@vaibhavhd
Copy link
Contributor

@nhe-NV , I also raised this another PR to prevent the testcase from getting hung forever:
#5365
Your change is still meaningful, but we should also timeout the test inside PTF, and report all hung issues as an error.
This would help in letting subsequent tests continure.
Please review the PR.

Copy link
Contributor

@vaibhavhd vaibhavhd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment pending response. Otherwise, LGTM. Thanks.

@vaibhavhd
Copy link
Contributor

@liat-grozovik I re-added the 202012 label. Is there a specific reason you removed the label to pick this change on 202012?

@liat-grozovik
Copy link
Collaborator

liat-grozovik commented Mar 24, 2022 via email

@vaibhavhd
Copy link
Contributor

IMO there is no branch of sonic mgmt of 202012 Is there such a plan? Get Outlook for iOShttps://aka.ms/o0ukef

________________________________ From: Vaibhav Hemant Dixit @.> Sent: Thursday, March 24, 2022 6:56:37 PM To: Azure/sonic-mgmt @.> Cc: Liat Grozovik @.>; Mention @.> Subject: Re: [Azure/sonic-mgmt] [warm-reboot]Fix warm-reboot stuck issue (PR #5313) @liat-grozovikhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fliat-grozovik&data=04%7C01%7Cliatg%40nvidia.com%7Cfd78ee337cc14db826ef08da0db742ee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637837378001408089%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=tZGA%2BJJ6FYh5VpZ2V5mMEhv0e1tGFz2dibz4D4uGZq8%3D&reserved=0 I re-added the 202012 label. Is there a specific reason you removed the label to pick this change on 202012? — Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fsonic-mgmt%2Fpull%2F5313%23issuecomment-1077836493&data=04%7C01%7Cliatg%40nvidia.com%7Cfd78ee337cc14db826ef08da0db742ee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637837378001408089%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=QMUTzp%2BiLYHyY1ZeUWbAoadt3HapVGhtuBT%2F2oeD%2FP8%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAKTABA6UM3ZPKR6F3YO7JN3VBSNELANCNFSM5QONISMA&data=04%7C01%7Cliatg%40nvidia.com%7Cfd78ee337cc14db826ef08da0db742ee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637837378001408089%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=2yBNg9ZSeLp%2BG0W%2BY7JNr%2FySaBgm9Km51ebrDOa3M1E%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

A few weeks ago we cut 202012 branch - https://github.com/Azure/sonic-mgmt/tree/202012

@vaibhavhd vaibhavhd merged commit 65a3a84 into sonic-net:master Mar 25, 2022
wangxin pushed a commit that referenced this pull request Mar 27, 2022
* Fix warm-reboot stuck issue
In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck.

Now when RXtime for LACPDU is not available, no operations are done, and next available timestamp is collected. This prevents exception in the thread.
xwjiang-ms pushed a commit to xwjiang-ms/sonic-mgmt that referenced this pull request Apr 13, 2022
* Fix warm-reboot stuck issue
In self.check_last_lacppdu_time() function, it will get the lastRxTime from the command "show lacp internal detailed | json" on arista, and sometimes the cli command will not return the info includ lastRxTime, then self.check_last_lacppdu_time() will return None, in the arista ssh thread, it will use the return value with converting it to int, if the return value is None, then the "int(last_lacppdu_time_after_reboot) > int(last_lacppdu_time_before_reboot)" will give exception, since this function is executed in the thread queue, then this thread will killed, then in the following code, when call q.put('quit'), the code is stuck.

Now when RXtime for LACPDU is not available, no operations are done, and next available timestamp is collected. This prevents exception in the thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants