[tlm teamd] Add retry mechanism before logging the ERR in get_dumps. #1629

judyjoseph · 2021-02-08T07:18:05Z

Why I did
Fix sonic-net/sonic-buildimage#6632

There have been cases where the get_dumps API in the tlm_teamd process is not able to get the right data and logs an error message.

The issue occurs very rarely and is due to a race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where get_dumps() is called:

1. When the portchannel object is added/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101]
2. On a timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108]

When get_dumps() is called on timeout, there can be an inconsistent state in which the portchannel/teamd process is being removed by teammgrd but the STATE table update that removes the LAG interface has not yet been received by the tlm_teamd module.
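The ordering problem can be illustrated with a toy model. This is only a sketch: `RaceModel`, `Event`, and `replay_orderings` are hypothetical names that do not appear in tlm_teamd, and the real loop is driven by a select on database events rather than by direct calls.

```cpp
#include <cassert>

// Toy model only: every name here is hypothetical and none of it is tlm_teamd code.
enum class Event { StateUpdate, Timeout };

struct RaceModel {
    bool teamd_running = true;  // false once teammgrd has stopped the teamd process
    bool lag_tracked = true;    // false once tlm_teamd has processed the STATE table removal
    int errors = 0;             // how many times get_dumps() would have logged ERR

    void get_dumps() {
        // A dump fails only when we still track a LAG whose teamd is already gone.
        if (lag_tracked && !teamd_running) ++errors;
    }

    void on_event(Event e) {
        if (e == Event::StateUpdate) lag_tracked = false;  // remove_lag() path
        get_dumps();                                       // both paths end in get_dumps()
    }
};

// Replays both orderings and returns the number of errors seen in the bad one.
inline int replay_orderings() {
    RaceModel good;
    good.on_event(Event::StateUpdate);  // the removal is seen first
    good.teamd_running = false;
    good.on_event(Event::Timeout);
    assert(good.errors == 0);

    RaceModel bad;
    bad.teamd_running = false;          // teamd is stopped first
    bad.on_event(Event::Timeout);       // the timer fires before the removal arrives
    bad.on_event(Event::StateUpdate);
    return bad.errors;
}
```

In this model the only difference between the two runs is which event tlm_teamd sees first, which is why the error shows up so rarely.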

Below, in the bad case, the get_dumps() call from the TIMEOUT handler logs an ERR message because the remove_lag message has not yet been received.

In a good case

Feb  7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb  7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0)
Feb  7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999

In a bad case

Feb  7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping
Feb  7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb  7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0)
Feb  7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999

How I did it

Add a retry mechanism before logging the ERR in the get_dumps() API. The number of retries is set to 3, so if the same error repeats 3 times it is logged; otherwise it is treated as a transient condition and not an error.

Additionally, added a to_retry flag to the get_dumps() API so that the caller can decide whether to use the retry mechanism.
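A minimal sketch of the retry idea follows. It is not the code from this PR: `DumpFetcher`, `DumpStatus`, `kMaxRetries`, and the injected `fetch` callback are all hypothetical, and only the retry count of 3 and the `to_retry` flag come from the description above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not the tlm_teamd source.
constexpr int kMaxRetries = 3;  // retry budget, 3 as stated in the description

enum class DumpStatus { Ok, Retry, Error };

class DumpFetcher {
public:
    // fetch stands in for the real teamd query and returns true on success.
    explicit DumpFetcher(std::function<bool(const std::string&)> fetch)
        : m_fetch(std::move(fetch)) {
        assert(m_fetch != nullptr);
    }

    // to_retry lets the caller opt in to or out of the retry behaviour.
    DumpStatus get_dump(const std::string& lag_name, bool to_retry) {
        if (m_fetch(lag_name)) {
            m_failures = 0;  // a success clears any earlier failures
            return DumpStatus::Ok;
        }
        if (!to_retry) {
            return DumpStatus::Error;  // the caller wants the error immediately
        }
        if (++m_failures < kMaxRetries) {
            return DumpStatus::Retry;  // treated as transient, nothing is logged
        }
        m_failures = 0;
        return DumpStatus::Error;  // the failure repeated 3 times, so report it
    }

private:
    std::function<bool(const std::string&)> m_fetch;
    int m_failures = 0;
};
```

With a callback that always fails, the first two calls with `to_retry` set return `Retry` and the third returns `Error`, which is the point at which the ERR message would be logged.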

How I verified it
Verified that the error message is no longer seen in the syslog.
Confirmed by running portchannel creation ~200 times (which had reproduced the issue earlier on the VS testbed).

The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and that the flags are cleared here.

admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog*
syslog.1:Feb  8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb  8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb  8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.5.gz:Feb  8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it

Details if related

judyjoseph · 2021-02-08T18:42:31Z

retest vs please

judyjoseph · 2021-02-09T16:23:30Z

retest vs please

lguohan · 2021-02-09T17:01:51Z

tlm_teamd/teamdctl_mgr.cpp

@@ -136,6 +142,15 @@ bool TeamdCtlMgr::remove_lag(const std::string & lag_name)
    {
        SWSS_LOG_WARN("The LAG '%s' hasn't been added. Can't remove it", lag_name.c_str());
    }
+
+    // If this lag interface errored last time, clear it
+    if ((lag_name.compare(last_errored_lag_name) == 0) && (no_of_retry != 0))


Should we move this to get_dump() around line 196?

We need this check here in remove_lag(). The LAG resource can be cleared from m_handlers here (teamdctl_mgr.cpp#L133) through the call chain main.cpp#L98 --> update_interfaces() --> remove_lag(). Once it has been cleared, we never reach teamdctl_mgr.cpp#L196 in the LAG removal case, which is the failure case here.
This happens because main.cpp#L101 and main.cpp#L114 can run asynchronously.
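The argument can be sketched as follows. `last_errored_lag_name` and `no_of_retry` are the names from the diff under review; the `TeamdCtlMgrSketch` struct, the `handlers` set standing in for m_handlers, and `record_dump_failure` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative stand-in for TeamdCtlMgr; not the real class.
struct TeamdCtlMgrSketch {
    std::set<std::string> handlers;     // stands in for m_handlers
    std::string last_errored_lag_name;  // name taken from the diff under review
    int no_of_retry = 0;                // name taken from the diff under review

    void add_lag(const std::string& lag_name) {
        assert(!lag_name.empty());
        handlers.insert(lag_name);
    }

    // Records a failed dump. Returns false if the LAG is no longer tracked,
    // in which case none of the retry bookkeeping below is reached.
    bool record_dump_failure(const std::string& lag_name) {
        if (handlers.count(lag_name) == 0) {
            return false;
        }
        last_errored_lag_name = lag_name;
        ++no_of_retry;
        return true;
    }

    void remove_lag(const std::string& lag_name) {
        handlers.erase(lag_name);
        // Same condition as the reviewed diff: clear state left by an earlier failure.
        if (lag_name == last_errored_lag_name && no_of_retry != 0) {
            last_errored_lag_name.clear();
            no_of_retry = 0;
        }
    }
};
```

Once `remove_lag()` has erased the handler, a later dump attempt returns before touching the counters, so stale retry state can only be cleared inside `remove_lag()` itself.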

lguohan · 2021-02-09T17:03:03Z

tlm_teamd/teamdctl_mgr.cpp

+
+/// Store the last errored lag name and the retry count.
+static std::string last_errored_lag_name= std::string("");
+static int no_of_retry = 0;


I wonder if we can move all these static variables into the get_dump() function?

We need to access these variables in the remove_lag API as well.

I have kept these variables global (they are no longer static in the updated code). They are used in the remove_lag() and get_dump() APIs.

judyjoseph · 2021-02-11T16:41:17Z

@lguohan Could you take a look at the changes again - thanks

judyjoseph · 2021-02-17T00:54:48Z

/AzurePipelines run

azure-pipelines · 2021-02-17T00:55:01Z

Azure Pipelines successfully started running 1 pipeline(s).

judyjoseph · 2021-03-01T15:50:15Z

@lguohan, I have made the changes you recommended at L#196, but I am keeping the logic in remove_lag() for the reason I gave in my earlier comment.
<<<
We need this check here in remove_lag(). The LAG resource can be cleared from the m_handlers data structure here (teamdctl_mgr.cpp#L133) through the call chain main.cpp#L98 --> update_interfaces() --> remove_lag(). Because it can be cleared this way, when get_dumps() is called on timeout (main.cpp#L114) it won't reach teamdctl_mgr.cpp#L196, as m_handlers[lag_name] for that LAG no longer exists.
<<<
Could you take a look again?

judyjoseph · 2021-04-28T06:35:26Z

Closing this PR for now, as this error is not seen any more.

liat-grozovik · 2021-08-10T13:00:52Z

@judyjoseph the issue in sonic-net/sonic-buildimage#6632 can be reproduced easily without this patch, but not with it. I strongly suggest moving on with the review and approval flow and getting it into 202012 and above.
We were able to reproduce it with a simple script that loops over adding a bulk of multiple port channels and then removing them.

judyjoseph · 2021-08-18T20:00:37Z

@prsunny @lguohan could you review this PR again. This is an old PR which we closed earlier as it was not reproducible.
The test failures are unrelated, will trigger a re-run.

judyjoseph · 2021-08-18T20:00:48Z

/azp run

azure-pipelines · 2021-08-18T20:00:58Z

Azure Pipelines successfully started running 1 pipeline(s).

judyjoseph · 2021-08-19T20:11:36Z

/azp run

azure-pipelines · 2021-08-19T20:11:45Z

Azure Pipelines successfully started running 1 pipeline(s).

liat-grozovik · 2021-09-12T11:28:12Z

@judyjoseph kind reminder: what is the plan to merge this and take it to 202012 and 202106?

judyjoseph requested a review from lguohan February 8, 2021 07:25

lguohan reviewed Feb 9, 2021

lguohan reviewed Feb 9, 2021

judyjoseph requested a review from lguohan February 10, 2021 17:10

judyjoseph changed the title ~~Add retry mechanism before logging the ERR in get_dumps API()~~ [tlm teamd] Add retry mechanism before logging the ERR in get_dumps API() Feb 12, 2021

judyjoseph changed the title ~~[tlm teamd] Add retry mechanism before logging the ERR in get_dumps API()~~ [tlm teamd] Add retry mechanism before logging the ERR in get_dumps. Feb 17, 2021

judyjoseph closed this Apr 28, 2021

judyjoseph deleted the tlm_teamd_err branch June 3, 2021 05:45

judyjoseph restored the tlm_teamd_err branch August 5, 2021 19:20

judyjoseph reopened this Aug 5, 2021

liat-grozovik added the Bug 🐛 label Aug 10, 2021

judyjoseph requested a review from prsunny as a code owner August 11, 2021 02:11

judyjoseph force-pushed the tlm_teamd_err branch from 9703c99 to b1284e1 Compare August 11, 2021 02:14

judyjoseph force-pushed the tlm_teamd_err branch from b1284e1 to 9426c4b Compare August 20, 2021 21:12

judyjoseph added 2 commits August 30, 2021 22:50

Add retry mechanishm before logging the ERR in get_dumps API()

ec98450

Fix alignment

e3a1b89

judyjoseph and others added 2 commits August 30, 2021 22:50

Additional check in get_dump to clear the no_retry count

a1c7c22

Updates to comments and keep the global variables as not static.

2eef7a3

judyjoseph force-pushed the tlm_teamd_err branch from b206695 to 2eef7a3 Compare August 31, 2021 05:55

liat-grozovik added Request for 202012 Branch Request for 202106 Branch labels Sep 12, 2021

lguohan reviewed Sep 14, 2021

Fix comments and take care of case when more portchannels exhibit ERR behavior. Instead of storing the lagname in a string, introduced a set [lag_name, retry_count]

8aacabc
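One way to picture per-LAG retry tracking is a table keyed by LAG name. The sketch below is an assumption about the shape of such a structure, not the merged code: `RetryTable`, `note_failure`, `clear`, and `tracked` are hypothetical names, and an `std::unordered_map` is chosen here only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical per-LAG retry bookkeeping; not taken from the merged commit.
class RetryTable {
public:
    explicit RetryTable(int max_retries) : m_max(max_retries) {
        assert(max_retries > 0);
    }

    // Counts one failure for lag_name. Returns true when the failure has
    // repeated max_retries times and should now be logged as an error.
    bool note_failure(const std::string& lag_name) {
        int& count = m_counts[lag_name];  // a new entry starts at 0
        if (++count < m_max) {
            return false;
        }
        m_counts.erase(lag_name);  // start from scratch after reporting
        return true;
    }

    // Drops any pending count, for example when the LAG is removed.
    void clear(const std::string& lag_name) { m_counts.erase(lag_name); }

    std::size_t tracked() const { return m_counts.size(); }

private:
    int m_max;
    std::unordered_map<std::string, int> m_counts;
};
```

Keeping one counter per name means a failure on a second portchannel does not overwrite the state of the first, which a single shared string and counter cannot guarantee.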

lguohan approved these changes Sep 17, 2021

lguohan merged commit 002bb1d into sonic-net:master Sep 17, 2021

qiluo-msft added the Included in 202012 Branch label Sep 17, 2021

judyjoseph added the Included in 202106 Branch label Sep 27, 2021

vivekrnv mentioned this pull request Jan 22, 2022

[teamd] config portchannel del is resulting in an extra keyspace notification sonic-net/sonic-buildimage#9831

Open
