-
Notifications
You must be signed in to change notification settings - Fork 587
[tlm teamd] Add retry mechanism before logging the ERR in get_dumps. #1629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
retest vs please |
1 similar comment
retest vs please |
tlm_teamd/teamdctl_mgr.cpp
Outdated
@@ -136,6 +142,15 @@ bool TeamdCtlMgr::remove_lag(const std::string & lag_name) | |||
{ | |||
SWSS_LOG_WARN("The LAG '%s' hasn't been added. Can't remove it", lag_name.c_str()); | |||
} | |||
|
|||
// If this lag interface errored last time, clear it | |||
if ((lag_name.compare(last_errored_lag_name) == 0) && (no_of_retry != 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move this to get_dump() around line 196?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need this check here in remove_lag()... The lag resource could get cleared from the m_handlers here (teamdctl_mgr.cpp#L133) due to call from in main.cpp#L98 --> update_interfaces() --> remove_lag(). Since it can get cleared here we will not reach teamdctl_mgr.cpp#L196 in the lag resource remove case which is the failure case here.
This is happening as main.cpp#L101 and main.cpp#L114 can happen asynchronously.
tlm_teamd/teamdctl_mgr.cpp
Outdated
|
||
/// Store the last errored lag name and the retry count. | ||
static std::string last_errored_lag_name= std::string(""); | ||
static int no_of_retry = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder if we can move all these static variable into get_dump() function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to access this variables in remove_lag API also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have kept these variables global ( it is not static anymore, in updated code.). It is used in remove_lag() and get_dump() APIs
@lguohan Could you take a look at the changes again - thanks |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
@lguohan. I had made the changes you recommended in L#196 - but I am keeping the logic in removeLag for the reason I put earlier in comment. |
Closing this PR for now, as this error is not seen any more. |
@judyjoseph the issue of sonic-net/sonic-buildimage#6632 can be reproduced without this patch easily but not with it. I strongly suggest to move on with the review and approval flow and get it to 20212 and above. |
9703c99
to
b1284e1
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
b1284e1
to
9426c4b
Compare
b206695
to
2eef7a3
Compare
@judyjoseph kindly reminder. what is the plan to merge and take to 202012 and 202106? |
…ehavior. Instead of storing the lagname in a string, introduced a set [lag_name, retry_count]
…1629) Fix sonic-net/sonic-buildimage#6632 There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message. The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called. 1. When the portchannel object is add/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101] 2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108] In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module. Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received. On a good case ``` Feb 7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0) Feb 7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` On a bad case ``` Feb 7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping Feb 7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0) Feb 7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` **How I did it** Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error. Additionally added a **to_retry** flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not. **How I verified it** Verified that the error message is no more seen in the syslog. Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed). The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here. ``` admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog* syslog.1:Feb 8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.5.gz:Feb 8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it ```
…1629) Fix sonic-net/sonic-buildimage#6632 There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message. The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called. 1. When the portchannel object is add/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101] 2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108] In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module. Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received. On a good case ``` Feb 7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0) Feb 7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` On a bad case ``` Feb 7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping Feb 7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0) Feb 7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` **How I did it** Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error. Additionally added a **to_retry** flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not. **How I verified it** Verified that the error message is no more seen in the syslog. Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed). The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here. ``` admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog* syslog.1:Feb 8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.5.gz:Feb 8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it ```
…onic-net#1629) Fix sonic-net/sonic-buildimage#6632 There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message. The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called. 1. When the portchannel object is add/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101] 2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108] In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module. Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received. On a good case ``` Feb 7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0) Feb 7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` On a bad case ``` Feb 7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping Feb 7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0) Feb 7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` **How I did it** Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error. Additionally added a **to_retry** flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not. **How I verified it** Verified that the error message is no more seen in the syslog. Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed). The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here. ``` admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog* syslog.1:Feb 8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.5.gz:Feb 8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it ```
sonic-net#1629) Signed-off-by: vaibhav-dahiya [email protected] This PR adds support for an option to display firmware version of muxcable of only active banks. The new output would look like this in case an active flag is passed to the command line admin@STR43-0101-0101-01LT0:~$ show muxcable firmware version Ethernet0 --active { "version_self_active": "0.7MS", "version_peer_active": "0.7MS", "version_nic_active": "0.7MS", } What I did added an option to display active banks only for display muxcable firmware version Signed-off-by: vaibhav-dahiya <[email protected]>
Why I did
Fix sonic-net/sonic-buildimage#6632
There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message.
The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called.
In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module.
Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received.
On a good case
On a bad case
How I did it
Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error.
Additionally added a to_retry flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not.
How I verified it
Verified that the error message is no more seen in the syslog.
Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed).
The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here.
Details if related