Skip to content

Commit 7ee8d26

Browse files
committed
[tlm teamd] Add retry mechanism before logging the ERR in get_dumps. (#1629)
Fix sonic-net/sonic-buildimage#6632 There has been cases when the get_dumps API in tlm_teamd process is not able to get the right data and logs an error message. The issue occurs very rarely and it is due to the race condition between teammgrd/teamsyncd/tlm_teamd when a Portchannel is removed. In the teamd telemetry module there are two places where the get_dumps() is called. 1. When the portchannel object is add/removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101] 2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108] In case of timeout call for get_dumps(), there could be an inconsistent state where the portchannel/teamd process is getting removed by teammgrd but the STATE table update to remove the lag interface is still not received by the tlm_teamd module. Seen below on a bad case where the get_dumps() call from TIMEOUT handler throws an ERR message - as the remove_lag message is not yet received. On a good case ``` Feb 7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0) Feb 7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` On a bad case ``` Feb 7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping Feb 7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed. Feb 7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0) Feb 7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999 ``` **How I did it** Add retry mechanism before logging the ERR in get_dumps API(). The number of retries is set as 3. So that if the same error repeats 3 times - it is logged, other wise it is considered a transient condition - not an error. Additionally added a **to_retry** flag to get_dumps() API so that the caller can decide whether to use the retry mechanism or not. **How I verified it** Verified that the error message is no more seen in the syslog. Confirmed by running ~ 200 times portchannel creation (which had reproduced the issue earlier on VS testbed). The new NOTICE message added in remove_lag shows that we had indeed hit the original issue earlier and clearing flags here. ``` admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog* syslog.1:Feb 8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.2.gz:Feb 8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.3.gz:Feb 8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it syslog.5.gz:Feb 8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it ```
1 parent 7f57d3d commit 7ee8d26

File tree

3 files changed

+56
-8
lines changed

3 files changed

+56
-8
lines changed

tlm_teamd/main.cpp

+5-2
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,7 @@ int main()
9898
if (res == swss::Select::OBJECT)
9999
{
100100
update_interfaces(sst_lag, teamdctl_mgr);
101-
values_store.update(teamdctl_mgr.get_dumps());
101+
values_store.update(teamdctl_mgr.get_dumps(false));
102102
}
103103
else if (res == swss::Select::ERROR)
104104
{
@@ -108,7 +108,10 @@ int main()
108108
else if (res == swss::Select::TIMEOUT)
109109
{
110110
teamdctl_mgr.process_add_queue();
111-
values_store.update(teamdctl_mgr.get_dumps());
111+
// In the case of lag removal, there is a scenario where the select::TIMEOUT
112+
// occurs, it triggers get_dumps incorrectly for resource which was in process of
113+
// getting deleted. The fix here is to retry and check if this is a real failure.
114+
values_store.update(teamdctl_mgr.get_dumps(true));
112115
}
113116
else
114117
{

tlm_teamd/teamdctl_mgr.cpp

+47-4
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,8 @@
55

66
#include "teamdctl_mgr.h"
77

8+
#define MAX_RETRY 3
9+
810
///
911
/// Custom function for libteamdctl logger. IT is empty to prevent libteamdctl to spam us with the error messages
1012
/// @param tdc teamdctl descriptor
@@ -136,6 +138,15 @@ bool TeamdCtlMgr::remove_lag(const std::string & lag_name)
136138
{
137139
SWSS_LOG_WARN("The LAG '%s' hasn't been added. Can't remove it", lag_name.c_str());
138140
}
141+
142+
// If this lag interface errored last time, remove the entry.
143+
// This is needed as here in this remove API, we do m_handlers.erase(lag_name).
144+
if (m_lags_err_retry.find(lag_name) != m_lags_err_retry.end())
145+
{
146+
SWSS_LOG_NOTICE("The LAG '%s' had errored while getting dump, removing it", lag_name.c_str());
147+
m_lags_err_retry.erase(lag_name);
148+
}
149+
139150
return true;
140151
}
141152

@@ -163,11 +174,12 @@ void TeamdCtlMgr::process_add_queue()
163174
///
164175
/// Get json dump from teamd for LAG interface with name lag_name
165176
/// @param lag_name a name for LAG interface
177+
/// @param to_retry is the flag used to do retry or not.
166178
/// @return a pair. First element of the pair is true, if the method is successful
167179
/// false otherwise. If the first element is true, the second element has a dump
168180
/// otherwise the second element is an empty string
169181
///
170-
TeamdCtlDump TeamdCtlMgr::get_dump(const std::string & lag_name)
182+
TeamdCtlDump TeamdCtlMgr::get_dump(const std::string & lag_name, bool to_retry)
171183
{
172184
TeamdCtlDump res = { false, "" };
173185
if (has_key(lag_name))
@@ -178,10 +190,41 @@ TeamdCtlDump TeamdCtlMgr::get_dump(const std::string & lag_name)
178190
if (r == 0)
179191
{
180192
res = { true, std::string(dump) };
193+
194+
// If this lag interface errored last time, remove the entry
195+
if (m_lags_err_retry.find(lag_name) != m_lags_err_retry.end())
196+
{
197+
SWSS_LOG_NOTICE("The LAG '%s' had errored in get_dump earlier, removing it", lag_name.c_str());
198+
m_lags_err_retry.erase(lag_name);
199+
}
181200
}
182201
else
183202
{
184-
SWSS_LOG_ERROR("Can't get dump for LAG '%s'. Skipping", lag_name.c_str());
203+
// In case of failure and retry flag is set, check if it fails for MAX_RETRY times.
204+
if (to_retry)
205+
{
206+
if (m_lags_err_retry.find(lag_name) != m_lags_err_retry.end())
207+
{
208+
if (m_lags_err_retry[lag_name] == MAX_RETRY)
209+
{
210+
SWSS_LOG_ERROR("Can't get dump for LAG '%s'. Skipping", lag_name.c_str());
211+
m_lags_err_retry.erase(lag_name);
212+
}
213+
else
214+
m_lags_err_retry[lag_name]++;
215+
}
216+
else
217+
{
218+
219+
// This time a different lag interface errored out.
220+
m_lags_err_retry[lag_name] = 1;
221+
}
222+
}
223+
else
224+
{
225+
// No need to retry if the flag is not set.
226+
SWSS_LOG_ERROR("Can't get dump for LAG '%s'. Skipping", lag_name.c_str());
227+
}
185228
}
186229
}
187230
else
@@ -196,14 +239,14 @@ TeamdCtlDump TeamdCtlMgr::get_dump(const std::string & lag_name)
196239
/// Get dumps for all registered LAG interfaces
197240
/// @return vector of pairs. Each pair first value is a name of LAG, second value is a dump
198241
///
199-
TeamdCtlDumps TeamdCtlMgr::get_dumps()
242+
TeamdCtlDumps TeamdCtlMgr::get_dumps(bool to_retry)
200243
{
201244
TeamdCtlDumps res;
202245

203246
for (const auto & p: m_handlers)
204247
{
205248
const auto & lag_name = p.first;
206-
const auto & result = get_dump(lag_name);
249+
const auto & result = get_dump(lag_name, to_retry);
207250
const auto & status = result.first;
208251
const auto & dump = result.second;
209252
if (status)

tlm_teamd/teamdctl_mgr.h

+4-2
Original file line numberDiff line numberDiff line change
@@ -18,15 +18,17 @@ class TeamdCtlMgr
1818
bool add_lag(const std::string & lag_name);
1919
bool remove_lag(const std::string & lag_name);
2020
void process_add_queue();
21-
TeamdCtlDump get_dump(const std::string & lag_name);
22-
TeamdCtlDumps get_dumps();
21+
// Retry logic added to prevent incorrect error reporting in dump API's
22+
TeamdCtlDump get_dump(const std::string & lag_name, bool to_retry);
23+
TeamdCtlDumps get_dumps(bool to_retry);
2324

2425
private:
2526
bool has_key(const std::string & lag_name) const;
2627
bool try_add_lag(const std::string & lag_name);
2728

2829
std::unordered_map<std::string, struct teamdctl*> m_handlers;
2930
std::unordered_map<std::string, int> m_lags_to_add;
31+
std::unordered_map<std::string, int> m_lags_err_retry;
3032

3133
const int max_attempts_to_add = 10;
3234
};

0 commit comments

Comments
 (0)