[tlm teamd] Add retry mechanism before logging the ERR in get_dumps. (#1629)
Fix sonic-net/sonic-buildimage#6632
There have been cases where the get_dumps API in the tlm_teamd process is unable to get the right data and logs an error message.
The issue occurs very rarely and is caused by a race condition between teammgrd, teamsyncd, and tlm_teamd when a portchannel is removed. The teamd telemetry module calls get_dumps() in two places:
1. When a portchannel object is added or removed. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L101]
2. On timeout of 1 sec. [https://github.com/Azure/sonic-swss/blob/master/tlm_teamd/main.cpp#L108]
In the timeout case, get_dumps() can run while the state is inconsistent: teammgrd is already removing the portchannel/teamd process, but tlm_teamd has not yet received the STATE table update that removes the LAG interface.
The logs below show a good case and a bad case. In the bad case, the get_dumps() call from the timeout handler logs an ERR message because the remove_lag message has not yet been received.
In a good case:
```
Feb 7 02:03:27.576078 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb 7 02:03:28.453829 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:28,451 INFO reaped unknown pid 4747 (exit status 0)
Feb 7 02:03:28.458616 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999
```
In a bad case:
```
Feb 7 02:03:33.037401 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping
Feb 7 02:03:33.046179 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' has been removed.
Feb 7 02:03:33.997639 vlab-01 INFO teamd#supervisord 2021-02-07 02:03:33,996 INFO reaped unknown pid 4775 (exit status 0)
Feb 7 02:03:34.040126 vlab-01 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel999
```
**How I did it**
Add a retry mechanism before logging the ERR in the get_dumps() API. The number of retries is set to 3: if the same error repeats 3 times, it is logged; otherwise it is treated as a transient condition, not an error.
Also add a **to_retry** flag to the get_dumps() API so that the caller can decide whether to use the retry mechanism.
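The retry idea can be sketched roughly as follows. This is a minimal illustration and not the code in this change: the `DumpRetryTracker` class and its `on_failure`, `on_success`, and `on_remove` methods are hypothetical names, and it assumes that failures are counted per LAG, that a successful dump resets the count, and that removing the LAG clears any pending count.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper, not the actual tlm_teamd code: tracks consecutive
// dump failures per LAG so that transient failures are not logged as ERR.
class DumpRetryTracker
{
public:
    explicit DumpRetryTracker(int max_retries = 3) : m_max_retries(max_retries)
    {
        assert(max_retries > 0);
    }

    // Record a failed dump for `lag`. Returns true if the caller should log ERR.
    // With to_retry == false the failure is reported immediately.
    bool on_failure(const std::string& lag, bool to_retry)
    {
        if (!to_retry)
        {
            return true;
        }
        int& count = m_failures[lag];
        ++count;
        return count >= m_max_retries;
    }

    // A successful dump ends the failure streak (assumption).
    void on_success(const std::string& lag)
    {
        m_failures.erase(lag);
    }

    // Called when the LAG is removed. Returns true if failures were pending,
    // which is the point where a NOTICE about clearing them could be logged.
    bool on_remove(const std::string& lag)
    {
        return m_failures.erase(lag) > 0;
    }

private:
    int m_max_retries;
    std::unordered_map<std::string, int> m_failures;
};
```

In this sketch a caller would pass `to_retry = true` where a transient failure is expected (for example, from a periodic poll) and log ERR only when `on_failure` returns true.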
**How I verified it**
Verified that the error message no longer appears in the syslog.
Confirmed by running portchannel creation about 200 times (which had reproduced the issue earlier on the VS testbed).
The new NOTICE message added in remove_lag shows that the original condition was indeed hit during these runs and that the flags were cleared when the LAG was removed.
```
admin@vlab-01:/var/log$ sudo zgrep -i "get dump for LAG" syslog*; sudo zgrep -i "clearing it" syslog*
syslog.1:Feb 8 06:41:54.995716 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb 8 06:31:32.360135 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb 8 06:36:16.617283 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.2.gz:Feb 8 06:37:56.906306 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb 8 06:25:44.442474 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb 8 06:27:02.539413 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb 8 06:27:42.785533 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.3.gz:Feb 8 06:29:33.510933 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
syslog.5.gz:Feb 8 06:08:03.643106 vlab-01 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel999' had errored while getting dump, clearing it
```