[FDB] All MACs are not synced to the kernel in scale scenario #12502
Comments
@kishorekunal01 from BRCM to take a look |
This is not an issue; this is kernel behavior, and reprogramming is done so that the MAC is added back to the kernel as per design. The second scenario where the problem happens is when a MAC ages out in the kernel but not in the switch (if the FDB aging time is increased on the switch). This results in fdbsyncd processing stale MAC notifications and reprogramming them based on the status in STATE_DB. In this scenario it is always observed that exactly 8K MACs are reprogrammed (could this be the size of the netlink buffer queue?) |
When the interface goes down, all entries in FDB_TABLE for that interface should be deleted by the switch. When the interface comes back up (the kernel interface will also come up), MAC learning will happen again in the ASIC due to traffic, and the same entries will be added back to FDB_TABLE. If this is not working, please enable debug with "swssloglevel -l INFO -c fdbsyncd" and provide the tech support. |
@kishorekunal01 The issue is not with reprogramming the MACs but with the consistency of MACs between the ASIC and the kernel. One of the dumps captured has swssloglevel enabled at INFO. As mentioned, there are two scenarios where MACs are not properly reprogrammed in the kernel. I believe fdbsyncd is not robust enough to handle these scenarios. Please let me know if you need more details. |
@dgsudharsan I tried the interface up/down test case on a Broadcom chipset, and I don't see any issue with MAC sync between the ASIC and the kernel. Attaching the tech support in the next comment. |
As I replied earlier, I have enabled debug with "swssloglevel -l INFO -c fdbsyncd" and collected the tech support. Log file attached. |
@kishorekunal01 Thanks. For the port up/down scenario, I did some more analysis and found the root cause to be a SAI notification issue, which I am handling internally. However, I also pointed out the netlink buffer issue, which you can see from the logs: Oct 23 20:24:52.707880 qa-eth-vt03-2-3700v ERR swss#fdbsyncd: :- readData: netlink reports out of memory on reading a netlink socket. High possibility of a lost message. Please check sonic_dump_qa-eth-vt03-2-3700v_20221023_204313.tar attached in the bug. This happens when I have 10K MACs on one port. |
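As a side note, the frequency of this drop condition in a collected dump can be gauged by grepping the syslog for the exact message quoted above; this is only a sketch and assumes the logs are in the standard /var/log/syslog inside the techsupport archive:

```
# Count how often fdbsyncd reported a full netlink receive buffer
grep -c "netlink reports out of memory on reading a netlink socket" /var/log/syslog
```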
@prsunny Can we increase the netlink buffer? It is currently set to 3 MB; can we increase the netlink buffer size to 16 MB?
When there is a dump from the kernel, it is possible that the netlink buffer runs out of memory at 10K MAC scale; hence this error is reported. |
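For context, the limit being discussed can be inspected and raised at runtime for a quick test; this is only a sketch with illustrative values (16 MB = 16777216 bytes), and a runtime change does not persist across reboots:

```
# Current ceiling and default for socket receive buffers (bytes)
sysctl net.core.rmem_max net.core.rmem_default

# Temporarily raise the ceiling to 16 MB so fdbsyncd's netlink socket can request a larger SO_RCVBUF
sudo sysctl -w net.core.rmem_max=16777216
```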
@adyeung Can you please provide an ETA for increasing the netlink buffer size? |
Expecting a fix to be posted by 1/20/23 |
@adyeung @kishorekunal01 Can you please share the fix if it is ready? |
I created the pull request for the fix on 18th January:
sonic-net/sonic-swss-common#739
|
Hi @kishorekunal01, we are still seeing the issue even after the fix. Should some parameters be adjusted? |
I am attaching the techsupport here: sonic_dump_qa-eth-vt03-2-3700v_20230222_185508.tar.gz. You can see there are 10K MACs learnt in the ASIC; however, the kernel shows only 8K MACs. |
On a related note, someone reported the same problem, and increasing the buffer to 16 MB didn't help: https://groups.google.com/g/sonicproject/c/Lc0cs-RzNSE |
@kishorekunal01 Should we increase the netlink memory here? https://github.com/sonic-net/sonic-buildimage/blob/9ff2e2cff38fa71d0e5ce38f92d4339206849a74/files/image_config/sysctl/sysctl-net.conf. Currently it is 3 MB. |
Yes, the net.core.rmem_max setting in the referenced file is the proper place to modify this. Please see my associated comment at sonic-net/sonic-swss-common#739 (comment). Issue #12587 relates to netlink messages originated at the kernel and sent to the application (kernel-to-socket), thus only the socket receive buffer needs increasing in the context of that issue. If the problematic path here is instead from the netlink socket to the kernel (socket-to-kernel) rather than from kernel to socket, then the net.core.wmem_max setting will also need to be increased to allow a larger socket transmit buffer. |
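For reference, a persistent version of that change would live in the sysctl-net.conf file linked above. The snippet below is only a sketch using the 16 MB figure discussed earlier; the exact values, and whether wmem_max needs raising at all, depend on which direction actually overflows:

```
# files/image_config/sysctl/sysctl-net.conf -- illustrative values (16 MB)
# Ceiling for socket receive buffers (kernel-to-socket netlink path)
net.core.rmem_max=16777216
# Ceiling for socket transmit buffers (only needed if the socket-to-kernel path overflows)
net.core.wmem_max=16777216
```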
Description
During scale testing, not all MACs are synced to the kernel. This can be easily reproduced by learning 10K MACs on a port and doing a shutdown/no shutdown. Could this be due to fdbsyncd not checking whether the port is up in the kernel before programming the MAC?
The second scenario where the problem happens is when a MAC ages out in the kernel but not in the switch (if the FDB aging time is increased on the switch). This results in fdbsyncd processing stale MAC notifications and reprogramming them based on the status in STATE_DB. In this scenario it is always observed that exactly 8K MACs are reprogrammed (could this be the size of the netlink buffer queue?)
Both scenarios result in many MACs not being synced to remote VTEPs in EVPN.
Steps to reproduce the issue:
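A minimal sketch of the reproduction and the check, assuming a standard SONiC image; the interface name is illustrative, the ASIC_DB key prefix is assumed to follow the usual SAI FDB naming, and both counts are approximate (bridge fdb show also lists permanent/local entries):

```
# Learn ~10K MACs on the port via traffic, then flap it
config interface shutdown Ethernet0
config interface startup Ethernet0

# Compare the ASIC FDB count with what fdbsyncd managed to program into the kernel
sonic-db-cli ASIC_DB keys "ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY*" | wc -l
bridge fdb show | wc -l
```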
Describe the results you received:
Describe the results you expected:
Output of show version:
Output of show techsupport:
Issue 1 - sonic_dump_qa-eth-vt05-2-2700a1_20221026_070014
Issue 2 - sonic_dump_qa-eth-vt03-2-3700v_20221023_204313
Additional information you deem important (e.g. issue happens only occasionally):
sonic_dump_qa-eth-vt05-2-2700a1_20221026_070014.tar.gz
sonic_dump_qa-eth-vt03-2-3700v_20221023_204313.tar.gz