Skip to content

SWSS logs lower than current log level still count towards rate-limiting threshhold. #5667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
theasianpianist opened this issue Oct 19, 2020 · 2 comments
Assignees
Labels
Master Branch Quality P0 Priority of the issue

Comments

@theasianpianist
Copy link
Contributor

theasianpianist commented Oct 19, 2020

Description

Revert #5666 once a permanent fix is implemented.

Syslog messages from orchagent will sometimes be rate-limited for no apparent reason. The rate-limiting threshhold is set at 20k messages sent in 5 minutes, however syslog will sometimes start rate-limiting messages from orchagent when fewer than 20k messages have been logged in the past 5 minutes.

I believe this occurs because lower priority messages which are not written to syslog are still being counted towards the rate-limit thresshold.
Steps to reproduce the issue:

  1. Ensure that the orchagent loglevel is set to NOTICE
  2. Run a sonic-mgmt test module that generates lots of orchagent activity, such as the ACL test suite

Describe the results you received:
In the syslog, there are messages indicating that rate-limiting is occurring:
Oct 19 12:19:08.599970 str2-7050cx3-acs-03 INFO swss#rsyslogd: imuxsock[pid: 109, name: /usr/bin/orchagent] from <str2-7050cx3-acs-03:orchagent>: begin to drop messages due to rate-limiting.

However, there are fewer than 20k messages sent in the five minutes before rate-limiting began.

If the orchagent log level is lowered to INFO and the test is repeated, then enough messages are written to the syslog to trigger rate-limiting.

Describe the results you expected:
No rate-limiting should occur

Additional information you deem important (e.g. issue happens only occasionally):

SONiC Software Version: SONiC.master.447-bef76b10
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: bef76b10
Build date: Sat Oct 17 08:37:50 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-arista_7050cx3_32s
HwSKU: Arista-7050CX3-32S-C32
ASIC: broadcom
Serial Number: JPE20181528
Uptime: 19:11:38 up  2:08,  1 user,  load average: 1.19, 1.43, 1.26

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-teamd                  latest                0c741aa9b72a        396MB
docker-teamd                  master.447-bef76b10   0c741aa9b72a        396MB
docker-sonic-mgmt-framework   latest                3ff9a035b403        486MB
docker-sonic-mgmt-framework   master.447-bef76b10   3ff9a035b403        486MB
docker-router-advertiser      latest                2ce6eb6aeab1        359MB
docker-router-advertiser      master.447-bef76b10   2ce6eb6aeab1        359MB
docker-platform-monitor       latest                7896df88ed11        443MB
docker-platform-monitor       master.447-bef76b10   7896df88ed11        443MB
docker-lldp                   latest                098d645602e0        391MB
docker-lldp                   master.447-bef76b10   098d645602e0        391MB
docker-dhcp-relay             latest                1112421cadb3        366MB
docker-dhcp-relay             master.447-bef76b10   1112421cadb3        366MB
docker-database               latest                4bdc434b9c1b        359MB
docker-database               master.447-bef76b10   4bdc434b9c1b        359MB
docker-orchagent              latest                76f804dff230        410MB
docker-orchagent              master.447-bef76b10   76f804dff230        410MB
docker-nat                    latest                07a9c7ed02c3        399MB
docker-nat                    master.447-bef76b10   07a9c7ed02c3        399MB
docker-sonic-telemetry        latest                700be2fb0fde        429MB
docker-sonic-telemetry        master.447-bef76b10   700be2fb0fde        429MB
docker-fpm-frr                latest                32f730963804        413MB
docker-fpm-frr                master.447-bef76b10   32f730963804        413MB
docker-sflow                  latest                c6569cf0ec41        400MB
docker-sflow                  master.447-bef76b10   c6569cf0ec41        400MB
docker-snmp                   latest                197fb7f650c3        399MB
docker-snmp                   master.447-bef76b10   197fb7f650c3        399MB
docker-syncd-brcm             latest                f0029106f338        542MB
docker-syncd-brcm             master.447-bef76b10   f0029106f338        542MB
@theasianpianist
Copy link
Contributor Author

Temporary fix in #5666 but still need to RCA this to figure out why the issue is occurring in the first place

@theasianpianist
Copy link
Contributor Author

Turns out this is a non-issue. #5666 should be reverted. It only appears that there aren't enough messages to justify rate limiting because of the way rsyslog handles repeated messages. Repeated messages are usually handled as follows:

  1. Some message A is sent to rsyslog once.
  2. Message A is written to the log file.
    • The number of messages that count towards the rate-limit threshhold is 1.
  3. n copies of message A are sent to rsyslog. rsyslog does not write these copies to the log file but instead records the message and counts the number of duplicates.
    • The number of messages that count towards the rate-limit threshhold is n + 1. (Note that this increase happens gradually, as each duplicate is received.)
  4. Some message B (that is different from A) is sent to rsyslog.
  5. rsyslog will write a message repeated n times log to the log file (indicating the number of duplicates of message A received), and will now count all duplicates of message A as only a single message.
    • The number of messages that count towards the rate-limit threshhold is 2.
  6. Then rsyslog will write B to the log file.
    • The number of messages that count towards the rate-limit threshhold is 3.

The log file will look like this:

A
message repeated n times: [ A]
B

However, if in step 3 the rate-limit threshhold is exceeded, rate-limiting is triggered and all duplicate messages are dropped. If this happens, no message repeated n times log will be generated. This can give the appearance that rate-limiting has been triggered without a sufficient number of messages. In reality, the messages that crossed the rate-limit threshhold are dropped as a result of the rate limiting.

In this case, the log file will look like this:

A
begin to drop messages due to rate-limiting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Master Branch Quality P0 Priority of the issue
Projects
None yet
Development

No branches or pull requests

2 participants