Skip to content

Cached counter file is not cleared during config reload, causing portstat give wrong counter values #9817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stephenxs opened this issue Jan 20, 2022 · 4 comments · Fixed by sonic-net/sonic-utilities#2232 or #11244
Assignees
Labels
Help Wanted 🆘 Triaged this issue has been triaged

Comments

@stephenxs
Copy link
Collaborator

Description

Steps to reproduce the issue:

  1. send some traffic to generate a non-zero counter value. let's say the amount of traffic is A.
    The counter content is:
root@arc-mtbc-1001:/tmp/portstat-0# portstat 
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
-----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
  Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
  Ethernet4        U      159    0.83 B/s      0.00%         0         0         0    1,432   14.85 B/s      0.00%         0         0         0
  Ethernet8        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.85 B/s      0.00%         0         0         0
 Ethernet12        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.90 B/s      0.00%         0         0         0
 Ethernet16        U      159    0.83 B/s      0.00%         0         0         0    1,431   14.90 B/s      0.00%         0         0         0
 Ethernet20        U      159    0.83 B/s      0.00%         0         0         0    1,431   14.90 B/s      0.00%         0         0         0
 Ethernet24        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.90 B/s      0.00%         0         0         0
 Ethernet28        U      159    0.83 B/s      0.00%         0         0         0    1,434   14.90 B/s      0.00%         0         0         0
  1. portstat -c to clear counter
  2. config reload -y
  3. send some traffic to generate a non-zero counter value. let's say the amount of traffic is B. B should be less than A
  4. portstat
    The content of admin user is like this (which imo is the correct value since we didn't clear counter on admin)
admin@arc-mtbc-1001:~$ portstat 
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
-----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
  Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
  Ethernet4        U        3    0.00 B/s      0.00%         0         0         0       29    3.53 B/s      0.00%         0         0         0
  Ethernet8        U        3    0.00 B/s      0.00%         0         0         0       29    3.53 B/s      0.00%         0         0         0
 Ethernet12        U        3    0.00 B/s      0.00%         0         0         0       29    4.31 B/s      0.00%         0         0         0
 Ethernet16        U        3    0.00 B/s      0.00%         0         0         0       29    4.32 B/s      0.00%         0         0         0
 Ethernet20        U        3    0.00 B/s      0.00%         0         0         0       29    3.54 B/s      0.00%         0         0         0
 Ethernet24        U        3    0.00 B/s      0.00%         0         0         0       29    9.56 B/s      0.00%         0         0         0

However, for root user the output is:

root@arc-mtbc-1001:/home/admin# portstat 
Last cached time was 2022-01-20 14:44:31.430297
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
-----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
  Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
  Ethernet4        U        0    0.00 B/s      0.00%         0         0         0        0   14.16 B/s      0.00%         0         0         0
  Ethernet8        U        0    0.00 B/s      0.00%         0         0         0        0   14.16 B/s      0.00%         0         0         0
 Ethernet12        U        0    0.00 B/s      0.00%         0         0         0        0   17.31 B/s      0.00%         0         0         0
 Ethernet16        U        0    0.00 B/s      0.00%         0         0         0        0   17.34 B/s      0.00%         0         0         0
 Ethernet20        U        0    0.00 B/s      0.00%         0         0         0        0   14.19 B/s      0.00%         0         0         0
 Ethernet24        U        0    0.00 B/s      0.00%         0         0         0        0   38.34 B/s      0.00%         0         0         0

We observe the issue on 202012 but I think it should be reproduced on other branches as well.

Describe the results you received:

In portstat in step 5, all counter contains 0, which is wrong since traffic has been sent.

Describe the results you expected:

In portstat in step 5, counters should contain the correct value.

Output of show version:

root@arc-mtbc-1001:/home/admin# show version

SONiC Software Version: SONiC.202012.208-8e924b9a7_Internal
Distribution: Debian 10.11
Kernel: 4.19.0-12-2-amd64
Build commit: 8e924b9a7
Build date: Wed Jan 19 10:16:34 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci02-244

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1848K10624
Uptime: 15:17:27 up  6:51,  1 user,  load average: 2.41, 3.80, 2.68

Docker images:
REPOSITORY                    TAG                                     IMAGE ID            SIZE
docker-syncd-mlnx             202012.208-8e924b9a7_Internal           91e0857fe0eb        970MB
docker-syncd-mlnx             latest                                  91e0857fe0eb        970MB
docker-platform-monitor       202012.208-8e924b9a7_Internal           bf71face5dd7        685MB
docker-platform-monitor       latest                                  bf71face5dd7        685MB
docker-snmp                   202012.208-8e924b9a7_Internal           c56ccfac940b        427MB
docker-snmp                   latest                                  c56ccfac940b        427MB
docker-teamd                  202012.208-8e924b9a7_Internal           df70c714874b        395MB
docker-teamd                  latest                                  df70c714874b        395MB
docker-wjh                    202012.202012.0-dirty-20211230.065251   592ae65e7d63        520MB
docker-wjh                    latest                                  592ae65e7d63        520MB
docker-sflow                  202012.208-8e924b9a7_Internal           69fb819114b5        391MB
docker-sflow                  latest                                  69fb819114b5        391MB
docker-router-advertiser      202012.208-8e924b9a7_Internal           7c14699b69ca        380MB
docker-router-advertiser      latest                                  7c14699b69ca        380MB
docker-nat                    202012.208-8e924b9a7_Internal           d5d60742e3ce        393MB
docker-nat                    latest                                  d5d60742e3ce        393MB
docker-lldp                   202012.208-8e924b9a7_Internal           4086facc8fc3        420MB
docker-lldp                   latest                                  4086facc8fc3        420MB
docker-database               202012.208-8e924b9a7_Internal           eb326db92704        380MB
docker-database               latest                                  eb326db92704        380MB
docker-sonic-mgmt-framework   202012.208-8e924b9a7_Internal           ad183bf610a8        793MB
docker-sonic-mgmt-framework   latest                                  ad183bf610a8        793MB
docker-orchagent              202012.208-8e924b9a7_Internal           9fa6d1f72aa2        409MB
docker-orchagent              latest                                  9fa6d1f72aa2        409MB
docker-sonic-telemetry        202012.208-8e924b9a7_Internal           e03068e245e7        469MB
docker-sonic-telemetry        latest                                  e03068e245e7        469MB
docker-mux                    202012.208-8e924b9a7_Internal           7dcd0b71d7dd        432MB
docker-mux                    latest                                  7dcd0b71d7dd        432MB
docker-fpm-frr                202012.208-8e924b9a7_Internal           c280b5311557        409MB
docker-fpm-frr                latest                                  c280b5311557        409MB
docker-dhcp-relay             202012.208-8e924b9a7_Internal           edcf26f69c29        393MB
docker-dhcp-relay             latest                                  edcf26f69c29        393MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

In SONiC, the utilities like "portstat" show counter on a per-user basis especially after a user clears the counters.

It works like this: if a user clears counters, it caches the current values fetched from COUNTER_DB in the /tmp/ folder. Each time it is executed, it fetches the counters from COUNTER_DB, and then compares them with the cache, and then displays the difference.
Eg. the cached counter is 10, and the counter in COUNTER_DB is 15, it will display 5 which is 15 - 10.
However, when a user executes "config reload", the COUNTER_DB will be zeroed, which means the cached counter will be greater than the one fetched from COUNTER_DB. After that, the utilities can give wrong data.

sonic_dump_arc-mtbc-1001_20220120_151843.tar.gz

@stephenxs
Copy link
Collaborator Author

This kind of cached files should be cleared during config reload.

@zhangyanzhao zhangyanzhao added Help Wanted 🆘 Triaged this issue has been triaged labels Feb 2, 2022
@nazariig
Copy link
Collaborator

nazariig commented Mar 9, 2022

The bug is also valid for ACL/PBH counters:

  1. /tmp/.pbh_counters.txt
  2. /tmp/.counters_acl.p

@stepanblyschak
Copy link
Collaborator

I see that for some counters (e.g dropstat) the cache file is cleared on config reload, code.
This has to be done for every counter cache file, however there is another issue, e.g swss restart due to a crash, counters cache isn't cleared.

To solve this I propose:

  1. Have a single cache registry/directory where every application has to put their cache files. To do this, in utilities_common we add a UserCache class that provides a per-application, per-user directory in a global cache directory (e.g /tmp/cache/portstat/1000/, where 1000 is an UID of a user). Every stats application uses this API.
  2. swss.sh script flushing COUNTERS DB and after that also clears the entire /tmp/cache/. This should solve swss restarts as well.

@stepanblyschak stepanblyschak self-assigned this Jun 8, 2022
@nazariig
Copy link
Collaborator

nazariig commented Jun 8, 2022

@stepanblyschak how about /var/cache/sonic/ path?

liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this issue Jul 27, 2022
- What I did
To fix sonic-net/sonic-buildimage#9817. Cache all counters in the same place. Created a UserCache helper class to access the cache directory.

- How I did it
Implemented UserCache class. Changed all stats commands to use new class. Adopted fast-reboot script.

- How to verify it
Run on the switch and verify counters stats command and clear commands work correctly. After config reload or cold/fast reboot counters cache is removed.

Signed-off-by: Stepan Blyschak <[email protected]>
liat-grozovik pushed a commit that referenced this issue Jul 28, 2022
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with sonic-net/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <[email protected]>
stepanblyschak added a commit to stepanblyschak/sonic-utilities that referenced this issue Jul 29, 2022
- What I did
To fix sonic-net/sonic-buildimage#9817. Cache all counters in the same place. Created a UserCache helper class to access the cache directory.

- How I did it
Implemented UserCache class. Changed all stats commands to use new class. Adopted fast-reboot script.

- How to verify it
Run on the switch and verify counters stats command and clear commands work correctly. After config reload or cold/fast reboot counters cache is removed.

Signed-off-by: Stepan Blyschak <[email protected]>
yxieca pushed a commit to sonic-net/sonic-utilities that referenced this issue Jul 29, 2022
- What I did
To fix sonic-net/sonic-buildimage#9817. Cache all counters in the same place. Created a UserCache helper class to access the cache directory.

- How I did it
Implemented UserCache class. Changed all stats commands to use new class. Adopted fast-reboot script.

- How to verify it
Run on the switch and verify counters stats command and clear commands work correctly. After config reload or cold/fast reboot counters cache is removed.

Signed-off-by: Stepan Blyschak <[email protected]>
yxieca pushed a commit that referenced this issue Aug 8, 2022
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with sonic-net/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <[email protected]>
skbarista pushed a commit to skbarista/sonic-buildimage that referenced this issue Aug 17, 2022
…-net#11244)

A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix sonic-net#9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with sonic-net/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <[email protected]>
preetham-singh pushed a commit to preetham-singh/sonic-utilities that referenced this issue Nov 21, 2022
- What I did
To fix sonic-net/sonic-buildimage#9817. Cache all counters in the same place. Created a UserCache helper class to access the cache directory.

- How I did it
Implemented UserCache class. Changed all stats commands to use new class. Adopted fast-reboot script.

- How to verify it
Run on the switch and verify counters stats command and clear commands work correctly. After config reload or cold/fast reboot counters cache is removed.

Signed-off-by: Stepan Blyschak <[email protected]>
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this issue Aug 3, 2023
- What I did
To fix sonic-net/sonic-buildimage#9817. Cache all counters in the same place. Created a UserCache helper class to access the cache directory.

- How I did it
Implemented UserCache class. Changed all stats commands to use new class. Adopted fast-reboot script.

- How to verify it
Run on the switch and verify counters stats command and clear commands work correctly. After config reload or cold/fast reboot counters cache is removed.

Signed-off-by: Stepan Blyschak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Help Wanted 🆘 Triaged this issue has been triaged
Projects
None yet
4 participants