Description
Description
[Topo summary]:
T0 with 20k ipv4 routes, each route with 2 ecmp members.
Steps to reproduce the issue
-
enable warm
sudo config warm_restart enable system -
warm restart swss/system
sudo systemctl restart warm -
WARMBOOT_FINALIZER found bgp did not change state to RECONCILED
NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: bgp ...
NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...
Describe the results you received:
bgp did not finish reconcile.
Also, other apps(e.g. lldp-syncd) will throw an error whenever they try to access db but db is busy running 'table_dump.lua'.
Describe the results you expected:
All components should change the status to reconciled.
Additional information you deem important (e.g. issue happens only occasionally):
debug
- Via the syslog, we found that fpmsyncd throw a exception due to redis is busy running a script, when fpmsyncd try to set RECONCILED to state db.
Jun 11 17:43:07.204680 tau1t8-01 NOTICE syncd#syncd: :- dump: getting took 1.139821 sec
Jun 11 17:43:07.797314 tau1t8-01 NOTICE syncd#syncd: :- redisGetAsicView: objects count for ASIC_STATE: 22697
Jun 11 17:43:07.818872 tau1t8-01 NOTICE syncd#syncd: :- redisGetAsicView: get asic view from ASIC_STATE took 1.916766 sec
Jun 11 17:43:11.644892 tau1t8-01 INFO syncd#supervisord: syncd NPS#
Jun 11 17:43:11.645907 tau1t8-01 INFO syncd#supervisord: syncd #015
Jun 11 17:43:11.645995 tau1t8-01 INFO syncd#supervisord: syncd unit 0, temperature 70.175501 degree centigrade#015
Jun 11 17:43:11.788528 sonic NOTICE bgp#fpmsyncd: :- main: Warm-Restart timer expired.
Jun 11 17:43:11.788528 sonic NOTICE bgp#fpmsyncd: :- reconcile: Warm-Restart: Initiating reconciliation process for bgp application.
Jun 11 17:43:14.099929 sonic INFO database#supervisord: redis-server 32:M 11 Jun 09:43:14.097 # Lua slow script detected: still in execution after 6274 milliseconds. You can try kii
lling the script using the SCRIPT KILL command.
Jun 11 17:43:14.135476 sonic ERR bgp#fpmsyncd: :- checkReplyType: Expected to get redis type 3 got type 6, err: BUSY Redis is busy running a script. You can only call SCRIPT KILL oo
r SHUTDOWN NOSAVE.
Jun 11 17:43:14.136122 sonic INFO bgp#supervisord: fpmsyncd Exception "Wrong expected type of result: Input/output error" had been thrown in deamon
Jun 11 17:43:15.471834 sonic INFO bgp#supervisord 2019-06-11 09:43:14,138 INFO exited: fpmsyncd (exit status 0; expected)
Jun 11 17:43:15.959560 tau1t8-01 NOTICE syncd#syncd: :- dump: getting took 8.131201 sec
Jun 11 17:43:16.574011 tau1t8-01 NOTICE syncd#syncd: :- redisGetAsicView: objects count for TEMP_ASIC_STATE: 22691
Jun 11 17:43:16.598789 tau1t8-01 NOTICE syncd#syncd: :- redisGetAsicView: get asic view from TEMP_ASIC_STATE took 8.775713 sec
Jun 11 17:43:16.607765 tau1t8-01 NOTICE syncd#syncd: :- matchOids: matched oids
- we do not clearly known why redis dump TEMP_ASIC_STATE(8.13s) 7 times slower than do the same action with ASIC_STATE(1.14s), while objects of the two table are almost the same.
test more with 22694 objects in ASIC_STATE table, 22559 objects in TEMP_ASIC_STATE table
- measure time of "eval" the table_dump.lua
root@sonic:/# time redis-cli -n 1 --eval table_dump.lua ASIC_STATE >asic_state.dump
real 0m0.625s
user 0m0.008s
sys 0m0.016s
root@sonic:/# time redis-cli -n 1 --eval table_dump.lua TEMP_ASIC_STATE >asic_state.dump
real 0m7.065s
user 0m0.008s
sys 0m0.048s
- What table_dump.lua do is:
- use "keys" to get all keys match pattern
- use a for loop to hgetall fvs of each key.
- measure the time of "keys" with lua
root@sonic:/# time redis-cli -n 1 eval 'redis.call("keys", KEYS[1].."*")' 1 TEMP_ASIC_STATE >asic_state.dump
real 0m6.544s
user 0m0.004s
sys 0m0.020s
root@sonic:/# time redis-cli -n 1 eval 'redis.call("keys", KEYS[1].."*")' 1 ASIC_STATE >asic_state.dump
real 0m0.618s
user 0m0.016s
sys 0m0.012s
- measure the time of "keys" without lua
root@sonic:/# time redis-cli -n 1 keys TEMP_ASIC_STATE:* >asic_state.dump
real 0m0.117s
user 0m0.016s
sys 0m0.040s
root@sonic:/# time redis-cli -n 1 keys ASIC_STATE:* >asic_state.dump
real 0m0.120s
user 0m0.024s
sys 0m0.040s
The test result show us
- command "keys" with lua cost a great deal of time, more than command "hgetall"
- redis command with lua script slower than directly redis command.
KEYS ASIC_STATE: the fastest method
KEYS TEMP_ASIC_STATE: nearly the same as KEYS ASIC_STATE
KEYS ASIC_STATE with lua:~9 times slower than KEYS ASIC_STATE
KEYS TEMP_ASIC_STATE with lua: the slowest method, ~65 times slower than KEYS ASIC_STATE
It seems like we should improve the dump method of table for prevent blocking redis too long.
Or split each database to individual redis instances(processes).
Workaround
Currently, we can set lua-time-limit larger enough to prevent redis from sending reply BUSY
to other client requests.
runtime: redis-cli -n 0 config set lua-time-limit 10000
redis.conf: lua-time-limit 10000