bgpd: stuck in unresponsive state #18606

Open
2 tasks done
g00g1 opened this issue Apr 8, 2025 · 4 comments
Assignees
Labels
triage Needs further investigation

Comments


g00g1 commented Apr 8, 2025

Description

Honestly, I don't understand exactly what happened; I can only attach the relevant logs: frr-bgpd.txt

Version

FRRouting 10.2.1 (redacted) on Linux(5.14.0-427.22.1.el9_4.x86_64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-static' '--disable-werror' '--enable-multipath=256' '--enable-vtysh' '--enable-ospfclient' '--enable-ospfapi' '--enable-rtadv' '--enable-ldpd' '--enable-pimd' '--enable-pim6d' '--enable-pbrd' '--enable-nhrpd' '--enable-eigrpd' '--enable-babeld' '--enable-vrrpd' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-fpm' '--enable-watchfrr' '--disable-bgp-vnc' '--enable-isisd' '--enable-rpki' '--enable-bfdd' '--enable-pathd' '--disable-grpc' '--enable-snmp' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' 'CC=gcc' 'CXX=g++' 'LT_SYS_LIBRARY_PATH=/usr/lib64:'

How to reproduce

N/A

Expected behavior

watchfrr should issue kill -9 when the unresponsive bgpd process does not exit from kill -15 within the timeout
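
For illustration, the expected escalation (SIGTERM first, SIGKILL once the timeout expires) can be sketched in shell; this is only an illustrative sketch, not watchfrr's actual implementation, and the function name and timeout are made up:

```shell
# Illustrative sketch of the expected escalation: try SIGTERM first,
# then fall back to SIGKILL if the process survives the timeout.
term_then_kill() {
  local pid=$1 timeout=$2
  kill -15 "$pid" 2>/dev/null || return 0   # already gone
  for _ in $(seq "$timeout"); do
    kill -0 "$pid" 2>/dev/null || return 0  # exited after SIGTERM
    sleep 1
  done
  kill -9 "$pid" 2>/dev/null                # still alive: force-kill
}
```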

Actual behavior

bgpd became unresponsive and watchfrr had not killed it for two hours

Additional context

This happened just before bgpd went unresponsive:

Apr 8, 2025 @ 07:19:58.706	[VH6Z7-MNSN0][EC 33554511] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 1 holdtime (9s), peer overloaded?
Apr 8, 2025 @ 07:20:04.448	[VH6Z7-MNSN0][EC 33554511] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 1 holdtime (9s), peer overloaded?
Apr 8, 2025 @ 07:20:07.355	[JQ5A9-TEQYM][EC 33554512] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 2 holdtimes (18s), terminating session

I have resolved my problem by killing bgpd with SIGKILL.

Backtraces that are relevant (happened on systemctl restart frr):

Apr 8, 2025 @ 09:57:43.803  Received signal 11 at 1744106263 (si_addr 0x2d0, PC 0x7f441988b5f2); aborting...
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(zlog_backtrace_sigsafe+0x71) [0x7f4419cceeb1]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(zlog_signal+0xf5) [0x7f4419ccf0b5]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0x109e45) [0x7f4419d09e45]
Apr 8, 2025 @ 09:57:43.804  /lib64/libc.so.6(+0x3e6f0) [0x7f441983e6f0]
Apr 8, 2025 @ 09:57:43.804  /lib64/libc.so.6(+0x8b5f2) [0x7f441988b5f2]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0xb69fd) [0x7f4419cb69fd]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_pthread_stop_all+0x57) [0x7f4419cb5087]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_pthread_finish+0x1d) [0x7f4419cb68dd]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_fini+0x78) [0x7f4419cc5b78]
Apr 8, 2025 @ 09:57:43.805  /usr/lib/frr/bgpd(sigint+0x20b) [0x55cb8d69d09b]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_sigevent_process+0x53) [0x7f4419d08b43]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(event_fetch+0x6b5) [0x7f4419d1cc85]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_run+0xe3) [0x7f4419cc5933]
Apr 8, 2025 @ 09:57:43.805  /usr/lib/frr/bgpd(main+0x3f2) [0x55cb8d6943e2]
Apr 8, 2025 @ 09:57:43.806  /lib64/libc.so.6(+0x29590) [0x7f4419829590]
Apr 8, 2025 @ 09:57:43.806  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f4419829640]
Apr 8, 2025 @ 09:57:43.806  /usr/lib/frr/bgpd(_start+0x25) [0x55cb8d695125]

gdb symbols:

Reading symbols from /usr/lib/debug/usr/lib64/libfrr.so.0.0.0-10.2.1-01.el9.x86_64.debug...
(gdb) info symbol 0x109e45
core_handler + 181 in section .text of /usr/lib64/libfrr.so.0.0.0
(gdb) info symbol 0xb69fd
fpt_halt + 61 in section .text of /usr/lib64/libfrr.so.0.0.0
gdb /lib64/libc.so.6

(gdb) info symbol 0x3e6f0
__restore_rt in section .text of /usr/lib64/libc.so.6
(gdb) info symbol 0x8b5f2
__pthread_clockjoin_ex + 34 in section .text of /usr/lib64/libc.so.6

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@g00g1 g00g1 added the triage Needs further investigation label Apr 8, 2025
Member

ton31337 commented Apr 8, 2025

Most likely, it was stuck on I/O operations in your system. Do you have any traces/logs showing how your system looked at that moment (memory, swap, CPU, I/O utilization)?

Author

g00g1 commented Apr 8, 2025

The nearest metrics I have are from 2025-04-08 07:21:15 UTC+0; nothing unusual there except lower network bandwidth.

However, I do have logs from frr_exporter (https://github.com/tynany/frr_exporter):

time=2025-04-08T07:22:02.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=0.000115161 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:22:02.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=6.6771e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:47.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=9.925e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:47.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=8.7584e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:32.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=0.000103833 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:32.762Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=8.7135e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:17.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=0.000110946 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:17.761Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=6.8547e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:07.766Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=20.004462544 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
time=2025-04-08T07:21:07.766Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=20.004378732 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
time=2025-04-08T07:21:02.762Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=6.5865e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:21:02.762Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=7.9146e-05 err="dial unix /var/run/frr/bgpd.vty: connect: resource temporarily unavailable"
time=2025-04-08T07:20:52.767Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=20.004349834 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
time=2025-04-08T07:20:52.767Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=20.004397137 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
time=2025-04-08T07:20:37.766Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp duration_seconds=20.004252184 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
time=2025-04-08T07:20:37.766Z level=ERROR source=/app/collector/collector.go:124 msg="collector scrape failed" name=bgp6 duration_seconds=20.004293393 err="read unix @->/var/run/frr/bgpd.vty: i/o timeout"
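
These exporter errors can be reproduced by hand with a bounded probe of the vty socket (a hypothetical check along the same lines as frr_exporter's scrape, using the stock vtysh client; the 5-second deadline is arbitrary):

```shell
# Probe bgpd's vty with a hard 5-second deadline; a hang or an
# immediate connect error points at a stuck daemon like the one above.
if timeout 5 vtysh -d bgpd -c 'show bgp summary' >/dev/null 2>&1; then
  echo "bgpd vty responsive"
else
  echo "bgpd vty unresponsive or unreachable"
fi
```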

Member

ton31337 commented Apr 9, 2025

This log is triggered when the system cannot send out the packets due to the receive window size being 0 on the other side.

has not made any SendQ progress
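
To confirm the zero-window theory on a live system, the kernel's view of the session queues can be inspected (a hypothetical spot-check, not part of FRR; port 179 is the standard BGP port):

```shell
# Show TCP internals for BGP sessions; a Send-Q that never drains
# while the peer advertises a zero receive window matches the
# "no SendQ progress" log messages.
ss -tni '( sport = :179 or dport = :179 )'
```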

Could you (if you can replicate it) show the output of strace for the bgpd process?
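
If it reoccurs, something along these lines would capture what bgpd is blocked on (a hypothetical helper, not an FRR tool; the output path and 30-second window are arbitrary, and attaching strace requires root):

```shell
# Attach strace to the running bgpd for a bounded window and save a
# timestamped syscall trace (with per-call durations) for inspection.
capture_bgpd_syscalls() {
  local pid
  pid=$(pgrep -xo bgpd) || { echo "bgpd not running"; return 1; }
  timeout 30 strace -f -tt -T -p "$pid" -o /tmp/bgpd.strace
}
```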

@ton31337 ton31337 self-assigned this Apr 9, 2025
Author

g00g1 commented Apr 9, 2025

I am afraid I can't, because I haven't been able to reproduce this issue so far. Since this happened in a production environment, I had to restart the process manually to restore operability, so a current strace session would not provide any meaningful data, if I understand your request correctly.

If it helps, the faulty session was an NLNOG peer that had only advertised prefixes (IPv4 and IPv6 full views) and received none. There were other sessions as well, but none of them were mentioned in the logs before I restored bgpd operability.
