SNMP trigger fails on peer state going to Established first time #5963

Try75 · 2020-03-11T00:15:34Z

In 6.0.3, with SNMP enabled for zebra and bgpd, establishing a connection with a peer would trigger an SNMP event and our trap-handler script would insert some routes. When the peer connection was no longer Established, an SNMP trigger would fire and the trap-handler script would remove the routes.

In 7.3, the SNMP trigger for bgpEstablished does not always fire, so our trap-handler script is not called and the routes are not inserted as they were in 6.0.3. The SNMP trigger for bgpBackwardTransition does not always seem to fire either, unlike in 6.0.3. I think something changed in the logic around peer state "transition", but I am not sure. The problem might have surfaced around commit 7d8d0ea.

If I run "service frr restart" on the peer (also FRR 7.3), I see a log message from the trap-handler script for the Established connection only, not one for the connection going away. If I run "service frr stop; sleep 5; service frr start" on the peer, I see only a message saying that the connection has gone away, not one for the re-established connection. In 6.0.3, either of these commands triggered both the bgpEstablished and bgpBackwardTransition notifications, and both showed up in the trap-handler script's log.

[x] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?

Steps to reproduce the behavior:

Set up two peering bgp routers with SNMP enabled (snmpd, snmptrapd, bgpd -Msnmp)
Configure a traphandler script for the two SNMP triggers
From the peer of the SNMP-enabled bgp router, establish a session and bring it down in any number of ways
Observe from the output (or lack of output) of the trap-handler script that it was not triggered every time it should have been triggered according to the state of the peer
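
To make the last step less manual, the handler's log can be tallied per peer. This is a minimal sketch, assuming the log line layout written by the traphandler.sh script included further down in this report. `count_states` is a hypothetical helper name, not part of FRR or Net-SNMP.

```shell
#!/bin/bash
# count_states LOGFILE PEER
# Prints "<established> <other>" for the log lines that mention PEER.
# Assumes each line contains "Peer: <ip> " and "New state: <state> ",
# as produced by the trap-handler script in this report.
count_states() {
  local log="$1" peer="$2"
  awk -v peer="$peer" '
    # Only consider lines logged for the requested peer.
    index($0, "Peer: " peer " ") {
      if (index($0, "New state: Established ")) up++; else down++
    }
    END { printf "%d %d\n", up, down }
  ' "$log"
}
```

After one "service frr restart" on the peer, both counts would be expected to grow by one if both notifications fired.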

Expected behavior
The behavior has changed from 6.0.3 to 7.3. I don't think this was an intentional change. There should be a trigger every time a connection is Established, with the bgpEstablished MIB (.1.3.6.1.2.1.15.0.1), or every time the connection state goes backwards, with the bgpBackwardTransition MIB (.1.3.6.1.2.1.15.0.2).
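
The two notification OIDs named above can be told apart by name when checking which one fired. This is a minimal sketch that uses the names as they appear in this report. `trap_name` is a hypothetical helper, not part of FRR or Net-SNMP.

```shell
#!/bin/bash
# trap_name OID
# Maps a notification OID from this report to its name.
trap_name() {
  # Strip an optional leading dot so both spellings match.
  case "${1#.}" in
    1.3.6.1.2.1.15.0.1) echo "bgpEstablished" ;;
    1.3.6.1.2.1.15.0.2) echo "bgpBackwardTransition" ;;
    *) echo "unknown" ;;
  esac
}
```

For example, `trap_name .1.3.6.1.2.1.15.0.2` prints `bgpBackwardTransition`.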

Versions

OS Kernel: Ubuntu 18.04.4 using Linux 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
FRR Version [6.0.3 works, 7.3 does not work]

Additional context
MIBs were obtained with the snmp-mibs-downloader Ubuntu package.

/etc/frr/daemons:

# This file tells the frr package which daemons to start.
#
# Sample configurations for these daemons can be found in
# /usr/share/doc/frr/examples/.
#
# ATTENTION:
#
# When activating a daemon for the first time, a config file, even if it is
# empty, has to be present *and* be owned by the user and group "frr", else
# the daemon will not be started by /etc/init.d/frr. The permissions should
# be u=rw,g=r,o=.
# When using "vtysh" such a config file is also needed. It should be owned by
# group "frrvty" and set to ug=rw,o= though. Check /etc/pam.d/frr, too.
#
# The watchfrr and zebra daemons are always started.
#
bgpd=yes
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=yes
fabricd=no
vrrpd=no

#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
vtysh_enable=yes
zebra_options="  -A 127.0.0.1 -s 90000000 -Msnmp"
bgpd_options="   -A 127.0.0.1 -Msnmp"
ospfd_options="  -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options="   -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options="  -A 127.0.0.1"
pimd_options="   -A 127.0.0.1"
ldpd_options="   -A 127.0.0.1"
nhrpd_options="  -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options="   -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options="   -A 127.0.0.1"
fabricd_options="-A 127.0.0.1"
vrrpd_options="  -A 127.0.0.1"

# configuration profile
#
#frr_profile="traditional"
#frr_profile="datacenter"

#
# This is the maximum number of FD's that will be available.
# Upon startup this is read by the control files and ulimit
# is called.  Uncomment and use a reasonable value for your
# setup if you are expecting a large number of peers in
# say BGP.
#MAX_FDS=1024

# The list of daemons to watch is automatically generated by the init script.
#watchfrr_options=""

# for debugging purposes, you can specify a "wrap" command to start instead
# of starting the daemon directly, e.g. to use valgrind on ospfd:
#   ospfd_wrap="/usr/bin/valgrind"
# or you can use "all_wrap" for all daemons, e.g. to use perf record:
#   all_wrap="/usr/bin/perf record --call-graph -"
# the normal daemon command is added to this at the end.

/etc/snmp/snmptrapd.conf:

#
# snmptrapd.conf:
#
###############################################################################
# Access control
#  logs to syslog
authCommunity log,execute,net public

# Trap/Notification handlers
###############################################################################
# for bgpEstablished
traphandle .1.3.6.1.2.1.15.0.1 /etc/snmp/traphandler.sh
# for bgpBackwardTransition
traphandle .1.3.6.1.2.1.15.0.2 /etc/snmp/traphandler.sh

/etc/snmp/traphandler.sh:

#!/bin/bash
# Initial script from http://docs.frrouting.org/en/latest/snmp.html
# Modifications were made to handle route advertisement, but all of the
# transformation information was retained for logging or future use.

EBGP_PEER="169.254.0.10"
# this is run by root, so should be no permissions problems in /var/log
LOG_FILE=/var/log/snmp-traps.log

# local snmp community for getting AS belonging to peer
COMMUNITY="public"

# get stdin
INPUT=`cat -`

# get some vars from stdin
uptime=`echo $INPUT | cut -d' ' -f5`
peer=`echo $INPUT | cut -d' ' -f8 | sed -e 's/SNMPv2-SMI::mib-2.15.3.1.14.//g'`
# Regardless of FRR version (6.0.3 or 7.3) the peerstate given to the trap handler script is always "6" Established, even when the trigger is bgpBackwardTransition. I would love to see this fixed. To get the true peer state, we make a snmpget call below.
#peerstate=`echo $INPUT | cut -d' ' -f13`
errorcode=`echo $INPUT | cut -d' ' -f9 | sed -e 's/\"//g'`
suberrorcode=`echo $INPUT | cut -d' ' -f10 | sed -e 's/\"//g'`

# With FRR 7.3, a sleep of at least 1 second is required for the following snmpget commands to work. The sleep was not required with FRR 6.0.3 and would not be required at all if the peerstate handed to the script was correct (see above).
sleep 1
remoteas=`snmpget -v1 -c $COMMUNITY localhost SNMPv2-SMI::mib-2.15.3.1.9.$peer | cut -d' ' -f4`
peerstate=`snmpget -c $COMMUNITY -v1 localhost SNMPv2-SMI::mib-2.15.3.1.2.$peer | cut -d' ' -f4`

# convert peer state
case "$peerstate" in
  1) peerstate="Idle" ;;
  2) peerstate="Connect" ;;
  3) peerstate="Active" ;;
  4) peerstate="Opensent" ;;
  5) peerstate="Openconfirm" ;;
  6) peerstate="Established" ;;
  *) peerstate="Unknown" ;;
esac

# get textual messages for errors
case "$errorcode" in
  00)
    error="No error"
    suberror=""
    ;;
  01)
    error="Message Header Error"
    case "$suberrorcode" in
      01) suberror="Connection Not Synchronized" ;;
      02) suberror="Bad Message Length" ;;
      03) suberror="Bad Message Type" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  02)
    error="OPEN Message Error"
    case "$suberrorcode" in
      01) suberror="Unsupported Version Number" ;;
      02) suberror="Bad Peer AS" ;;
      03) suberror="Bad BGP Identifier" ;;
      04) suberror="Unsupported Optional Parameter" ;;
      05) suberror="Authentication Failure" ;;
      06) suberror="Unacceptable Hold Time" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  03)
    error="UPDATE Message Error"
    case "$suberrorcode" in
      01) suberror="Malformed Attribute List" ;;
      02) suberror="Unrecognized Well-known Attribute" ;;
      03) suberror="Missing Well-known Attribute" ;;
      04) suberror="Attribute Flags Error" ;;
      05) suberror="Attribute Length Error" ;;
      06) suberror="Invalid ORIGIN Attribute" ;;
      07) suberror="AS Routing Loop" ;;
      08) suberror="Invalid NEXT_HOP Attribute" ;;
      09) suberror="Optional Attribute Error" ;;
      10) suberror="Invalid Network Field" ;;
      11) suberror="Malformed AS_PATH" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  04)
    error="Hold Timer Expired"
    suberror=""
    ;;
  05)
    error="Finite State Machine Error"
    suberror=""
    ;;
  06)
    error="Cease"
    case "$suberrorcode" in
      01) suberror="Maximum Number of Prefixes Reached" ;;
      02) suberror="Administratively Shutdown" ;;
      03) suberror="Peer Unconfigured" ;;
      04) suberror="Administratively Reset" ;;
      05) suberror="Connection Rejected" ;;
      06) suberror="Other Configuration Change" ;;
      07) suberror="Connection collision resolution" ;;
      08) suberror="Out of Resource" ;;
      09) suberror="MAX" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  *)
    error="Unknown"
    suberror=""
    ;;
esac

# create textual message from errorcodes
if [ "x$suberror" == "x" ]; then
  NOTIFY="$errorcode ($error)"
else
  NOTIFY="$errorcode/$suberrorcode ($error/$suberror)"
fi

ACTION="None"
# Quote the expansions so an empty $peer or $peerstate cannot break the test.
if [ "$EBGP_PEER" == "$peer" ]; then
  if [ "Established" == "$peerstate" ]; then
    ACTION="Adding advertisements"
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'network 100.64.255.240/29 route-map RM_EXT_PEER_IN'  -c 'end' -c 'write'
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'network 100.64.0.0/16 route-map RM_EXT_PEER_IN'  -c 'end' -c 'write'
  else
    ACTION="Removing advertisements"
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'no network 100.64.255.240/29'  -c 'end' -c 'write'
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'no network 100.64.0.0/16'  -c 'end' -c 'write'
  fi
fi

DATETIME=`date -Iseconds`
# create message
MSG=`cat << EOF
$DATETIME Snmpd uptime: $uptime Peer: $peer AS: $remoteas New state: $peerstate Notification: $NOTIFY Action: $ACTION
EOF`

echo "$MSG" >> $LOG_FILE
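
The `cut` field positions used in the script depend on how snmptrapd prints each varbind. This is a minimal sketch of an alternative that looks varbinds up by OID prefix, assuming the traphandle input layout described in snmptrapd.conf(5): a hostname line, an address line, then one "OID value" pair per line. `varbind_value` is a hypothetical helper. It does not change which peer state bgpd puts in the trap.

```shell
#!/bin/bash
# varbind_value PREFIX
# Reads traphandle input on stdin and prints the value of the first
# varbind whose OID starts with PREFIX. Returns 1 if none matches.
varbind_value() {
  local prefix="$1" skip oid value
  # The first two lines carry the sender's hostname and address.
  read -r skip
  read -r skip
  # Each remaining line is "<OID> <value>"; the value may contain spaces.
  while read -r oid value; do
    case "$oid" in
      "$prefix"*) printf '%s\n' "$value"; return 0 ;;
    esac
  done
  return 1
}
```

In the handler it could be fed the saved input, for example `echo "$INPUT" | varbind_value SNMPv2-SMI::mib-2.15.3.1.2.`, with `$INPUT` quoted so the line breaks survive.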


It was previously comparing an fsm event variable with an fsm status constant. This fixes issue FRRouting#5963. Signed-off-by: Josh Cox <[email protected]>

In PR FRRouting#6052 which fixes issue FRRouting#5963 the bgp fsm events were confused with the bgp fsm status leading to a bug. Let's start separating those out so these types of failures cannot just easily occur. Signed-off-by: Donald Sharp <[email protected]>

joshdcox · 2020-05-05T22:08:03Z

I think this can be closed. It is fixed by #6052.

Try75 added the triage Needs further investigation label Mar 11, 2020

Try75 mentioned this issue Mar 11, 2020

FRR deb repo removed 6.0.3 packages for Ubuntu 18.04 #5964

Closed

donaldsharp self-assigned this Mar 17, 2020

qlyoung added snmp bgp labels Mar 19, 2020

joshdcox mentioned this issue Mar 19, 2020

bgpd: Fixed snmp and bmp 'just Established' test. #6052

Merged

ton31337 closed this as completed Mar 4, 2022
