SNMP trigger fails on peer state going to Established first time #5963

Closed
Try75 opened this issue Mar 11, 2020 · 1 comment
Assignees
Labels
bgp snmp triage Needs further investigation

Try75 commented Mar 11, 2020

In 6.0.3, with SNMP enabled for zebra and bgpd, establishing a connection with a peer would trigger an SNMP event and our trap-handler script would insert some routes. When the peer connection was no longer Established, an SNMP trigger would fire and the trap-handler script would remove the routes.

In 7.3, the SNMP trigger for bgpEstablished does not always fire, so our trap-handler script is not called and the routes are not inserted as they were in 6.0.3. The SNMP trigger for bgpBackwardTransition does not always seem to fire either, unlike in 6.0.3. I think something changed in the logic around peer state "transition", but I am not sure. The problem might have surfaced around commit 7d8d0ea.

From the peer (also FRR 7.3), if I do "service frr restart", I see a log message from the trap-handler script that shows only an Established connection, but not one for the connection going away. If I instead do "service frr stop; sleep 5; service frr start" from the peer, I see only a message saying that the connection has gone away, and none saying that the connection was re-established. In 6.0.3, either of these commands triggered both the bgpEstablished and bgpBackwardTransition notifications, and both showed up in the logs of the trap-handler script.

[x] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?

Steps to reproduce the behavior:

  1. Set up two peering bgp routers with SNMP enabled (snmpd, snmptrapd, bgpd -Msnmp)
  2. Configure a traphandler script for the two SNMP triggers
  3. From the peer of the SNMP-enabled bgp router, establish a session and bring it down in any number of ways
  4. Observe from the output (or lack of output) of the trap-handler script that it was not triggered every time it should have been, given the state of the peer
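
The traphandle wiring from step 2 can be checked independently of bgpd by sending a synthetic notification with net-snmp's `snmptrap`. The following is a minimal sketch, not part of the original report: the function name `send_test_trap` and the `DRY_RUN`, `COMMUNITY`, and `TRAP_HOST` variables are hypothetical, and the varbind layout (last error, then peer state) is an assumption modeled on the fields the handler script reads.

```shell
#!/bin/bash
# send_test_trap NOTIFICATION PEER [STATE]
# Hypothetical helper: sends a hand-built notification to snmptrapd so the
# traphandle entries can be exercised without touching a real BGP session.
send_test_trap() {
  local name="$1" peer="$2" state="${3:-6}" trap_oid
  case "$name" in
    bgpEstablished)        trap_oid=".1.3.6.1.2.1.15.0.1" ;;
    bgpBackwardTransition) trap_oid=".1.3.6.1.2.1.15.0.2" ;;
    *) echo "unknown notification: $name" >&2; return 1 ;;
  esac
  # Assumed varbinds: last error for the peer (2 octets, hex), then peer state.
  # The empty '' argument asks snmptrap to fill in the current uptime.
  local cmd=(snmptrap -v 2c -c "${COMMUNITY:-public}" "${TRAP_HOST:-localhost}" ''
             "$trap_oid"
             ".1.3.6.1.2.1.15.3.1.14.$peer" x "0000"
             ".1.3.6.1.2.1.15.3.1.2.$peer" i "$state")
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it.
    printf '%q ' "${cmd[@]}"; echo
  else
    "${cmd[@]}"
  fi
}

# Example (dry run): DRY_RUN=1 send_test_trap bgpEstablished 169.254.0.10
```

If the handler logs a line for a synthetic notification but not for a real session change, the gap is more likely on the bgpd side than in the snmptrapd configuration.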

Expected behavior
The behavior has changed from 6.0.3 to 7.3. I don't think this was an intentional change. There should be a trigger every time a connection is Established, with the bgpEstablished notification (.1.3.6.1.2.1.15.0.1), and every time the connection state goes backwards, with the bgpBackwardTransition notification (.1.3.6.1.2.1.15.0.2).
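
One way to spot a missing trigger after the fact is to scan the handler's log for two consecutive events of the same kind for one peer. The sketch below assumes the log line format written by the traphandler.sh script in this report (`... Peer: <addr> ... New state: <state> ...`); the function name `find_missed_transitions` is hypothetical. Repeated non-Established events can be legitimate, since a session may step backwards more than once, so two Established events in a row are the stronger signal.

```shell
#!/bin/bash
# find_missed_transitions PEER LOGFILE
# Hypothetical helper: reports log lines where the same kind of event
# (Established vs. anything else) appears twice in a row for PEER.
find_missed_transitions() {
  awk -v peer="$1" '
    {
      p = ""; s = ""
      # Locate the values by their labels rather than by fixed position,
      # so an empty AS field does not shift the result.
      for (i = 1; i < NF; i++) {
        if ($i == "Peer:") p = $(i + 1)
        if ($i == "New" && $(i + 1) == "state:") s = $(i + 2)
      }
      if (p != peer || s == "") next
      up = (s == "Established")
      if (seen && up == last_up)
        printf "line %d: repeated %s event, a notification may be missing\n", NR, (up ? "up" : "down")
      seen = 1; last_up = up
    }' "$2"
}

# Example: find_missed_transitions 169.254.0.10 /var/log/snmp-traps.log
```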

Versions

Additional context
MIBs were obtained with the snmp-mibs-downloader ubuntu package.

/etc/frr/daemons:

# This file tells the frr package which daemons to start.
#
# Sample configurations for these daemons can be found in
# /usr/share/doc/frr/examples/.
#
# ATTENTION:
#
# When activating a daemon for the first time, a config file, even if it is
# empty, has to be present *and* be owned by the user and group "frr", else
# the daemon will not be started by /etc/init.d/frr. The permissions should
# be u=rw,g=r,o=.
# When using "vtysh" such a config file is also needed. It should be owned by
# group "frrvty" and set to ug=rw,o= though. Check /etc/pam.d/frr, too.
#
# The watchfrr and zebra daemons are always started.
#
bgpd=yes
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=yes
fabricd=no
vrrpd=no

#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
vtysh_enable=yes
zebra_options="  -A 127.0.0.1 -s 90000000 -Msnmp"
bgpd_options="   -A 127.0.0.1 -Msnmp"
ospfd_options="  -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options="   -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options="  -A 127.0.0.1"
pimd_options="   -A 127.0.0.1"
ldpd_options="   -A 127.0.0.1"
nhrpd_options="  -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options="   -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options="   -A 127.0.0.1"
fabricd_options="-A 127.0.0.1"
vrrpd_options="  -A 127.0.0.1"

# configuration profile
#
#frr_profile="traditional"
#frr_profile="datacenter"

#
# This is the maximum number of FD's that will be available.
# Upon startup this is read by the control files and ulimit
# is called.  Uncomment and use a reasonable value for your
# setup if you are expecting a large number of peers in
# say BGP.
#MAX_FDS=1024

# The list of daemons to watch is automatically generated by the init script.
#watchfrr_options=""

# for debugging purposes, you can specify a "wrap" command to start instead
# of starting the daemon directly, e.g. to use valgrind on ospfd:
#   ospfd_wrap="/usr/bin/valgrind"
# or you can use "all_wrap" for all daemons, e.g. to use perf record:
#   all_wrap="/usr/bin/perf record --call-graph -"
# the normal daemon command is added to this at the end.

/etc/snmp/snmptrapd.conf:

#
# snmptrapd.conf:
#
###############################################################################
# Access control
#  logs to syslog
authCommunity log,execute,net public

# Trap/Notification handlers
###############################################################################
# for bgpEstablished
traphandle .1.3.6.1.2.1.15.0.1 /etc/snmp/traphandler.sh
# for bgpBackwardTransition
traphandle .1.3.6.1.2.1.15.0.2 /etc/snmp/traphandler.sh
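
Because both notifications are routed to the same script, the handler has to work out which one fired. snmptrapd passes the handler the sender's host name on the first line of stdin, its address on the second, and then one `OID value` pair per line, including the snmpTrapOID.0 varbind that names the notification. The sketch below reads that varbind instead of relying on the reported peer state. The function name `classify_trap` is hypothetical, and the exact rendering of the OIDs depends on which MIBs snmptrapd has loaded, so the patterns match on suffixes.

```shell
#!/bin/bash
# classify_trap: read snmptrapd handler input on stdin and print which
# notification fired (names as used in this report).
classify_trap() {
  local host addr oid value trap_oid=""
  read -r host   # line 1: sender host name
  read -r addr   # line 2: sender transport address
  # Remaining lines: one "OID value" pair per varbind.
  while read -r oid value; do
    case "$oid" in
      *snmpTrapOID.0|*6.3.1.1.4.1.0) trap_oid="$value" ;;
    esac
  done
  case "$trap_oid" in
    *bgpEstablished|*.15.0.1)        echo "bgpEstablished" ;;
    *bgpBackwardTransition|*.15.0.2) echo "bgpBackwardTransition" ;;
    *)                               echo "unknown" ;;
  esac
}

# Example inside a handler: kind=$(classify_trap)
```

Note that `classify_trap` consumes all of stdin, so a handler that also needs the other varbinds would have to capture the input first and feed a copy to the function.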

/etc/snmp/traphandler.sh:

#!/bin/bash
# Initial script from http://docs.frrouting.org/en/latest/snmp.html
# Modifications were made to handle route advertisement, but all of the
# transformation information was retained for logging or future use.

EBGP_PEER="169.254.0.10"
# this is run by root, so should be no permissions problems in /var/log
LOG_FILE=/var/log/snmp-traps.log

# local snmp community for getting AS belonging to peer
COMMUNITY="public"

# get stdin
INPUT=`cat -`

# get some vars from stdin
uptime=`echo $INPUT | cut -d' ' -f5`
peer=`echo $INPUT | cut -d' ' -f8 | sed -e 's/SNMPv2-SMI::mib-2.15.3.1.14.//g'`
# Regardless of FRR version (6.0.3 or 7.3) the peerstate given to the trap handler script is always "6" Established, even when the trigger is bgpBackwardTransition. I would love to see this fixed. To get the true peer state, we make a snmpget call below.
#peerstate=`echo $INPUT | cut -d' ' -f13`
errorcode=`echo $INPUT | cut -d' ' -f9 | sed -e 's/\"//g'`
suberrorcode=`echo $INPUT | cut -d' ' -f10 | sed -e 's/\"//g'`

# With FRR 7.3, a sleep of at least 1 second is required for the following snmpget commands to work. The sleep was not required with FRR 6.0.3 and would not be required at all if the peerstate handed to the script was correct (see above).
sleep 1
remoteas=`snmpget -v1 -c $COMMUNITY localhost SNMPv2-SMI::mib-2.15.3.1.9.$peer | cut -d' ' -f4`
peerstate=`snmpget -c $COMMUNITY -v1 localhost SNMPv2-SMI::mib-2.15.3.1.2.$peer | cut -d' ' -f4`

# convert peer state
case "$peerstate" in
  1) peerstate="Idle" ;;
  2) peerstate="Connect" ;;
  3) peerstate="Active" ;;
  4) peerstate="Opensent" ;;
  5) peerstate="Openconfirm" ;;
  6) peerstate="Established" ;;
  *) peerstate="Unknown" ;;
esac

# get textual messages for errors
case "$errorcode" in
  00)
    error="No error"
    suberror=""
    ;;
  01)
    error="Message Header Error"
    case "$suberrorcode" in
      01) suberror="Connection Not Synchronized" ;;
      02) suberror="Bad Message Length" ;;
      03) suberror="Bad Message Type" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  02)
    error="OPEN Message Error"
    case "$suberrorcode" in
      01) suberror="Unsupported Version Number" ;;
      02) suberror="Bad Peer AS" ;;
      03) suberror="Bad BGP Identifier" ;;
      04) suberror="Unsupported Optional Parameter" ;;
      05) suberror="Authentication Failure" ;;
      06) suberror="Unacceptable Hold Time" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  03)
    error="UPDATE Message Error"
    case "$suberrorcode" in
      01) suberror="Malformed Attribute List" ;;
      02) suberror="Unrecognized Well-known Attribute" ;;
      03) suberror="Missing Well-known Attribute" ;;
      04) suberror="Attribute Flags Error" ;;
      05) suberror="Attribute Length Error" ;;
      06) suberror="Invalid ORIGIN Attribute" ;;
      07) suberror="AS Routing Loop" ;;
      08) suberror="Invalid NEXT_HOP Attribute" ;;
      09) suberror="Optional Attribute Error" ;;
      10) suberror="Invalid Network Field" ;;
      11) suberror="Malformed AS_PATH" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  04)
    error="Hold Timer Expired"
    suberror=""
    ;;
  05)
    error="Finite State Machine Error"
    suberror=""
    ;;
  06)
    error="Cease"
    case "$suberrorcode" in
      01) suberror="Maximum Number of Prefixes Reached" ;;
      02) suberror="Administratively Shutdown" ;;
      03) suberror="Peer Unconfigured" ;;
      04) suberror="Administratively Reset" ;;
      05) suberror="Connection Rejected" ;;
      06) suberror="Other Configuration Change" ;;
      07) suberror="Connection collision resolution" ;;
      08) suberror="Out of Resource" ;;
      09) suberror="MAX" ;;
      *) suberror="Unknown" ;;
    esac
    ;;
  *)
    error="Unknown"
    suberror=""
    ;;
esac

# create textual message from errorcodes
if [ "x$suberror" == "x" ]; then
  NOTIFY="$errorcode ($error)"
else
  NOTIFY="$errorcode/$suberrorcode ($error/$suberror)"
fi

ACTION="None"
if [ "$EBGP_PEER" == "$peer" ]; then
  if [ "Established" == "$peerstate" ]; then
    ACTION="Adding advertisements"
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'network 100.64.255.240/29 route-map RM_EXT_PEER_IN'  -c 'end' -c 'write'
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'network 100.64.0.0/16 route-map RM_EXT_PEER_IN'  -c 'end' -c 'write'
  else
    ACTION="Removing advertisements"
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'no network 100.64.255.240/29'  -c 'end' -c 'write'
    vtysh -c 'configure terminal' -c 'router bgp 4238440709' -c 'address-family ipv4 unicast' -c 'no network 100.64.0.0/16'  -c 'end' -c 'write'
  fi
fi

DATETIME=`date -Iseconds`
# create message
MSG=`cat << EOF
$DATETIME Snmpd uptime: $uptime Peer: $peer AS: $remoteas New state: $peerstate Notification: $NOTIFY Action: $ACTION
EOF`

echo "$MSG" >> $LOG_FILE
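
The fixed `sleep 1` before the snmpget calls in the script above could be replaced by a bounded retry, so the handler waits only as long as it has to. This is a sketch under one assumption that the report does not confirm: that the early snmpget calls fail with a nonzero exit status. If they instead succeed and return a stale state, a retry like this would not help. The function name `retry` is hypothetical.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it exits 0, at most ATTEMPTS times, sleeping DELAY
# seconds between attempts. Returns 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then return 0; fi
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}

# Hypothetical use in the handler (-Oqv prints only the value):
# peerstate=$(retry 5 0.2 snmpget -v1 -c "$COMMUNITY" -Oqv localhost \
#   "SNMPv2-SMI::mib-2.15.3.1.2.$peer")
```
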
@Try75 Try75 added the triage Needs further investigation label Mar 11, 2020
@donaldsharp donaldsharp self-assigned this Mar 17, 2020
joshdcox pushed a commit to pureport/frr that referenced this issue Mar 19, 2020
It was previously comparing an fsm event variable with an fsm status constant.
This fixes issue FRRouting#5963.

Signed-off-by: Josh Cox <[email protected]>
donaldsharp added a commit to donaldsharp/frr that referenced this issue Mar 20, 2020
In PR FRRouting#6052 which fixes issue FRRouting#5963 the bgp fsm events
were confused with the bgp fsm status leading
to a bug.  Let's start separating those out
so these types of failures cannot just
easily occur.

Signed-off-by: Donald Sharp <[email protected]>
joshdcox pushed a commit to pureport/frr that referenced this issue Mar 20, 2020
It was previously comparing an fsm event variable with an fsm status constant.
This fixes issue FRRouting#5963.

Signed-off-by: Josh Cox <[email protected]>
gpziemba pushed a commit to LabNConsulting/frr that referenced this issue Apr 27, 2020
It was previously comparing an fsm event variable with an fsm status constant.
This fixes issue FRRouting#5963.

Signed-off-by: Josh Cox <[email protected]>
gpziemba pushed a commit to LabNConsulting/frr that referenced this issue Apr 27, 2020
In PR FRRouting#6052 which fixes issue FRRouting#5963 the bgp fsm events
were confused with the bgp fsm status leading
to a bug.  Let's start separating those out
so these types of failures cannot just
easily occur.

Signed-off-by: Donald Sharp <[email protected]>

joshdcox commented May 5, 2020

I think this can be closed. It is fixed by #6052.

@ton31337 ton31337 closed this as completed Mar 4, 2022

5 participants