Skip to content

hostcfgd: Handle missed tacacs updates between load & listen #8223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 6, 2021

Conversation

renukamanavalan
Copy link
Contributor

Why I did it

The time gap between last config load & db-listen seem to have increased.
Any config updates that occurred in this gap gets missed by db-listen.
This could miss updating /etc/pam.d/common-auth-sonic

How I did it

Add a one shot timer, just before db-listen. The timer will fire after the subscribe is done
When the timer fires, reload tacacs & aaa

How to verify it

A tough repro. So you may explicitly add some sleep just before listen.
Make sure the sleep is at least 10 seconds less than timer.
Write a log just before sleep.
restart hostcfgd and watch syslog for the added log line
Upon this log, make some tacacs updates that would have some change in
/etc/pam.d/common-auth-sonic.
Watch for changes in that file

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

…pdates between last

config load and db-listen.
Copy link
Contributor

@arlakshm arlakshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this change in master as well ?

@@ -25,6 +26,20 @@ TACPLUS_SERVER_PASSKEY_DEFAULT = ""
TACPLUS_SERVER_TIMEOUT_DEFAULT = "5"
TACPLUS_SERVER_AUTH_TYPE_DEFAULT = "pap"

global_lock = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iso of global_lock, can we move this lock to the AaaCfg() ? To prevent another classes in this file from using this lock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone else (another subscriber) gets a need, they could use this.

@abdosi
Copy link
Contributor

abdosi commented Jul 27, 2021

@renukamanavalan and @yxieca : when the issue is seen the steps are:-

a) load minigraph
b) config tacacs * commands

Basically delay bewteen restart hostcfgd and config command is 10-15 seconds. I think if we can randomize delay we might be able to repro this

Restart of hostcfgd service as part of load minigraph
image
Config save after load minigrpah

image

Tacacs configuration
image

@renukamanavalan
Copy link
Contributor Author

@renukamanavalan and @yxieca : when the issue is seen the steps are:-

a) load minigraph
b) config tacacs * commands

Basically delay bewteen restart hostcfgd and config command is 10-15 seconds. I think if we can randomize delay we might be able to repro this

Restart of hostcfgd service as part of load minigraph
Config save after load minigrpah

Tacacs configuration

Managed a repro with following script. I could repro reliably in two systems. In one case, the pause time of 5 seconds worked, but in another it took 11 seconds. Again this is pretty dependent on many other factors. With help of load_minigraph, we could put control plane on high load, which helped get a distinct pause between load & listen.

for ((;;))
do
    tsleep=1
    tinterval=1

    for ((i=0; i<16; i++))
    do
        for ((j=0; j<5; j++))
        do
            config load_minigraph -y
            sleep 30s
            systemctl restart hostcfgd

            # variable sleep since hostcfgd restart
            sleep ${tsleep}

            # Update config
            config tacacs passkey xxxx
            config tacacs  authtype login
            config aaa authentication failthrough enable
            config aaa authentication login tacacs+
            echo "tacacs configured; sleep 30s before testing pam file"

            # Pause to allow hostcfgd to react
            sleep 30s

            # Check pam.d conf file
            n=$(grep testing123 /etc/pam.d/common-auth-sonic | wc -l)
            if [ $n -ne 1 ]; then
                echo "$(date): sleep little longer"
                sleep 30s
                n=$(grep testing123 /etc/pam.d/common-auth-sonic | wc -l)
                if [ $n -ne 1 ]; then
                    echo "$(date): Failure encountered"
                    exit -2
                fi
            fi
            echo "$(date): Run i=$i j=$j tsleep=${tsleep} complete"
        done
        tsleep=$((tsleep + tinterval))
    done
done

def tacacs_server_update(self, key, data, modify_conf=True):
if data == {}:
if key in self.tacplus_servers:
del self.tacplus_servers[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we don't need this code anymore ? all these changes looks unrelated to this fix.
Can we have separate PR for reason of these changes or at least can we add to PR description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this code anymore. The whole purpose of this code is to keep a local cache of tacacs & aaa. Maintain the cache on updates and use the cache. We don't do caching anymore.

We get it when we need it. This handles the missed-updates in simple manner and btw, we don't gain any by keeping duplicate cache in mem.

@abdosi abdosi merged commit 8cd6714 into sonic-net:201911 Aug 6, 2021
renukamanavalan added a commit to renukamanavalan/sonic-buildimage that referenced this pull request Feb 15, 2022
renukamanavalan added a commit that referenced this pull request Feb 17, 2022
…9987)

Why I did it
There is a small window between load & listen to config-DB. If TACACS config got updated during that gap, the listen will not show it, hence hostcfgd would miss it, until another update.

How I did it
porting PR #8223, which uses one shot timer to reload tacacs config.
@renukamanavalan renukamanavalan deleted the hostcfgd_db branch April 17, 2022 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants