-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[fast-reboot] Optimize PMON and LLDP services start up CPU consumption #10242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,8 +42,8 @@ if [ -e /usr/share/sonic/platform/platform_wait ]; then | |
fi | ||
|
||
# If the Python 2 sonic-platform package is not installed, try to install it | ||
python2 -c "import sonic_platform" > /dev/null 2>&1 || pip2 show sonic-platform > /dev/null 2>&1 | ||
if [ $? -ne 0 ]; then | ||
PYTHON2_PLATFORM_PACKAGE=/usr/local/lib/python2.7/dist-packages/sonic_platform | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The package can be anywhere in PYTHONPATH. Suggest to use pkgutil. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the paths can change, and pkgutil should be better tool here. Also, will the new check of path existence validate both |
||
if [ ! -d "$PYTHON2_PLATFORM_PACKAGE" ]; then | ||
SONIC_PLATFORM_WHEEL="/usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl" | ||
echo "sonic-platform package not installed, attempting to install..." | ||
if [ -e ${SONIC_PLATFORM_WHEEL} ]; then | ||
|
@@ -59,8 +59,8 @@ if [ $? -ne 0 ]; then | |
fi | ||
|
||
# If the Python 3 sonic-platform package is not installed, try to install it | ||
python3 -c "import sonic_platform" > /dev/null 2>&1 || pip3 show sonic-platform > /dev/null 2>&1 | ||
if [ $? -ne 0 ]; then | ||
PYTHON3_PLATFORM_PACKAGE=/usr/local/lib/python3.7/dist-packages/sonic_platform | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The python minor version might change. Suggest to use pkgutil. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same comment as for Python2. |
||
if [ ! -d "$PYTHON3_PLATFORM_PACKAGE" ]; then | ||
SONIC_PLATFORM_WHEEL="/usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl" | ||
echo "sonic-platform package not installed, attempting to install..." | ||
if [ -e ${SONIC_PLATFORM_WHEEL} ]; then | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to make sure this won't interffer with create_switch in the future?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a way to make sure this won't interfere in the future.
I think the best approach to improve the performance and ensure the stability of the solution is to delay these services in 90 seconds (fastboot criteria for control plane), this improve the performance even more.
The suggested services to delay are: dhcp_relay, radv, pmon and lldp.
In the future, as part of the HLD I am currently working, we will have a better way to indicate the end of fast-reboot flow and we can start the services according to it.
@liat-grozovik @stepanblyschak @vaibhavhd @yxieca what do you think about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM please wait for @vaibhavhd and @yxieca to confirm if this is ok from user pov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused about what are we proposing. This PR is preempting lldp execution (
LLDP consume the CPU before create_switch is executed and not interfering it.
), but we are also saying that in future PR we are going to defer the execution post fast-reboot completion.Do we need this change right now, if we are soon going to depend upon fast-reboot end-indicator to start the deferred set of services?
I am not sure what is the impact of delaying LLDP.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggested solution with fast-reboot end-indicator will indeed defer the set of services to start, but this is planned for next release and for this branch we can optimize it by delaying the services.
Notice that the new infrastructure will be supported only from next release and above since there are operations required on the old image which are used with the new image, just like in warm reboot.
On this case even if the new design will be implemented on 202205, fast-reboot to an image from this release will still fall back to the previous fast-reboot implementation, since DB dump is not available etc.
Regarding delaying all 4 suggested services: dhcp_relay, radv, pmon and lldp
By my testing I didn't see any impact and no errors appeared in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand the new-design in future-branch part.
Still some open questions. I agree that lldp startup sequence can be adjusted to prevent overlap with create-switch. But my doubt was - should it be done before create switch (this PR), or after create-switch (deferred start in 202205)?
Also I believe that this change will affect warm-boot process too. Do you agree? If yes, then title and description should be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to change this PR with the after create-switch approach - deferred start.
Following Stepan comment, this approach is lack of a sync mechanism to ensure it will not overlap again in the future.
The other suggestion is to delay the entire service startup until init flow has finished with a 90 seconds timer and this is much more reliable and a bit similar to the new design of 202205.
I mean, with the new implementation these services will be delayed as well, the only difference is to start them with a different mechanism and not a timer.
I agree it will affect warm-reboot and the description will be modified as well.
If this approach is acceptable I will update all as needed.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vaibhavhd what do you think? Should I proceed and modify the PR to this suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replied back on the email regarding this. Sorry for stalling this. The new suggested approach LGTM.
However, I am not confident about impact of delaying lldp with a 90s timer. May be @prsunny or @yxieca can vet this from lldp point of view.
At this stage, I also would suggest to put two different PRs for pmon and lldp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @vaibhavhd, I will create a PR and test it.