Skip to content

[fast-reboot] Optimize PMON and LLDP services start up CPU consumption #10242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

[fast-reboot] Optimize PMON and LLDP services start up CPU consumption #10242

wants to merge 1 commit into from

Conversation

shlomibitton
Copy link
Contributor

@shlomibitton shlomibitton commented Mar 15, 2022

Signed-off-by: Shlomi Bitton [email protected]

Why I did it

Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, some adaptations to init of PMON and LLDP ae required to off load the CPU consumption during create_switch execution.

How I did it

  • Change the way PMON init bash script validate the existence of python2 and python3 platform packages to a more efficient way.
  • On LLDP, move the sonic-cfggen call by "start.sh" script triggered by supervisord to the init bash script of this docker.

By doing this 2 adaptations PMON is not consuming a lot of CPU during init and LLDP consume the CPU before create_switch is executed and not interfering it.

How to verify it

Run fast-reboot on MLNX platform and observe ~2 seconds less in create_switch execution time.
Attached to this PR, bootchart outputs with the different utilizations

With no optimizations:

bootchart_no_optimizations

After PMON optimizations:

bootchart_202111_pmon_optimized_sonic_db_cli_optimized

After PMON and LLDP optimizations:

bootchart_202111_pmon_optimized_sonic_db_cli_optimized_lldp_optimized

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

…python3 platform package.

Optimize LLDP service start up by calling sonic-cfggen at the beginning script and not as part of supervisord.
These optimizations are for execution of create_switch function on syncd with less CPU consumers, mainly for fast-reboot but good to have in general

Signed-off-by: Shlomi Bitton <[email protected]>
@shlomibitton shlomibitton requested a review from lguohan as a code owner March 15, 2022 12:48
@shlomibitton shlomibitton marked this pull request as draft March 15, 2022 12:48
@liat-grozovik
Copy link
Collaborator

@yxieca , @stepanblyschak FYI, please review

@yxieca yxieca requested a review from vaibhavhd March 15, 2022 23:27
@liat-grozovik
Copy link
Collaborator

@vaibhavhd could you please help to review?

@shlomibitton shlomibitton marked this pull request as ready for review March 23, 2022 09:21
@dprital dprital added the Request for 202111 Branch For PRs being requested for 202111 branch label Mar 24, 2022
@liat-grozovik
Copy link
Collaborator

@yxieca , @stepanblyschak kindly reminder, please review

@liat-grozovik liat-grozovik requested review from yxieca and stepanblyschak and removed request for lguohan March 27, 2022 06:14
@@ -42,8 +42,8 @@ if [ -e /usr/share/sonic/platform/platform_wait ]; then
fi

# If the Python 2 sonic-platform package is not installed, try to install it
python2 -c "import sonic_platform" > /dev/null 2>&1 || pip2 show sonic-platform > /dev/null 2>&1
if [ $? -ne 0 ]; then
PYTHON2_PLATFORM_PACKAGE=/usr/local/lib/python2.7/dist-packages/sonic_platform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package can be anywhere in PYTHONPATH. Suggest to use pkgutil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the paths can change, and pkgutil should be better tool here.

Also, will the new check of path existence validate both sonic_platform and sonic-platform availability? The previous check was doing that, and I see only sonic_platform in new check.

@@ -59,8 +59,8 @@ if [ $? -ne 0 ]; then
fi

# If the Python 3 sonic-platform package is not installed, try to install it
python3 -c "import sonic_platform" > /dev/null 2>&1 || pip3 show sonic-platform > /dev/null 2>&1
if [ $? -ne 0 ]; then
PYTHON3_PLATFORM_PACKAGE=/usr/local/lib/python3.7/dist-packages/sonic_platform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python minor version might change. Suggest to use pkgutil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for Python2.

@@ -1,4 +1,7 @@
#!/usr/bin/env bash

. /usr/bin/start.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to make sure this won't interffer with create_switch in the future?

Copy link
Contributor Author

@shlomibitton shlomibitton Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a way to make sure this won't interfere in the future.
I think the best approach to improve the performance and ensure the stability of the solution is to delay these services in 90 seconds (fastboot criteria for control plane), this improve the performance even more.
The suggested services to delay are: dhcp_relay, radv, pmon and lldp.
In the future, as part of the HLD I am currently working, we will have a better way to indicate the end of fast-reboot flow and we can start the services according to it.
@liat-grozovik @stepanblyschak @vaibhavhd @yxieca what do you think about this approach?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM please wait for @vaibhavhd and @yxieca to confirm if this is ok from user pov

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about what are we proposing. This PR is preempting lldp execution (LLDP consume the CPU before create_switch is executed and not interfering it.), but we are also saying that in future PR we are going to defer the execution post fast-reboot completion.
Do we need this change right now, if we are soon going to depend upon fast-reboot end-indicator to start the deferred set of services?
I am not sure what is the impact of delaying LLDP.

Copy link
Contributor Author

@shlomibitton shlomibitton Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested solution with fast-reboot end-indicator will indeed defer the set of services to start, but this is planned for next release and for this branch we can optimize it by delaying the services.
Notice that the new infrastructure will be supported only from next release and above since there are operations required on the old image which are used with the new image, just like in warm reboot.
On this case even if the new design will be implemented on 202205, fast-reboot to an image from this release will still fall back to the previous fast-reboot implementation, since DB dump is not available etc.

Regarding delaying all 4 suggested services: dhcp_relay, radv, pmon and lldp
By my testing I didn't see any impact and no errors appeared in the logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand the new-design in future-branch part.
Still some open questions. I agree that lldp startup sequence can be adjusted to prevent overlap with create-switch. But my doubt was - should it be done before create switch (this PR), or after create-switch (deferred start in 202205)?

Also I believe that this change will affect warm-boot process too. Do you agree? If yes, then title and description should be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to change this PR with the after create-switch approach - deferred start.
Following Stepan comment, this approach is lack of a sync mechanism to ensure it will not overlap again in the future.
The other suggestion is to delay the entire service startup until init flow has finished with a 90 seconds timer and this is much more reliable and a bit similar to the new design of 202205.
I mean, with the new implementation these services will be delayed as well, the only difference is to start them with a different mechanism and not a timer.
I agree it will affect warm-reboot and the description will be modified as well.
If this approach is acceptable I will update all as needed.

Copy link
Contributor Author

@shlomibitton shlomibitton Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavhd what do you think? Should I proceed and modify the PR to this suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replied back on the email regarding this. Sorry for stalling this. The new suggested approach LGTM.
However, I am not confident about impact of delaying lldp with a 90s timer. May be @prsunny or @yxieca can vet this from lldp point of view.

At this stage, I also would suggest to put two different PRs for pmon and lldp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @vaibhavhd, I will create a PR and test it.

@liat-grozovik
Copy link
Collaborator

@shlomibitton request change indication was removed. Still need to handle, right?

@shlomibitton
Copy link
Contributor Author

@shlomibitton request change indication was removed. Still need to handle, right?

Still waiting for @vaibhavhd and @yxieca reply for the new suggestion following Stepan comment.
To make it stable we can use the suggested approach, need to get the approval first and I will update the PR accordingly.

@shlomibitton
Copy link
Contributor Author

@vaibhavhd @liat-grozovik @stepanblyschak @yxieca I created 2 new PR's with the suggested approach, can you please review? Thanks

PMON: #10567
LLDP: #10568

Closing this PR.

@judyjoseph judyjoseph removed the Request for 202111 Branch For PRs being requested for 202111 branch label Apr 18, 2022
@shlomibitton shlomibitton deleted the shlomi_fast_reboot_optimizations_master branch June 2, 2022 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants