-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[fast-reboot] Optimize PMON and LLDP services start up CPU consumption #10242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fast-reboot] Optimize PMON and LLDP services start up CPU consumption #10242
Conversation
…python3 platform package. Optimize LLDP service start up by calling sonic-cfggen at the beginning script and not as part of supervisord. These optimizations are for execution of create_switch function on syncd with less CPU consumers, mainly for fast-reboot but good to have in general Signed-off-by: Shlomi Bitton <[email protected]>
@yxieca , @stepanblyschak FYI, please review |
@vaibhavhd could you please help to review? |
@yxieca , @stepanblyschak kindly reminder, please review |
@@ -42,8 +42,8 @@ if [ -e /usr/share/sonic/platform/platform_wait ]; then | |||
fi | |||
|
|||
# If the Python 2 sonic-platform package is not installed, try to install it | |||
python2 -c "import sonic_platform" > /dev/null 2>&1 || pip2 show sonic-platform > /dev/null 2>&1 | |||
if [ $? -ne 0 ]; then | |||
PYTHON2_PLATFORM_PACKAGE=/usr/local/lib/python2.7/dist-packages/sonic_platform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package can be anywhere in PYTHONPATH. Suggest to use pkgutil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the paths can change, and pkgutil should be better tool here.
Also, will the new check of path existence validate both sonic_platform
and sonic-platform
availability? The previous check was doing that, and I see only sonic_platform
in new check.
@@ -59,8 +59,8 @@ if [ $? -ne 0 ]; then | |||
fi | |||
|
|||
# If the Python 3 sonic-platform package is not installed, try to install it | |||
python3 -c "import sonic_platform" > /dev/null 2>&1 || pip3 show sonic-platform > /dev/null 2>&1 | |||
if [ $? -ne 0 ]; then | |||
PYTHON3_PLATFORM_PACKAGE=/usr/local/lib/python3.7/dist-packages/sonic_platform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python minor version might change. Suggest to use pkgutil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for Python2.
@@ -1,4 +1,7 @@ | |||
#!/usr/bin/env bash | |||
|
|||
. /usr/bin/start.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to make sure this won't interffer with create_switch in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a way to make sure this won't interfere in the future.
I think the best approach to improve the performance and ensure the stability of the solution is to delay these services in 90 seconds (fastboot criteria for control plane), this improve the performance even more.
The suggested services to delay are: dhcp_relay, radv, pmon and lldp.
In the future, as part of the HLD I am currently working, we will have a better way to indicate the end of fast-reboot flow and we can start the services according to it.
@liat-grozovik @stepanblyschak @vaibhavhd @yxieca what do you think about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM please wait for @vaibhavhd and @yxieca to confirm if this is ok from user pov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused about what are we proposing. This PR is preempting lldp execution (LLDP consume the CPU before create_switch is executed and not interfering it.
), but we are also saying that in future PR we are going to defer the execution post fast-reboot completion.
Do we need this change right now, if we are soon going to depend upon fast-reboot end-indicator to start the deferred set of services?
I am not sure what is the impact of delaying LLDP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggested solution with fast-reboot end-indicator will indeed defer the set of services to start, but this is planned for next release and for this branch we can optimize it by delaying the services.
Notice that the new infrastructure will be supported only from next release and above since there are operations required on the old image which are used with the new image, just like in warm reboot.
On this case even if the new design will be implemented on 202205, fast-reboot to an image from this release will still fall back to the previous fast-reboot implementation, since DB dump is not available etc.
Regarding delaying all 4 suggested services: dhcp_relay, radv, pmon and lldp
By my testing I didn't see any impact and no errors appeared in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand the new-design in future-branch part.
Still some open questions. I agree that lldp startup sequence can be adjusted to prevent overlap with create-switch. But my doubt was - should it be done before create switch (this PR), or after create-switch (deferred start in 202205)?
Also I believe that this change will affect warm-boot process too. Do you agree? If yes, then title and description should be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to change this PR with the after create-switch approach - deferred start.
Following Stepan comment, this approach is lack of a sync mechanism to ensure it will not overlap again in the future.
The other suggestion is to delay the entire service startup until init flow has finished with a 90 seconds timer and this is much more reliable and a bit similar to the new design of 202205.
I mean, with the new implementation these services will be delayed as well, the only difference is to start them with a different mechanism and not a timer.
I agree it will affect warm-reboot and the description will be modified as well.
If this approach is acceptable I will update all as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vaibhavhd what do you think? Should I proceed and modify the PR to this suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replied back on the email regarding this. Sorry for stalling this. The new suggested approach LGTM.
However, I am not confident about impact of delaying lldp with a 90s timer. May be @prsunny or @yxieca can vet this from lldp point of view.
At this stage, I also would suggest to put two different PRs for pmon and lldp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @vaibhavhd, I will create a PR and test it.
@shlomibitton request change indication was removed. Still need to handle, right? |
Still waiting for @vaibhavhd and @yxieca reply for the new suggestion following Stepan comment. |
@vaibhavhd @liat-grozovik @stepanblyschak @yxieca I created 2 new PR's with the suggested approach, can you please review? Thanks Closing this PR. |
Signed-off-by: Shlomi Bitton [email protected]
Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, some adaptations to init of PMON and LLDP ae required to off load the CPU consumption during create_switch execution.
How I did it
By doing this 2 adaptations PMON is not consuming a lot of CPU during init and LLDP consume the CPU before create_switch is executed and not interfering it.
How to verify it
Run fast-reboot on MLNX platform and observe ~2 seconds less in create_switch execution time.
Attached to this PR, bootchart outputs with the different utilizations
With no optimizations:
After PMON optimizations:
After PMON and LLDP optimizations:
Which release branch to backport (provide reason below if selected)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)