Skip to content

Refactor ledd daemon & fix high CPU usage due to unexpected socket close #548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 23, 2025

Conversation

prgeor
Copy link
Collaborator

@prgeor prgeor commented Oct 10, 2024

…close

Description

[ledd] Refactor ledd daemon and fix high CPU usage due to unexpected redis DB socket close

As part of the refactor, the LED port color is changed to oper UP only if all subports of a front panel ports are UP.
If any subport is oper DOWN, the LED port color is changed to oper DOWN

Motivation and Context

For unkown reason if redis closes the socket, then ledd daemon is continuously busy waiting for socket event and failing all the time. This is resulting in high CPI

The change in Port LED behavior is necessary to prevent situation in the old behavior where even if some subports are oper down, the front panel port LED should not glow green.

How Has This Been Tested?

Close the socket abruptly and see ledd daemon is respawned so that new socket connection is established to redis DB without causing high CPU.

Test this on Arista 7060X6 with 8x100G mode and 2x400G mode

Additional Information (Optional)

@prgeor prgeor requested a review from Staphylo October 10, 2024 21:35
@prgeor
Copy link
Collaborator Author

prgeor commented Oct 10, 2024

@Staphylo could you please help review?

@prgeor prgeor requested a review from assrinivasan October 17, 2024 23:15
@prgeor
Copy link
Collaborator Author

prgeor commented Oct 17, 2024

@assrinivasan @vvolam please review

@vvolam
Copy link
Contributor

vvolam commented Oct 18, 2024

@prgeor nit: spelling issue in PR description.

image

Also, could you fix the build failure?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor prgeor changed the title Refactor ledd daemon and fix high CPU usage due to unexpected socket … Refactor ledd daemon & fix high CPU usage due to unexpected socket close Feb 15, 2025
@prgeor
Copy link
Collaborator Author

prgeor commented Feb 15, 2025

@prgeor nit: spelling issue in PR description.

image

Also, could you fix the build failure?

@vvolam fixed

@prgeor prgeor requested a review from vvolam February 15, 2025 23:24
if fvp:
# TODO: Once these flag entries have been removed from the DB,
# we can remove this check
if key in ["PortConfigDone", "PortInitDone"]:
return 3
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ignoring bad fvp or key not present case now?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Prince George <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor
Copy link
Collaborator Author

prgeor commented Apr 4, 2025

@davidm-arista @bobbymcgonigle can someone review this PR?

@@ -124,10 +276,11 @@ def main():

ledd = DaemonLedd()

# Listen indefinitely for changes to the PORT table in the Application DB's
# Listen indefinitely for port oper status changes
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this while loop within the run routine ?
The way xcvrd/ycabled run change event monitoring loop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure we can ...can this be done in next iteration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works


def isFrontPanelPort(self):
return multi_asic.is_front_panel_port(self._name, self._role)
class FrontPanelPorts:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need an empty line before here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

@byu343 byu343 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# Load platform-specific LedControl module
try:
self.led_control = self.load_platform_util(LED_MODULE_NAME, LED_CLASS_NAME)
led_control = self.load_platform_util(LED_MODULE_NAME, LED_CLASS_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe can you explain, why are we not storing this as a class variable ?

ledd.run()

if 0 != ledd.run():
print("ledd.run() failed... Exiting")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add/modify this a syslog here please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdahiya12 I just kept the behavior as old one. if you see the main() function, it has many other prints....these exit log do come to syslog ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will catch it in supervisor logs, the print logs can remain. Ideally, we should have disallowed print statements as is, they won't go to syslog as I recall, but supervisor logs will if the daemon hangs/exits. For now we can keep as is.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca yxieca merged commit 6dfb552 into sonic-net:master Apr 23, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants