
[improve][ci] Add Netty leak detection reporting to Pulsar CI #24272


Merged

Conversation

@lhotari (Member) commented May 8, 2025

Motivation

Memory leaks in Netty buffers can lead to performance degradation, resource exhaustion, and unpredictable failures in production. A key challenge we face is that these leaks often go undetected until they cause issues in production environments.

Currently, our CI pipeline cannot systematically detect Netty buffer leaks before they reach production code, a blind spot that lets buffer leak regressions slip through our quality gates. This PR closes that gap by adding Netty leak detection and reporting to the Pulsar CI pipeline.

The implementation follows a staged approach:

  1. First, enable leak detection and reporting without failing builds
  2. Fix identified leaks in both test and production code
  3. Eventually, enable strict enforcement where CI builds would fail when leaks are detected

This approach will allow us to establish a clean baseline and create an automated safety net that prevents future regressions in Netty buffer management. By catching these issues early in the development cycle, we can significantly improve system stability and resource efficiency in production environments.

Modifications

  • Added a custom ExtendedNettyLeakDetector implementation that extends Netty's ResourceLeakDetector (see the first sketch after this list)
  • Configured leak detection to output detailed reports to files in a designated directory (NETTY_LEAK_DUMP_DIR)
  • Added reporting steps to all CI workflows to display leaks in GitHub Actions UI
  • Updated container configurations to pass appropriate system properties for leak detection
  • Modified build configurations to enable leak detection with appropriate settings
  • Enhanced PulsarTestListener to trigger leak detection at key test lifecycle events (see the second sketch after this list)
  • Added capability to collect and report leaks from integration tests
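
To make the detector bullet concrete, here is a minimal sketch of what a file-dumping leak detector can look like. It is an illustration rather than the PR's actual ExtendedNettyLeakDetector: the file naming, the fallback directory, and reading NETTY_LEAK_DUMP_DIR as a system property are assumptions, while ResourceLeakDetector and its reportTracedLeak/reportUntracedLeak hooks are real Netty API.

```java
import io.netty.util.ResourceLeakDetector;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: in addition to Netty's default leak logging, dump every
// reported leak to its own file so CI steps can collect and display them.
public class ExtendedNettyLeakDetector<T> extends ResourceLeakDetector<T> {
    // Assumption: NETTY_LEAK_DUMP_DIR is passed as a JVM system property.
    private static final Path DUMP_DIR =
            Paths.get(System.getProperty("NETTY_LEAK_DUMP_DIR", "target/netty-leak-dumps"));
    private static final AtomicInteger LEAK_COUNTER = new AtomicInteger();

    // Netty's ResourceLeakDetectorFactory instantiates custom detectors
    // reflectively and looks for these constructor shapes, so both are provided.
    public ExtendedNettyLeakDetector(Class<?> resourceType, int samplingInterval) {
        super(resourceType, samplingInterval);
    }

    @SuppressWarnings("deprecation")
    public ExtendedNettyLeakDetector(Class<?> resourceType, int samplingInterval, long maxActive) {
        super(resourceType, samplingInterval, maxActive);
    }

    @Override
    protected void reportTracedLeak(String resourceType, String records) {
        super.reportTracedLeak(resourceType, records);
        dumpToFile(resourceType, records);
    }

    @Override
    protected void reportUntracedLeak(String resourceType) {
        super.reportUntracedLeak(resourceType);
        dumpToFile(resourceType, "(no access records; use paranoid level for traces)");
    }

    private void dumpToFile(String resourceType, String records) {
        try {
            Files.createDirectories(DUMP_DIR);
            Path dumpFile = DUMP_DIR.resolve("leak-" + LEAK_COUNTER.incrementAndGet() + ".txt");
            Files.write(dumpFile, (resourceType + "\n" + records).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Leak reporting must never break the build it is observing.
            System.err.println("Failed to write Netty leak dump: " + e);
        }
    }
}
```

Netty picks up a custom detector via -Dio.netty.customResourceLeakDetector=<fully qualified class name>, and the detail level is controlled with -Dio.netty.leakDetection.level (paranoid gives per-access traces); the CI reporting steps then only need to collect whatever files appear in the dump directory.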
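
Similarly, here is a sketch of triggering leak detection at test lifecycle events, assuming a TestNG listener in the spirit of PulsarTestListener (the class and method names below are otherwise hypothetical). Netty only drains its leak reference queue when a tracked resource is allocated, so the usual trick is a GC followed by a few throwaway allocate/release cycles:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import org.testng.ITestListener;
import org.testng.ITestResult;

// Sketch only: nudges Netty into reporting pending leaks after each test,
// so a leak shows up next to the test that caused it.
public class LeakDetectingTestListener implements ITestListener {

    @Override
    public void onTestSuccess(ITestResult result) {
        triggerLeakDetection();
    }

    @Override
    public void onTestFailure(ITestResult result) {
        triggerLeakDetection();
    }

    private static void triggerLeakDetection() {
        // Make unreachable-but-unreleased buffers collectable...
        System.gc();
        // ...then allocate and release a few buffers: ResourceLeakDetector
        // polls its reference queue (and reports leaks) from the track() path,
        // on every allocation when the level is paranoid.
        for (int i = 0; i < 16; i++) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(8);
            buf.release();
        }
    }
}
```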

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete


@lhotari lhotari marked this pull request as draft May 8, 2025 18:43
@codecov-commenter commented May 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.25%. Comparing base (bbc6224) to head (0a8ad88).
Report is 1090 commits behind head on master.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #24272      +/-   ##
============================================
+ Coverage     73.57%   74.25%   +0.67%     
+ Complexity    32624    32615       -9     
============================================
  Files          1877     1866      -11     
  Lines        139502   145054    +5552     
  Branches      15299    16580    +1281     
============================================
+ Hits         102638   107706    +5068     
+ Misses        28908    28835      -73     
- Partials       7956     8513     +557     
Flag        Coverage Δ
inttests    26.67% <ø> (+2.08%) ⬆️
systests    23.28% <ø> (-1.04%) ⬇️
unittests   73.73% <ø> (+0.89%) ⬆️

Flags with carried forward coverage won't be shown.

see 1085 files with indirect coverage changes


@lhotari (Member, Author) commented May 9, 2025

The overhead of adding Netty leak detection is not negligible. In some jobs there's a 30% increase, but this is based on a single run, and there's a lot of variance in build times in GitHub Actions CI, perhaps due to "noisy neighbors".
I'd assume that typical overhead is around 15-25%, which means that individual build jobs will take a few minutes longer.
We could split tests across build jobs more evenly to address the increase in build time.

[chart: job duration comparison, runs 14906163209 vs 14912589619]

@lhotari lhotari marked this pull request as ready for review May 9, 2025 10:16
@liangyepianzhou (Contributor) left a comment

Great work! I left a small comment.

@lhotari lhotari merged commit f51123c into apache:master May 9, 2025
55 checks passed
lhotari added a commit that referenced this pull request May 9, 2025
manas-ctds pushed commits to datastax/pulsar that referenced this pull request May 14-19, 2025 (9 commits)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request May 19, 2025