Skip to content

Ensure log streams are flushed before garbage collection when using GCFlushedOutputStream by switching to Cleaner #388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Mar 28, 2025

See #83 (comment). Alternative to #387.

Testing done

See new automated tests. I ran them manually against pipeline-cloudwatch-logs, which does not use the classes changed by this PR and so is unaffected (the new tests are currently skipped, but they pass even if not skipped), as well as CloudBees Pipeline Explorer, which passes with the new tests.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@jglick
Copy link
Member

jglick commented Apr 8, 2025

run it through the PCT

and pipeline-cloudwatch-logs (need an AWS account to run and some vars like AWS_PROFILE, AWS_REGION, AWS_CHAINED_ROLE, CLOUDWATCH_LOG_GROUP_NAME)

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right though I am not familiar with the API.

@dwnusbaum
Copy link
Member Author

I finally had a bit of time to test this. As far as I can see, pipeline-cloudwatch-logs does not care either way, because it does not use BufferedBuildListener, GCFlushedOutputStream, or DelayBufferedOutputStream. I did run its tests though and they passed with this PR as well as #387. Tests in workflow-durable-task-step also passed with both PRs.

The only way I could figure out to make LogStorageTestBase#remoting require GCFlushedOutputStream to pass with this PR was as follows:

  • Replace all uses of DelayBufferedOutputStream with BufferedOutputStream
    • FileLogStorageTest#remoting was still passing
  • Remove the call to flush in LogStorageTestBase$RemotePrint
    • FileLogStorageTest#remoting was failing consistently because the remote log output was never received by the controller
  • Adjust LogStorageTestBase$GC to call System#gc repeatedly in a loop for around 5 seconds so that the cleanup task would run (I tried to investigate why this was needed, but acquiring a heap dump consistently triggered the cleanup task immediately. Calling System#gc once and sleeping for a long time did not consistently trigger cleanup)
    • FileLogStorageTest#remoting was passing, but flaky, because the listeners in the two RemotePrint calls were being flushed in random order
  • Add a call to GC after the first RemotePrint call, to ensure that it gets flushed before the second RemotePrint call
    • FileLogStorageTest#remoting was passing consistently

I then went back and tested the same test changes without this PR.

  • Initially, FileLogStorageTest#remoting was flaky, because the listeners in the two RemotePrint calls were being flushed in random order
  • I added a 10 second sleep call after the two RemotePrint calls (to give the timer task in FlushRef time to run
    • FileLogStorageTest#remoting was now passing consistently
  • I then removed all calls to GC, so that the test no longer called System#gc at all
    • FileLogStorageTest#remoting was still passing consistently (!)
  • I then deleted the 10 second sleep after the two RemotePrint calls, and instead added a 10 second sleep inside of the two RemotePrint calls, at the end of which I printed a log message that referenced the listener, to ensure that it could not be GC'd in that time frame
    • FileLogStorageTest#remoting was still passing consistently, and the log messages showed that FlushRef was running the cleanup task while the stream was still reachable (!)

So, here are my key takeaways:

  • My suspicions in [JENKINS-54566] finalize vs. flush #83 (comment) seem to be correct: The current behavior of GCFlushedOutputStream is a 1-time flush within 10 seconds of the the stream's creation, regardless of the stream's reachability. I suspect the reason that [JENKINS-54566] finalize vs. flush #83 fixed JENKINS-54566 was just that it deleted the problematic finalize override.
  • I can create a test that specifically requires GCFlushedOutputStream for FileLogStorage by using DelayBufferedOutputStream.Tuning to disable the auto-flushing and adding an option to RemotePrint to not flush after writing, but it will be a pretty artificial test
    • pipeline-cloudwatch-logs would pass this test despite having no equivalent to GCFlushedOutputStream, since the only buffer that plugin uses is a LineTransformationOutputStream, so we'd have to write a line with no line ending to see a difference in behavior, but that starts to feel extremely artificial
    • cloudbees-pipeline-explorer should pass this test due to reuse of BufferedBuildListener
    • IDK about opentelemetry
  • It is not clear to me whether there is a real-world case where GCFlushedOutputStream actually matters. Given that we haven't run into any known problems since [JENKINS-54566] finalize vs. flush #83, I am still somewhat inclined to just delete this class.

@dwnusbaum dwnusbaum added the bug label Jul 7, 2025
@dwnusbaum dwnusbaum changed the title Make GCFlushedOutputStream work as intended by switching to Cleaner Ensure log streams are flushed before garbage collection when using GCFlushedOutputStream by switching to Cleaner Jul 7, 2025
Comment on lines +288 to +301
switch (variant) {
case DelayedAutoFlushOnly -> {
// TODO: It would be better to wait for `DelayBufferedOutputStream.flushBuffer` to run exactly once.
Thread.sleep(DelayBufferedOutputStream.Tuning.DEFAULT.minRecurrencePeriod * 3);
}
case GcFlushOnly -> {
listener = null;
// Sleeping and calling `System.gc` more than once seem to be necessary for Cleaner to run reliably.
for (int i = 0; i < 3; i++) {
System.gc();
Thread.sleep(1000);
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in real-world scenarios, where we do not nicely wait after writing each individual line, there is no guarantee of ordering with GCFlushedOutputStream or DelayBufferedOutputStream, and so you can end up with lines in unexpected orders.

Comment on lines +246 to +247
assumeThat("Skipping BufferedBuildListener-specific tests because listener is " + overall,
overall, instanceOf(BufferedBuildListener.class));
Copy link
Member Author

@dwnusbaum dwnusbaum Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary, i.e. the new tests pass against pipeline-cloudwatch-logs, but it seemed somewhat pointless to run tests that configure implementation details of BufferedBuildListener against unrelated listener implementations.

If desired, we could instead move the new tests down into FileLogStorageTest add make the tests here more generic. For example, we'd just have one test that checks that if you write a line and do not flush, the line gets written if you wait for a few seconds. pipeline-cloudwatch-logs would still pass this test without waiting because it doesn't buffer more than a single line. FileLogStorage would only pass after a few seconds due to DelayBufferedOutputStream.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you write a line and do not flush, the line gets written if you wait for a few seconds

Fine, I guess, but no one should rely on that; remote code printing to a TaskListener is expected to flush explicitly.

@dwnusbaum dwnusbaum requested a review from jglick July 7, 2025 22:18
@dwnusbaum dwnusbaum marked this pull request as ready for review July 7, 2025 22:18
@dwnusbaum dwnusbaum requested a review from a team as a code owner July 7, 2025 22:18
@jglick
Copy link
Member

jglick commented Jul 14, 2025

I am still somewhat inclined to just delete this class

Go ahead if you think nothing would be harmed. I only dimly remember the original justification. These days I am mostly working in USE_WATCHING mode in workflow-durable-task-step (I should probably just go ahead and make that the default), which can lose a few lines across controller restarts but I never found a way to prevent that without degrading performance.

private static final Cleaner CLEANER = Cleaner.create(
new NamingThreadFactory(new DaemonThreadFactory(), GCFlushedOutputStream.class.getName() + ".CLEANER"));

static boolean DISABLED = Boolean.getBoolean(GCFlushedOutputStream.class.getName() + ".DISABLED");
Copy link
Member

@jglick jglick Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use SystemProperties generally. Also prefer positive to negative sense, e.g.

static final boolean ENABLED = SystemProperties.getBoolean(XXX + ".enabled", true);

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants