
Fix BatchSpanProcessor.Shutdown to wait until all spans are processed #766


Merged — 6 commits, Jun 9, 2020

Conversation

vmihailenco
Contributor

Currently it exits too soon - before drainQueue is finished
@Aneurysm9
Member

Is there a test case that can be added to demonstrate the faulty behavior and ensure it isn't reintroduced?

@jmacd
Contributor

jmacd commented May 26, 2020

This might never exit, I think.

I believe it's WAI (working as intended). If you shut down the processor while spans are still being generated, some of them will not be recorded. What are the semantics you want here?

@vmihailenco
Contributor Author

I would like to be able to understand what is "working as intended" here and how it might never exit. This is not the first time I have heard that from you, so I should point out that it does not help in any way; it just politely says "no" without any explanation and leaves me guessing.

If you shutdown the processor while there are still spans being generated, some of them will not be recorded

Given that after shutdown spans are simply not added to the internal queue (channel), I don't see what there is to discuss. You are right: they are discarded.

What are the semantics you want, here?

I am not particularly interested in any behavior here. But the code does not do what the comment says and what common sense suggests. Perhaps it makes sense to update the comment. Closing since it is WAI and I don't use this.

@jmacd
Contributor

jmacd commented May 27, 2020

I could be wrong, but it looked like drainQueue() would continue to process spans as long as sufficiently many of them are generated to keep the queue from ever fully draining. I think we should expose a Flush() operation to flush the processor -- there is also a spec issue written on this topic -- which would let a user flush and then shut down without losing spans.

@jmacd
Contributor

jmacd commented May 27, 2020

You're correct. It's clear that I'm not providing an effective review here, and I'm sorry. I've explained why I would not use this code (because it does not provide a way to limit encoded batch size), and that I only became involved because flaky tests were interfering with the metrics SDK development.

It worries me that we have disabled the tests for this class and are still making changes. What you point out is true, and we could either remove stopCh and drainQueue or we could make this change. I am concerned that even with this change, the existing test wouldn't pass.

The underlying problem seems to be that there is no "correct" way to shut down a processor when there are concurrent writers. I now believe that adding Flush() is the best path forward. We can write a correctness test for Flush() that ensures all spans are written, then all we need to test is that Shutdown closes channels, doesn't block indefinitely, and doesn't panic.

@jmacd jmacd reopened this May 27, 2020
@vmihailenco
Contributor Author

vmihailenco commented May 28, 2020

I see at least 2 reasons why the tests were flaky:

  • Shutdown exits before drainQueue is finished.
  • Some tests don't use WithBlocking, so spans can be dropped depending on scheduling luck.

This PR fixes both, so hopefully the tests pass reliably again. And it is not as if the tests previously worked properly: they passed, but they did not verify that the code works correctly.

The underlying problem seems to be that there is no "correct" way to shut down a processor when there are concurrent writers.

Agreed. Shutdown only guarantees that spans added before the shutdown are processed. It does not help with concurrent writers - you should stop/synchronize them separately.

Tests should work fine because all spans are added before Shutdown is called, and the tests now use WithBlocking, so spans can't be dropped when the queue is full.

@jmacd
Contributor

jmacd commented May 28, 2020

The last time I wrangled with this topic, I did add WithBlocking() in the tests, but I wasn't able to make the test pass. We discussed this in the SIG call today, and I think the best we could do in terms of giving the user proper semantics would be to add a Flush() call (with tests) and then simplify Shutdown.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 8, 2020

CLA Check

@jmacd
Contributor

jmacd commented Jun 8, 2020

Thank you @vmihailenco, sorry for the delay, and sorry for several mistakes of mine in reviewing this code.

@jmacd
Contributor

jmacd commented Jun 8, 2020

@lizthegrey as our compliance officer, would you look at the error above (#766 (comment))? I can't see why the build is green given that message.

@jmacd
Contributor

jmacd commented Jun 8, 2020

@lizthegrey nevermind. I see you posted on another PR that we have to re-authorize the CLA.
However, this makes me think the test is broken because the build is green. Thoughts?

@jmacd
Contributor

jmacd commented Jun 9, 2020

This restores tests! Merging.

@jmacd jmacd merged commit 7ebd7b5 into open-telemetry:master Jun 9, 2020
@MrAlias MrAlias linked an issue Jun 10, 2020 that may be closed by this pull request
@pellared pellared added this to the untracked milestone Nov 8, 2024
Successfully merging this pull request may close these issues.

Batch Span Processor unit test disabled