
[C++ repro] Logging lots of images loses data #9818

Closed
abey79 opened this issue Apr 28, 2025 · 8 comments · Fixed by #9846
Assignees: grtlr
Labels: 🪳 bug (Something isn't working) · 🌊 C++ API (C/C++ API specific) · 🦟 regression (A thing that used to work in an earlier release) · user-request (This is a pressing issue for one of our users)
Milestone: 0.23.2

Comments

@abey79 (Member) commented Apr 28, 2025

A user-provided C++ repro highlights loss of logged data.

Expected: all data is logged, 50 frames on the timeline
Actual: partial data, only ~33 frames

Repro repo: https://github.com/ExpertOfNil/rerun_cpp_mve
Discord thread: https://discord.com/channels/1062300748202921994/1364955449602084874
Slack thread: https://rerunio.slack.com/archives/C041NHU952S/p1745826038111429
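
For context, the overall repro pattern boils down to something like the following. This is a minimal sketch, not the actual repro code: the entity path, image size, and pixel contents are placeholders, and it assumes the 0.23-era C++ API (`connect_grpc`, `Image::from_rgb24`, `set_time_sequence`).

```cpp
#include <rerun.hpp>

#include <cstdint>
#include <vector>

int main() {
    rerun::RecordingStream rec("rerun_cpp_mve");
    // Connect to a running viewer over gRPC (default URL).
    rec.connect_grpc().exit_on_failure();

    const uint32_t width = 640, height = 480;  // placeholder resolution
    std::vector<uint8_t> pixels(static_cast<size_t>(width) * height * 3, 128);

    for (int frame = 0; frame < 50; ++frame) {
        rec.set_time_sequence("frame", frame);
        rec.log("mve/image", rerun::Image::from_rgb24(pixels, {width, height}));
    }

    // The process exits here. Expected: 50 frames on the timeline.
    // Observed: only ~33 frames make it to the viewer.
}
```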

@abey79 abey79 added 👀 needs triage This issue needs to be triaged by the Rerun team 🪳 bug Something isn't working 🌊 C++ API C/C++ API specific and removed 👀 needs triage This issue needs to be triaged by the Rerun team labels Apr 28, 2025
@abey79 abey79 added this to the 0.23.2 milestone Apr 28, 2025
@abey79 (Member, Author) commented Apr 28, 2025

Adding a `rec.flush_blocking()` at the end does NOT fix the issue.

Adding `while (true) {}` does "fix" the issue, which suggests a possible short-term workaround.
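
Concretely, the two experiments at the end of the repro's `main` (sketch; `rec` is the repro's `RecordingStream`):

```cpp
// Experiment 1: explicit flush before exit -- data is still lost.
rec.flush_blocking();

// Experiment 2: keep the process alive indefinitely -- all frames arrive.
while (true) {}
```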

@abey79 abey79 added the user-request This is a pressing issue for one of our users label Apr 28, 2025
@grtlr grtlr self-assigned this Apr 28, 2025
@ExpertOfNil (Contributor) commented:

Just a point of clarification, in case it is helpful: each image should be logged 4 times (representing the actual application, which logs various permutations): 3 times prior to `rr_log_pose_estimation` and once more within `rr_log_pose_estimation`. The end result should be 200 logged images: 50 under `mve/original`, 50 under `mve/Image1`, 50 under `mve/Image2`, and 50 under `mve/image`, along with a pinhole and transform. Let me know if this is not accurately represented in my repro. Otherwise, I'll update the notes to add these details.
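
For reference, that per-frame pattern would look roughly like this (a sketch only; the pixel buffer, resolution, focal length, and translation are made-up placeholders):

```cpp
// Three permutations logged prior to rr_log_pose_estimation:
rec.log("mve/original", rerun::Image::from_rgb24(pixels, {width, height}));
rec.log("mve/Image1", rerun::Image::from_rgb24(pixels, {width, height}));
rec.log("mve/Image2", rerun::Image::from_rgb24(pixels, {width, height}));

// ...and once more within rr_log_pose_estimation, along with a pinhole
// camera and a transform on the parent entity:
rec.log("mve/image", rerun::Image::from_rgb24(pixels, {width, height}));
rec.log(
    "mve/image",
    rerun::Pinhole::from_focal_length_and_resolution(500.0f, {640.0f, 480.0f})
);
rec.log("mve", rerun::Transform3D::from_translation({0.0f, 0.0f, 1.0f}));
```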

@abey79 (Member, Author) commented Apr 28, 2025

This is what I'm getting with the `while (true) {}` trick. I'm assuming that this is the "expected" state:

[Image: viewer screenshot]

@ExpertOfNil (Contributor) commented:

That looks correct! I'm assuming you used `while (true) {}` in `main.cpp` to keep the process alive?

@abey79 (Member, Author) commented Apr 28, 2025

> That looks correct! I'm assuming you used `while (true) {}` in `main.cpp` to keep the process alive?

Yes, exactly.

@Wumpf Wumpf added the 🦟 regression A thing that used to work in an earlier release label Apr 28, 2025
@ExpertOfNil (Contributor) commented Apr 28, 2025

I was able to reproduce completion of all logs with `while (true) {}`. I also killed the process after pose 40, which seemed to drop the remainder of the logs.

Would it be reasonable to provide a mechanism for flushing the `RecordingStream`, or for checking whether it has been fully flushed, prior to shutting down the main process?

@abey79 (Member, Author) commented Apr 28, 2025

> Would it be reasonable to provide a mechanism for flushing the `RecordingStream`, or for checking whether it has been fully flushed, prior to shutting down the main process?

Well, it was very much my expectation that `rec.flush_blocking()` would do exactly that, but that's apparently not the case, so this may well be a bug. @grtlr is currently looking into all of this and will report back as soon as we figure it out.

@grtlr (Contributor) commented May 2, 2025

Thank you so much @ExpertOfNil 🙏! Your repro pointed out a very subtle difference in our C++ SDK and made us realize that our logging level was wrong.

For the 0.23.2 release we will create a patch that warns on data loss. We will properly address the underlying cause of this issue in the 0.24 release (which will probably lead to subtle changes to the API).

You can follow along with that line of development here:

Until we have the full fix, I would suggest bumping the `flush_timeout_sec` parameter in your `connect_grpc` call to a value that prevents data loss.
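
For example (a sketch; the URL shown is the SDK's default, and 10 seconds is an arbitrary placeholder — pick a value large enough for your data volume):

```cpp
// Allow the sink up to 10 seconds to drain on shutdown instead of the default.
rec.connect_grpc("rerun+http://127.0.0.1:9876/proxy", 10.0f).exit_on_failure();
```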

grtlr added a commit that referenced this issue May 2, 2025
### Related

* Closes #9818.

### What

> [!IMPORTANT]
> This PR also changes the way `RecordingStream` is freed in the C/C++ API. Before, we called `stream.disconnect`, which unnecessarily replaced the current sink with a _buffered_ sink that would be immediately dropped afterwards. Not only did this cause spam in the log outputs, it also led to race conditions on (logging) application shutdown.

This PR makes it more explicit why we drop data during flushing, by bumping the log messages to `warn!`. It also improves the message by pointing users to `flush_timeout`.

We also bump the default timeout from two seconds to three seconds.

It's worth noting that explicitly calling `flush_blocking` from our SDKs should be able to opt out of this timeout, to ensure all data is sent. This will be tracked here:

* #9845.
abey79 pushed a commit that referenced this issue May 2, 2025