
Reliable data channel #737


Open · wants to merge 16 commits into main from blaze/dc-reliability
Conversation

@pblazej (Contributor) commented Jul 14, 2025

Ports basic concepts from livekit/client-sdk-js#1546

The differences are mostly syntactic, due to how backpressure is handled via events in Swift. Thus, I decided to separate SendBuffer, which is tightly coupled to RTC, from RetryBuffer, which has slightly different semantics.
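For illustration, a minimal sketch of what handling backpressure via events could look like in Swift; the SendGate name and its methods are illustrative assumptions, not this PR's actual types:

// Illustrative only: a gate that suspends writers while the data channel's
// buffered amount is above the low-water threshold, and resumes them when a
// buffered-amount-changed event reports that the buffer has drained.
actor SendGate {
    private let lowThreshold: UInt64
    private var bufferedAmount: UInt64 = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(lowThreshold: UInt64) { self.lowThreshold = lowThreshold }

    // Call this from the data channel's buffered-amount event.
    func bufferedAmountDidChange(to amount: UInt64) {
        bufferedAmount = amount
        guard amount <= lowThreshold else { return }
        for waiter in waiters { waiter.resume() }
        waiters.removeAll()
    }

    // Writers await this before sending the next packet.
    func waitUntilBelowThreshold() async {
        guard bufferedAmount > lowThreshold else { return }
        await withCheckedContinuation { waiters.append($0) }
    }
}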

private static let reliableLowThreshold: UInt64 = 2 * 1024 * 1024 // 2 MB
private static let lossyLowThreshold: UInt64 = reliableLowThreshold

// Keep at least N reliable packets while draining webrtc buffers.
// Should prevent losing packets when a retry is requested from a lastSeq that has already been dequeued.
private static let reliableRetryCapacity = 32
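
For context, a minimal sketch of the retry-buffer idea behind reliableRetryCapacity: keep the last N packets around even after webrtc has accepted them, so a retry from lastSeq can still be served once the RTC buffer has drained. RetryBuffer and its members below are illustrative assumptions, not this PR's actual API:

import Foundation

struct RetryBuffer {
    struct Entry {
        let sequence: UInt64
        let data: Data
    }

    private var entries: [Entry] = []
    private let capacity: Int // e.g. reliableRetryCapacity = 32

    init(capacity: Int) { self.capacity = capacity }

    // Remember a packet after it has been handed to the data channel.
    mutating func append(sequence: UInt64, data: Data) {
        entries.append(Entry(sequence: sequence, data: data))
        if entries.count > capacity {
            entries.removeFirst(entries.count - capacity)
        }
    }

    // Packets with sequence >= fromSequence, or nil if the oldest retained
    // packet is already newer (i.e. there is a gap and the retry cannot be served).
    func packets(from fromSequence: UInt64) -> [Entry]? {
        guard let oldest = entries.first, oldest.sequence <= fromSequence else { return nil }
        return entries.filter { $0.sequence >= fromSequence }
    }
}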
@pblazej (Contributor Author):

This is kinda conservative, but during device testing I discovered that a gap in the history may occur:

[Screenshot 2025-07-14 at 10:12:57 AM: gap in the packet history]

When we drain our buffer to 0 as soon as the RTC part does... so drained != sent 😭

We could also keep a fixed-size queue of, let's say, 1 MB - the RAM sacrifice is negligible if we want real consistency here.
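
A minimal sketch of that byte-capped alternative; the ByteCappedRetryBuffer name and the budget below are illustrative, not part of this PR:

import Foundation

struct ByteCappedRetryBuffer {
    private var entries: [(sequence: UInt64, data: Data)] = []
    private var totalBytes = 0
    private let byteBudget: Int // e.g. 1 * 1024 * 1024

    init(byteBudget: Int) { self.byteBudget = byteBudget }

    // Evict the oldest packets once the retained history exceeds the byte
    // budget, instead of capping by packet count.
    mutating func append(sequence: UInt64, data: Data) {
        entries.append((sequence: sequence, data: data))
        totalBytes += data.count
        while totalBytes > byteBudget, let first = entries.first {
            totalBytes -= first.data.count
            entries.removeFirst()
        }
    }
}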

Contributor:

In what test cases did this happen?
Curious if you have repro steps, so we can ensure we don't have the same problem on the JS side.

@pblazej (Contributor Author):

It happened with a real network (real latency). The local server (used for unit tests) is so fast that the reconnection delay may not even interfere with the transfer, so it mostly exercises the "duplicate" logic.

To test it "live", I'm just sending large packets indefinitely, firing a Quick reconnect in the meantime, then the server requests a retry:

// Stream ~100 KB of text every 100 ms; trigger a Quick reconnect manually while this runs.
while !Task.isCancelled {
  try? await room.localParticipant.streamText(for: "asd").write(String(repeating: "a", count: 1024 * 100))
  try? await Task.sleep(100 * NSEC_PER_MSEC)
}

@pblazej (Contributor Author):

Maybe it's still specific to the webrtc internals or even the iOS network stack, but worth checking 💣

@pblazej (Contributor Author):

@lukasIO here's an example with bigger test data: 4f55e6f

However, keep in mind that:

  • everything may depend on timing here, from yielding in the cooperative concurrency runtime to network delays, chunk size, etc.
  • the CLI requests retry from 0 - it looks like it does not respect the "receive state" part yet

So this is just an attempt to replicate the real-world conditions.

@lukasIO requested a review from cnderrauber · July 14, 2025 09:19
@cnderrauber left a comment:

looks good!

@pblazej force-pushed the blaze/dc-reliability branch from 5d99fc4 to 6c3b038 · July 14, 2025 13:39
@pblazej force-pushed the blaze/dc-reliability branch from ba7be42 to d6d72b1 · July 15, 2025 11:44
@pblazej (Contributor Author) commented Jul 15, 2025

@cnderrauber @lukasIO for the record:

I did a lot of semi-automated monkey-testing on iOS and macOS (e.g. reconnecting every N seconds while streaming data). It turns out that it depends on how you test it, i.e. how hard you stress it.

By stress, I mostly mean whether you sleep 🛌 or not in between calls to stream....

If you keep streaming continuously, you have to buffer at least the full webrtc capacity (the backpressure amount) plus a few packets (I just did 2x backpressure). Otherwise, you get:

2025-07-15T15:37:37+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 913 > 910, 3 packets missing
2025-07-15T15:37:43+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 1698 > 1694, 4 packets missing
2025-07-15T15:37:48+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 2508 > 2502, 6 packets missing
2025-07-15T15:37:53+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 3346 > 3342, 4 packets missing
2025-07-15T15:37:58+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 4178 > 4176, 2 packets missing
2025-07-15T15:38:04+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 5015 > 5011, 4 packets missing
2025-07-15T15:38:14+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 6682 > 6680, 2 packets missing
2025-07-15T15:38:25+0200 warning LiveKitSDK : [LiveKit] DataChannelPair.retry(buffer:from:) Wrong packet sequence while retrying: 8113 > 8108, 5 packets missing

Once you introduce sleep, this requirement is relaxed and you can let the buffer drain.

So in my case, it looks like data that has already entered webrtc can be lost (as well as the in-flight data, those few packets).

As I said, it may depend on everything: the webrtc stack, Apple's own POSIX implementation, etc.
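
A minimal sketch of the rule of thumb described above (purely illustrative, not the SDK's actual code; only the 2 MB reliableLowThreshold comes from the constants quoted earlier):

// Even when webrtc reports bufferedAmount == 0, keep at least one extra
// backpressure window's worth of already-"sent" bytes in the retry history,
// because packets that entered webrtc (plus in-flight data) can still be
// lost across a quick reconnect.
func retainedBytesSafeToDrop(retainedBytes: UInt64, rtcBufferedAmount: UInt64) -> UInt64 {
    let backpressureWindow: UInt64 = 2 * 1024 * 1024 // reliableLowThreshold
    // Roughly "2x backpressure" when the RTC buffer is full.
    let mustKeep = rtcBufferedAmount + backpressureWindow
    return retainedBytes > mustKeep ? retainedBytes - mustKeep : 0
}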

@pblazej force-pushed the blaze/dc-reliability branch from 5948773 to 98a206d · July 16, 2025 12:05