Fix queue corruption in memberlist's TransmitLimitedQueue #324
What I ran into
While running an integration test (Go 1.21+) I hit a 100% reproducible `timeout waiting for update broadcast` whenever several nodes call `SetTags` concurrently. The test panics with `node 1: timeout waiting for update broadcast`.
Root cause analysis
`SetTags` turns the tag update into a `NamedBroadcast` held inside a `TransmitLimitedQueue` (TLQ).

1. The queued broadcast (`item1`, `id = 1`) is taken out → sent → deleted and re-inserted into the queue.
2. `idGen` is reset to 0. The re-inserted item still keeps its old `id = 1`.
3. [...] a peer. A new TLQ entry (`item2`) is created and gets the same `id = 1` (`idGen` restarted).
4. `ReplaceOrInsert` treats `item1` and `item2` as the same key, silently overwriting the in-flight broadcast without calling `Finished()`.

The goroutine waiting in `SetTags` is never unblocked → timeout.
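The collision described above can be replayed in a small, self-contained model. This is only a sketch: `toyQueue`, `item`, and `reproduce` are hypothetical names, and the model assumes items are keyed by `id` alone, which is a simplification and not memberlist's actual `TransmitLimitedQueue` implementation.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a queued broadcast.
type item struct {
	id       int64
	name     string
	finished bool // would be set by a Finished() callback; never set in this model
}

// toyQueue is a simplified model, NOT memberlist's TransmitLimitedQueue.
// Assumption: items are keyed by id alone.
type toyQueue struct {
	items map[int64]*item
	idGen int64
}

// replaceOrInsert stores it under its id and returns any different item it displaced.
func (q *toyQueue) replaceOrInsert(it *item) *item {
	old := q.items[it.id]
	q.items[it.id] = it
	if old == it {
		return nil
	}
	return old
}

// queue stamps a fresh id on a new item and inserts it.
func (q *toyQueue) queue(name string) (added, displaced *item) {
	q.idGen++
	added = &item{id: q.idGen, name: name}
	return added, q.replaceOrInsert(added)
}

// remove deletes an item and resets idGen once the queue is empty.
func (q *toyQueue) remove(it *item) {
	delete(q.items, it.id)
	if len(q.items) == 0 {
		q.idGen = 0
	}
}

// reproduce replays the steps and returns item2 plus whatever it displaced.
func reproduce() (item2, displaced *item) {
	q := &toyQueue{items: map[int64]*item{}}
	item1, _ := q.queue("item1") // id = 1
	q.remove(item1)              // queue is empty, so idGen goes back to 0
	q.replaceOrInsert(item1)     // re-inserted, still carrying id = 1
	return q.queue("item2")      // idGen restarts, so item2 also gets id = 1
}

func main() {
	item2, displaced := reproduce()
	fmt.Printf("item2 id=%d\n", item2.id)
	if displaced != nil {
		fmt.Printf("displaced %s, finished=%v\n", displaced.name, displaced.finished)
	}
}
```

In this model `item2` displaces `item1` without anything marking `item1` as finished, which mirrors the waiter in `SetTags` that is never unblocked.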
The fix
Simply remove the line that resets `idGen` when the queue becomes empty.
All updated unit & integration tests pass.
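As a rough illustration of why dropping the reset is enough, the sketch below replays the same steps against a toy queue whose id counter only ever increases. `monotonicQueue` and `idsAfterFix` are hypothetical names for illustration, and the map keyed by id is an assumption, not the library's real data structure.

```go
package main

import "fmt"

// monotonicQueue is a hypothetical toy model, not memberlist's real queue.
type monotonicQueue struct {
	items map[int64]string
	idGen int64
}

// queue stamps the next id on a new entry and stores it.
func (q *monotonicQueue) queue(name string) int64 {
	q.idGen++
	q.items[q.idGen] = name
	return q.idGen
}

// remove deletes an entry. Unlike the buggy variant, idGen is left alone
// even when the queue becomes empty.
func (q *monotonicQueue) remove(id int64) {
	delete(q.items, id)
}

// idsAfterFix replays the failing sequence and reports both ids and the queue size.
func idsAfterFix() (id1, id2 int64, size int) {
	q := &monotonicQueue{items: map[int64]string{}}
	id1 = q.queue("item1")
	q.remove(id1)          // queue is empty, idGen stays at 1
	q.items[id1] = "item1" // re-inserted under its old id
	id2 = q.queue("item2") // receives a fresh id, so nothing is overwritten
	return id1, id2, len(q.items)
}

func main() {
	id1, id2, size := idsAfterFix()
	fmt.Println(id1, id2, size)
}
```

Because the counter never restarts, a re-inserted entry and a newly queued entry cannot share an id in this model, and an `int64` counter leaves ample headroom before wrapping.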