Fix queue corruption in memberlist's TransmitLimitedQueue #324


Open

ktalg wants to merge 1 commit into master

Conversation

@ktalg ktalg commented Jun 13, 2025

What I ran into

While running the following integration test (Go 1.21+), I hit a 100% reproducible timeout waiting for update broadcast whenever several nodes call SetTags concurrently:

package main

import (
	"fmt"
	"github.com/hashicorp/serf/serf"
	"io"
	"log"
	"strconv"
	"strings"
	"sync"
	"time"
)

// start launches a Serf node and returns the instance
func start(name string, join []string, c ...func(config *serf.Config)) (*serf.Serf, error) {
	cfg := serf.DefaultConfig()
	cfg.Init()

	cfg.NodeName = name
	cfg.MemberlistConfig.BindAddr = "127.0.0.1"
	cfg.MemberlistConfig.BindPort = 0
	// Aggressive settings to trigger the issue faster
	cfg.MemberlistConfig.RetransmitMult = 2
	cfg.MemberlistConfig.GossipInterval = time.Millisecond * 50
	cfg.BroadcastTimeout = 5 * time.Second
	cfg.EventBuffer = 500
	cfg.LogOutput = io.Discard

	for _, c := range c {
		c(cfg)
	}
	s, err := serf.Create(cfg)
	if err != nil {
		return nil, err
	}

	if len(join) > 0 {
		if _, err := s.Join(join, true); err != nil {
			return nil, err
		}
	}
	return s, nil
}

func main() {
	node1, err := start("s", nil, func(config *serf.Config) {
		config.MemberlistConfig.LogOutput = io.Discard
	})
	if err != nil {
		log.Fatalf("node1: %v", err)
	}

	nodes := []*serf.Serf{node1}
	totalNodes := 3
	for i := 0; i < totalNodes-1; i++ {
		addr := fmt.Sprintf("%s:%d", nodes[0].LocalMember().Addr.String(), nodes[0].LocalMember().Port)
		s, err := start(strconv.Itoa(i), []string{addr})
		if err != nil {
			log.Fatalf("node%d: %v", i, err)
		}
		nodes = append(nodes, s)
	}

	// Wait until all nodes see each other
	for _, s := range nodes {
		for s.NumNodes() != totalNodes {
			time.Sleep(10 * time.Millisecond)
		}
	}
	fmt.Println("Cluster established, starting concurrent SetTags.")

	round := 0
	for {
		round++
		var wg sync.WaitGroup
		wg.Add(len(nodes))
		for _, node := range nodes {
			// Capture loop variables
			currentNode := node
			nodeName := currentNode.LocalMember().Name

			go func() {
				defer wg.Done()
				// Each goroutine updates its own node's tags
				err := currentNode.SetTags(map[string]string{"round": strconv.Itoa(round) + strings.Repeat("a", 400)})
				if err != nil {
					// This is where the panic happens
					panic(fmt.Sprintf("node %s: %v", nodeName, err))
				}
				fmt.Printf("Done: %s, Round: %d\n", nodeName, round)
			}()
		}
		wg.Wait()
		fmt.Printf("--- Round %d finished ---\n", round)
	}
}

The program eventually panics with: node 1: timeout waiting for update broadcast
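
For context, the timeout message comes from SetTags blocking until the queued broadcast signals completion. A minimal sketch of that wait pattern, assuming the usual notify-channel contract (waitForBroadcast is a hypothetical helper, not serf's actual code):

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForBroadcast models how a tag update waits to be gossiped: the queue is
// expected to call Finished() on the broadcast, which closes notifyCh.
func waitForBroadcast(notifyCh <-chan struct{}, timeout time.Duration) error {
	select {
	case <-notifyCh:
		return nil // broadcast transmitted (or invalidated) and Finished() ran
	case <-time.After(timeout):
		return errors.New("timeout waiting for update broadcast")
	}
}

func main() {
	notifyCh := make(chan struct{})
	// If the queue silently drops the broadcast, nothing ever closes notifyCh,
	// so the caller always lands in the timeout branch.
	fmt.Println(waitForBroadcast(notifyCh, 100*time.Millisecond))
}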

Root cause analysis

  1. SetTags turns the tag update into a NamedBroadcast held inside a
     TransmitLimitedQueue (TLQ).
  2. During retransmission the same broadcast (item1, id = 1) is
     taken out → sent → deleted and re-inserted into the queue.
  3. Because the queue is momentarily empty, idGen is reset to 0.
     The re-inserted item still keeps its old id = 1.
  4. Very shortly afterwards the node receives the identical update back from
     a peer. A new TLQ entry (item2) is created and gets the same
     id = 1 (idGen restarted).
  5. ReplaceOrInsert treats item1 and item2 as the same key and silently
     overwrites the in-flight broadcast without calling Finished().
     The goroutine waiting in SetTags is never unblocked → timeout
     (a simplified model of this collision is sketched after the sequence below).

Sequence:

queue item1 (id=1) → retransmit  
          ↓  
delete + reinsert item1  
          ↓  
queue empty → idGen = 0  
          ↓  
new item2 (id=1) arrives → overwrite  
          ↓  
💥 SetTags time-out
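
To make the collision concrete, here is a deliberately simplified model of the failing sequence. It is not memberlist's actual code: the real queue is a btree keyed on more than the id, and names like idGen and Finished only mirror queue.go.

package main

import "fmt"

// broadcast stands in for memberlist's limitedBroadcast: it carries the id it
// was assigned and a notify channel that Finished() closes.
type broadcast struct {
	id     int64
	notify chan struct{}
}

func (b *broadcast) Finished() {
	if b.notify != nil {
		close(b.notify)
	}
}

// queue is a toy stand-in for the TLQ: entries are keyed (in part) by id, and
// the id counter is reset whenever the queue drains, which is the bug.
type queue struct {
	idGen int64
	items map[int64]*broadcast
}

func (q *queue) push(b *broadcast) {
	if prev, ok := q.items[b.id]; ok && prev != b {
		// ReplaceOrInsert-style overwrite: the old entry vanishes and nobody
		// calls prev.Finished(), so anyone waiting on it blocks until timeout.
		fmt.Println("collision: in-flight broadcast silently replaced")
	}
	q.items[b.id] = b
}

func (q *queue) remove(b *broadcast) {
	delete(q.items, b.id)
	if len(q.items) == 0 {
		q.idGen = 0 // the reset this PR removes
	}
}

func main() {
	q := &queue{items: map[int64]*broadcast{}}

	// 1. SetTags queues item1 and waits on item1.notify.
	q.idGen++
	item1 := &broadcast{id: q.idGen, notify: make(chan struct{})}
	q.push(item1)

	// 2-3. Retransmission takes item1 out (queue empty, so idGen resets to 0)
	// and re-inserts it still carrying id = 1.
	q.remove(item1)
	q.push(item1)

	// 4. The identical update arrives back from a peer; the new entry is
	// assigned id = 1 again because idGen restarted from 0.
	q.idGen++
	item2 := &broadcast{id: q.idGen}

	// 5. Same key: item1 is overwritten without Finished(), and the SetTags
	// goroutine waiting on item1.notify eventually times out.
	q.push(item2)
}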

The fix

Simply remove the line that resets idGen when the queue becomes empty: once an id has been handed out to a broadcast that may be re-inserted, restarting the counter allows the same id to be issued twice.
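
Roughly, the removed code sits in memberlist's queue deletion path and looks like this (paraphrased; the exact surrounding code in queue.go may differ):

// removed: resetting the id generator while an already-issued id can still
// be re-inserted lets a later broadcast collide with the in-flight one
if q.tq.Len() == 0 {
	q.idGen = 0
}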

All updated unit & integration tests pass.

@ktalg ktalg requested a review from a team as a code owner June 13, 2025 08:11

hashicorp-cla-app bot commented Jun 13, 2025

CLA assistant check
All committers have signed the CLA.


@corhere corhere commented Jun 25, 2025

Welcome to the club. This is, by my count, the third independent discovery of (and proposed fix for) the same issue.
