Fix queue corruption in memberlist's TransmitLimitedQueue #324


Open

ktalg wants to merge 1 commit into master

Conversation

@ktalg ktalg commented Jun 13, 2025

What I ran into

While running the following integration test (Go 1.21+), I hit a 100% reproducible timeout waiting for update broadcast whenever several nodes call SetTags concurrently:

package main

import (
	"fmt"
	"github.com/hashicorp/serf/serf"
	"io"
	"log"
	"strconv"
	"strings"
	"sync"
	"time"
)

// start launches a Serf node and returns the instance
func start(name string, join []string, c ...func(config *serf.Config)) (*serf.Serf, error) {
	cfg := serf.DefaultConfig()
	cfg.Init()

	cfg.NodeName = name
	cfg.MemberlistConfig.BindAddr = "127.0.0.1"
	cfg.MemberlistConfig.BindPort = 0
	// Aggressive settings to trigger the issue faster
	cfg.MemberlistConfig.RetransmitMult = 2
	cfg.MemberlistConfig.GossipInterval = time.Millisecond * 50
	cfg.BroadcastTimeout = 5 * time.Second
	cfg.EventBuffer = 500
	cfg.LogOutput = io.Discard

	for _, c := range c {
		c(cfg)
	}
	s, err := serf.Create(cfg)
	if err != nil {
		return nil, err
	}

	if len(join) > 0 {
		if _, err := s.Join(join, true); err != nil {
			return nil, err
		}
	}
	return s, nil
}

func main() {
	node1, err := start("s", nil, func(config *serf.Config) {
		config.MemberlistConfig.LogOutput = io.Discard
	})
	if err != nil {
		log.Fatalf("node1: %v", err)
	}

	nodes := []*serf.Serf{node1}
	totalNodes := 3
	for i := 0; i < totalNodes-1; i++ {
		addr := fmt.Sprintf("%s:%d", nodes[0].LocalMember().Addr.String(), nodes[0].LocalMember().Port)
		s, err := start(strconv.Itoa(i), []string{addr})
		if err != nil {
			log.Fatalf("node%d: %v", i, err)
		}
		nodes = append(nodes, s)
	}

	// Wait until all nodes see each other
	for _, s := range nodes {
		for s.NumNodes() != totalNodes {
			time.Sleep(10 * time.Millisecond)
		}
	}
	fmt.Println("Cluster established, starting concurrent SetTags.")

	round := 0
	for {
		round++
		var wg sync.WaitGroup
		wg.Add(len(nodes))
		for _, node := range nodes {
			// Capture loop variables
			currentNode := node
			nodeName := currentNode.LocalMember().Name

			go func() {
				defer wg.Done()
				// Each goroutine updates its own node's tags
				err := currentNode.SetTags(map[string]string{"round": strconv.Itoa(round) + strings.Repeat("a", 400)})
				if err != nil {
					// This is where the panic happens
					panic(fmt.Sprintf("node %s: %v", nodeName, err))
				}
				fmt.Printf("Done: %s, Round: %d\n", nodeName, round)
			}()
		}
		wg.Wait()
		fmt.Printf("--- Round %d finished ---\n", round)
	}
}

The program eventually panics with: node 1: timeout waiting for update broadcast
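
For context, the timeout message comes from SetTags blocking until the queued broadcast signals completion. A minimal sketch of that wait pattern, assuming the usual notify-channel contract (waitForBroadcast is a hypothetical helper, not serf's actual code):

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForBroadcast models how a tag update waits to be gossiped: the queue is
// expected to call Finished() on the broadcast, which closes notifyCh.
func waitForBroadcast(notifyCh <-chan struct{}, timeout time.Duration) error {
	select {
	case <-notifyCh:
		return nil // broadcast transmitted (or invalidated) and Finished() ran
	case <-time.After(timeout):
		return errors.New("timeout waiting for update broadcast")
	}
}

func main() {
	notifyCh := make(chan struct{})
	// If the queue silently drops the broadcast, nothing ever closes notifyCh,
	// so the caller always lands in the timeout branch.
	fmt.Println(waitForBroadcast(notifyCh, 100*time.Millisecond))
}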

Root cause analysis

  1. SetTags turns the tag update into a NamedBroadcast held inside a
     TransmitLimitedQueue (TLQ).
  2. During retransmission the same broadcast (item1, id = 1) is
     taken out → sent → deleted and re-inserted into the queue.
  3. Because the queue is momentarily empty, idGen is reset to 0.
     The re-inserted item still keeps its old id = 1.
  4. Very shortly afterwards the node receives the identical update back from
     a peer. A new TLQ entry (item2) is created and gets the same
     id = 1 (idGen restarted).
  5. ReplaceOrInsert treats item1 and item2 as the same key and silently
     overwrites the in-flight broadcast without calling Finished().
     The goroutine waiting in SetTags is never unblocked → timeout
     (a simplified model of this collision is sketched after the sequence below).

Sequence:

queue item1 (id=1) → retransmit  
          ↓  
delete + reinsert item1  
          ↓  
queue empty → idGen = 0  
          ↓  
new item2 (id=1) arrives → overwrite  
          ↓  
💥 SetTags time-out
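
To make the collision concrete, here is a deliberately simplified model of the failing sequence. It is not memberlist's actual code: the real queue is a btree keyed on more than the id, and names like idGen and Finished only mirror queue.go.

package main

import "fmt"

// broadcast stands in for memberlist's limitedBroadcast: it carries the id it
// was assigned and a notify channel that Finished() closes.
type broadcast struct {
	id     int64
	notify chan struct{}
}

func (b *broadcast) Finished() {
	if b.notify != nil {
		close(b.notify)
	}
}

// queue is a toy stand-in for the TLQ: entries are keyed (in part) by id, and
// the id counter is reset whenever the queue drains, which is the bug.
type queue struct {
	idGen int64
	items map[int64]*broadcast
}

func (q *queue) push(b *broadcast) {
	if prev, ok := q.items[b.id]; ok && prev != b {
		// ReplaceOrInsert-style overwrite: the old entry vanishes and nobody
		// calls prev.Finished(), so anyone waiting on it blocks until timeout.
		fmt.Println("collision: in-flight broadcast silently replaced")
	}
	q.items[b.id] = b
}

func (q *queue) remove(b *broadcast) {
	delete(q.items, b.id)
	if len(q.items) == 0 {
		q.idGen = 0 // the reset this PR removes
	}
}

func main() {
	q := &queue{items: map[int64]*broadcast{}}

	// 1. SetTags queues item1 and waits on item1.notify.
	q.idGen++
	item1 := &broadcast{id: q.idGen, notify: make(chan struct{})}
	q.push(item1)

	// 2-3. Retransmission takes item1 out (queue empty, so idGen resets to 0)
	// and re-inserts it still carrying id = 1.
	q.remove(item1)
	q.push(item1)

	// 4. The identical update arrives back from a peer; the new entry is
	// assigned id = 1 again because idGen restarted from 0.
	q.idGen++
	item2 := &broadcast{id: q.idGen}

	// 5. Same key: item1 is overwritten without Finished(), and the SetTags
	// goroutine waiting on item1.notify eventually times out.
	q.push(item2)
}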

The fix

Simply remove the line that resets idGen when the queue becomes empty: once an id has been handed out to a broadcast that may be re-inserted, restarting the counter allows the same id to be issued twice.
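
Roughly, the removed code sits in memberlist's queue deletion path and looks like this (paraphrased; the exact surrounding code in queue.go may differ):

// removed: resetting the id generator while an already-issued id can still
// be re-inserted lets a later broadcast collide with the in-flight one
if q.tq.Len() == 0 {
	q.idGen = 0
}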

All updated unit & integration tests pass.

@ktalg ktalg requested a review from a team as a code owner June 13, 2025 08:11

hashicorp-cla-app bot commented Jun 13, 2025

CLA assistant check
All committers have signed the CLA.


@corhere corhere commented Jun 25, 2025

Welcome to the club. This is, by my count, the third independent discovery of (and proposed fix for) the same issue.
