
Why does the Kafka service become unresponsive and cause read tcp <host>-><host>: i/o timeout? #3149

Description

@juwell

Sarama Version: v1.43.3
Kafka Version: v2.7.2
Go Version: v1.21

Hello everyone, I'm running into an issue where, during pod rolling restarts (i.e., while the group is rebalancing), calls to sarama.ConsumerGroup.Consume sometimes fail with the error read tcp xxx->xxx:9092: i/o timeout.

I'm aware this looks similar to #1192, but neither that issue nor anything I found online gives a concrete explanation of the cause, and I'd like to understand it, because there are a few peculiarities about my case:

  1. It only occurs in the production environment and cannot be reproduced in the development environment (and since the cause is unknown, it's hard to reproduce in a targeted way).
  2. Other services that use the same initialization and consumption logic do not hit this error. Moreover, this service binds about 30 topics to the same consumer group and runs 8 pods.
  3. This service previously had no issues, but starting with a certain release it suddenly began hitting this error, and now hits it with very high probability.

Information that can be provided:

  1. There are 3 pods, which have been running stably for at least 5 minutes; then a new pod is started and the error begins to occur.
  2. Each pod has a consumer group named aaa that consumes 6 topics, with a separate client created for each topic.
    That is, sarama.NewConsumerGroup() is called once per topic (a simplified sketch follows this list).
  3. There are also about 10 other clients consuming 10 different topics under 10 different group names.
  4. During startup, the multiple consumer groups (sarama.NewConsumerGroup()) within a single pod are initialized concurrently and start at roughly the same time.
  5. In the development environment, increasing the number of pods to 10 and restarting the service multiple times did not result in the error.
  6. Each topic has 12 partitions.
  7. During the error period, the Kafka server's CPU and memory usage were relatively low, and the network was stable.
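
To make items 2 and 4 concrete, here is a simplified sketch of how the consumers are started. The handler, topic names, and broker address are placeholders rather than the real code, and config is the configuration shown further down:

// Placeholder handler; the real one does the actual processing.
type exampleHandler struct{}

func (exampleHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (exampleHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (exampleHandler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        sess.MarkMessage(msg, "")
    }
    return nil
}

// Inside service startup: one sarama.ConsumerGroup per topic, all joining
// group "aaa", started concurrently at roughly the same time.
topics := []string{"topic-1", "topic-2", "topic-3", "topic-4", "topic-5", "topic-6"}
for _, topic := range topics {
    topic := topic // Go 1.21: copy the loop variable for the goroutines below
    go func() {
        group, err := sarama.NewConsumerGroup([]string{"xxx:9092"}, "aaa", config)
        if err != nil {
            log.Printf("new consumer group for %s: %v", topic, err)
            return
        }
        // Consumer.Return.Errors = true, so errors are also drained from Errors().
        go func() {
            for err := range group.Errors() {
                log.Printf("[%s] consumer group error: %v", topic, err)
            }
        }()
        for {
            // This is where "read tcp xxx->xxx:9092: i/o timeout" is returned.
            if err := group.Consume(context.Background(), []string{topic}, exampleHandler{}); err != nil {
                log.Printf("[%s] Consume returned: %v", topic, err)
            }
        }
    }()
}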

Relevant client-side (Sarama) logs, organized:

Apr 14, 2025 @ 14:48:08.361 {"saramaLog":"Connected to broker at xxx:9092 (unregistered)\n"}  
Apr 14, 2025 @ 14:48:08.369 {"saramaLog":"client/brokers registered new broker #1 at xxx:9092"}  
Apr 14, 2025 @ 14:48:09.210 {"saramaLog":"client/metadata fetching metadata for all topics from broker xxx:9092\n"}  
Apr 14, 2025 @ 14:48:09.229 {"saramaLog":"client/metadata fetching metadata for [topic] from broker xxx2:9092\n"}  
Apr 14, 2025 @ 14:48:09.239 {"saramaLog":"Connected to broker at xxx:9092 (registered as #1)\n"}
Apr 14, 2025 @ 14:48:39.244 {"saramaLog":"Closed connection to broker xxx:9092\n"}  
Apr 14, 2025 @ 14:48:39.345 ERROR read tcp xxx:46394->xxx:9092: i/o timeout

This is the server-side log:

[2025-04-14 14:48:09,243] INFO [GroupCoordinator 1]: Preparing to rebalance group aaa in state PreparingRebalance with old generation 816 (__consumer_offsets-36) (reason: Adding new member aaa-1ed78c04-d326-450d-906f-4d1c0c57a567 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

The groups reporting errors are all aaa; no other consumer group reports any errors. The Sarama config is as follows, with everything else left at its default value:

config := sarama.NewConfig()  
config.Consumer.Return.Errors = true  
config.Version = sarama.V2_5_0_0  
config.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategyRange  
config.Consumer.Offsets.Initial = sarama.OffsetOldest  
config.Metadata.Timeout = time.Second * 3

From the logs, the connection is closed 30 seconds after it was established. This is most likely still during the JoinGroup phase: the Kafka broker had not yet responded, so Sarama closed the connection itself when config.Net.ReadTimeout (left at its 30-second default) expired.

But why does the rebalance take more than 30 seconds without a response?

What I know is that during the JoinGroup phase, the Kafka broker waits for a certain period; once that period expires or all members have joined, it immediately returns the JoinGroup response and proceeds to the SyncGroup phase.
This waiting period is configured on the broker side as group.initial.rebalance.delay.ms, with a default value of 3 seconds. My production Kafka also uses 3 seconds.
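
To make the timeout relationship I suspect more concrete, here is a minimal sketch. The defaults in the comments are how I read the Sarama source, and the idea that the coordinator can hold the JoinGroup response for up to the rebalance timeout while it waits for the existing members to rejoin is an assumption based on #1192, not something I have verified:

config := sarama.NewConfig()
config.Version = sarama.V2_5_0_0

// Relevant defaults (not overridden in my real config):
//   config.Net.ReadTimeout                  = 30s (read deadline on the broker connection)
//   config.Consumer.Group.Rebalance.Timeout = 60s (sent to the broker in the JoinGroup request)
//
// Assumption: if the coordinator holds the JoinGroup response longer than
// Net.ReadTimeout while waiting for the other members, the client hits the
// read deadline first and reports "read tcp ...: i/o timeout".
//
// Hypothetical mitigation (untested on my side): keep the read timeout above
// the rebalance timeout.
config.Consumer.Group.Rebalance.Timeout = 60 * time.Second
config.Net.ReadTimeout = 90 * time.Second

If that mental model is wrong, that is exactly the part I'd like to understand.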

Thank you for your help!

    Labels

    stale (Issues and pull requests without any recent activity)
