
Why does the Kafka service become unresponsive and cause read tcp <host>-><host>: i/o timeout? #3149

Description

@juwell

Sarama Version: v1.43.3
Kafka Version: v2.7.2
Go Version: v1.21

Hello everyone, I'm running into an issue where, during pod rolling restarts (i.e., while the group is rebalancing), calls to sarama.ConsumerGroup.Consume sometimes fail with the error read tcp xxx->xxx:9092: i/o timeout.

I'm aware this looks similar to #1192, but neither that issue nor anything I found online gives a concrete explanation of the cause, and I'd like to understand it, because there are a few peculiarities about my case:

  1. It only occurs in the production environment and cannot be reproduced in the development environment (and since the cause is unknown, it's hard to reproduce in a targeted way).
  2. Other services that use the same initialization and consumption logic do not hit this error. Moreover, this service binds about 30 topics to the same consumer group and runs 8 pods.
  3. This service previously had no issues, but starting with a certain release it suddenly began hitting this error, and now hits it with very high probability.

Information that can be provided:

  1. There are 3 pods, which have been running stably for at least 5 minutes; then a new pod is started and the error begins to occur.
  2. Each pod has a consumer group named aaa that consumes 6 topics, with a separate client created for each topic.
    That is, sarama.NewConsumerGroup() is called once per topic (a simplified sketch follows this list).
  3. There are also about 10 other clients consuming 10 different topics under 10 different group names.
  4. During startup, the multiple consumer groups (sarama.NewConsumerGroup()) within a single pod are initialized concurrently and start at roughly the same time.
  5. In the development environment, increasing the number of pods to 10 and restarting the service multiple times did not result in the error.
  6. Each topic has 12 partitions.
  7. During the error period, the Kafka server's CPU and memory usage were relatively low, and the network was stable.
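
To make items 2 and 4 concrete, here is a simplified sketch of how the consumers are started. The handler, topic names, and broker address are placeholders rather than the real code, and config is the configuration shown further down:

// Placeholder handler; the real one does the actual processing.
type exampleHandler struct{}

func (exampleHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (exampleHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (exampleHandler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        sess.MarkMessage(msg, "")
    }
    return nil
}

// Inside service startup: one sarama.ConsumerGroup per topic, all joining
// group "aaa", started concurrently at roughly the same time.
topics := []string{"topic-1", "topic-2", "topic-3", "topic-4", "topic-5", "topic-6"}
for _, topic := range topics {
    topic := topic // Go 1.21: copy the loop variable for the goroutines below
    go func() {
        group, err := sarama.NewConsumerGroup([]string{"xxx:9092"}, "aaa", config)
        if err != nil {
            log.Printf("new consumer group for %s: %v", topic, err)
            return
        }
        // Consumer.Return.Errors = true, so errors are also drained from Errors().
        go func() {
            for err := range group.Errors() {
                log.Printf("[%s] consumer group error: %v", topic, err)
            }
        }()
        for {
            // This is where "read tcp xxx->xxx:9092: i/o timeout" is returned.
            if err := group.Consume(context.Background(), []string{topic}, exampleHandler{}); err != nil {
                log.Printf("[%s] Consume returned: %v", topic, err)
            }
        }
    }()
}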

Relevant client-side (Sarama) logs, organized:

Apr 14, 2025 @ 14:48:08.361 {"saramaLog":"Connected to broker at xxx:9092 (unregistered)\n"}  
Apr 14, 2025 @ 14:48:08.369 {"saramaLog":"client/brokers registered new broker #1 at xxx:9092"}  
Apr 14, 2025 @ 14:48:09.210 {"saramaLog":"client/metadata fetching metadata for all topics from broker xxx:9092\n"}  
Apr 14, 2025 @ 14:48:09.229 {"saramaLog":"client/metadata fetching metadata for [topic] from broker xxx2:9092\n"}  
Apr 14, 2025 @ 14:48:09.239 {"saramaLog":"Connected to broker at xxx:9092 (registered as #1)\n"}
Apr 14, 2025 @ 14:48:39.244 {"saramaLog":"Closed connection to broker xxx:9092\n"}  
Apr 14, 2025 @ 14:48:39.345 ERROR read tcp xxx:46394->xxx:9092: i/o timeout

This is the server-side log:

[2025-04-14 14:48:09,243] INFO [GroupCoordinator 1]: Preparing to rebalance group aaa in state PreparingRebalance with old generation 816 (__consumer_offsets-36) (reason: Adding new member aaa-1ed78c04-d326-450d-906f-4d1c0c57a567 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

The groups reporting errors are all aaa; no other consumer group reports any errors. The Sarama config is as follows, with everything else left at its default value:

config := sarama.NewConfig()  
config.Consumer.Return.Errors = true  
config.Version = sarama.V2_5_0_0  
config.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategyRange  
config.Consumer.Offsets.Initial = sarama.OffsetOldest  
config.Metadata.Timeout = time.Second * 3

From the logs, the connection is closed 30 seconds after it was established. This is most likely still during the JoinGroup phase: the Kafka broker had not yet responded, so Sarama closed the connection itself when config.Net.ReadTimeout (left at its 30-second default) expired.

But why does the rebalance take more than 30 seconds without a response?

What I know is that during the JoinGroup phase, the Kafka broker waits for a certain period; once that period expires or all members have joined, it immediately returns the JoinGroup response and proceeds to the SyncGroup phase.
This waiting period is configured on the broker side as group.initial.rebalance.delay.ms, with a default value of 3 seconds. My production Kafka also uses 3 seconds.
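
To make the timeout relationship I suspect more concrete, here is a minimal sketch. The defaults in the comments are how I read the Sarama source, and the idea that the coordinator can hold the JoinGroup response for up to the rebalance timeout while it waits for the existing members to rejoin is an assumption based on #1192, not something I have verified:

config := sarama.NewConfig()
config.Version = sarama.V2_5_0_0

// Relevant defaults (not overridden in my real config):
//   config.Net.ReadTimeout                  = 30s (read deadline on the broker connection)
//   config.Consumer.Group.Rebalance.Timeout = 60s (sent to the broker in the JoinGroup request)
//
// Assumption: if the coordinator holds the JoinGroup response longer than
// Net.ReadTimeout while waiting for the other members, the client hits the
// read deadline first and reports "read tcp ...: i/o timeout".
//
// Hypothetical mitigation (untested on my side): keep the read timeout above
// the rebalance timeout.
config.Consumer.Group.Rebalance.Timeout = 60 * time.Second
config.Net.ReadTimeout = 90 * time.Second

If that mental model is wrong, that is exactly the part I'd like to understand.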

Thank you for your help!

    Labels

    stale (Issues and pull requests without any recent activity)
