[Bug] Inconsistent markDeletePosition replication for geo-replicated shared subscriptions with delayed messages #24380

Open
@tarmacmonsterg

Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

Pulsar: 4.0.4 official docker image
Deployed on K8S

Issue Description

We have several topics in our Pulsar deployment. For some topics (cache-related), we have geo-replication disabled. Others work as expected — the subscription cursor is replicated to the backup cluster.
However, we are seeing inconsistent behavior with topics used for delayed messages and shared subscriptions. These topics have geo-replication enabled and use individual acknowledgments. According to the documentation, individual acknowledgments themselves are not replicated across clusters. However, the markDeletePosition should be replicated.
In our tests, we noticed that the markDeletePosition in the backup cluster does not move predictably. In some cases, it remains unchanged for a long time. The only time it eventually advances is after the primary cluster stops receiving new messages to that topic — and then, after a delay, the markDeletePosition is finally updated in the backup cluster.
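To make the terminology concrete, here is a minimal model of how individual acknowledgments relate to markDeletePosition and individuallyDeletedMessages. It is a sketch, not broker code: it assumes a single ledger with integer entry ids, and it reports ranges as inclusive pairs rather than in the format printed by stats-internal. The function names are made up for illustration.

```python
def mark_delete_position(acked, start=-1):
    """Return the last entry id such that every entry after `start` up to it is acked."""
    pos = start
    while pos + 1 in acked:
        pos += 1
    return pos


def individually_deleted(acked, mark_delete):
    """Group acked entries beyond the mark-delete position into inclusive (first, last) ranges."""
    ranges = []
    for entry in sorted(e for e in acked if e > mark_delete):
        if ranges and ranges[-1][1] == entry - 1:
            ranges[-1] = (ranges[-1][0], entry)  # extend the current range
        else:
            ranges.append((entry, entry))  # start a new range after a gap
    return ranges


# Entries 3 and 6 are still waiting for their delivery time, so they are not acked yet.
acked = {0, 1, 2, 4, 5, 7}
md = mark_delete_position(acked)
print(md, individually_deleted(acked, md))

# Once the delayed entries are delivered and acked, the gaps close.
acked |= {3, 6}
md = mark_delete_position(acked)
print(md, individually_deleted(acked, md))
```

In this model the mark-delete position can only move past an entry once every earlier entry is acknowledged, and the out-of-order ranges empty out once the gaps close.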

First check (stats-internal)
main cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "2382448:33718",

backup cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "50797:509",

Second check
main cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "2382448:40268",

backup cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "50797:509",

Third check
main cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "2382722:21942",

backup cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "50797:509",

Final check, after stopping the load test and draining the backlog on the main cluster
main cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "2382761:11155",

backup cluster

    "delayed_message_10_min" : {
      "markDeletePosition" : "54807:11139",

I also see one difference: on the main cluster, the individuallyDeletedMessages entries disappear after the load test is stopped.
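The four checks can be compared mechanically. The sketch below parses the "ledgerId:entryId" values copied from the output above and reports, for each check after the first, whether each cluster's position moved forward. It assumes ledger ids only grow over time within one cluster, so positions are compared within a cluster and never across clusters. The helper names are illustrative.

```python
def parse_position(text):
    """Split a "ledgerId:entryId" string into a comparable (ledger, entry) tuple."""
    ledger_id, entry_id = text.split(":")
    return int(ledger_id), int(entry_id)


# (label, main cluster, backup cluster), copied from the checks in this report
CHECKS = [
    ("first", "2382448:33718", "50797:509"),
    ("second", "2382448:40268", "50797:509"),
    ("third", "2382722:21942", "50797:509"),
    ("after load stopped", "2382761:11155", "54807:11139"),
]


def advanced_since_previous(checks):
    """For each check after the first, report whether each cluster's position moved forward."""
    result = []
    for (_, prev_main, prev_backup), (label, main, backup) in zip(checks, checks[1:]):
        result.append((
            label,
            parse_position(main) > parse_position(prev_main),
            parse_position(backup) > parse_position(prev_backup),
        ))
    return result


for label, main_moved, backup_moved in advanced_since_previous(CHECKS):
    print(f"{label}: main advanced={main_moved}, backup advanced={backup_moved}")
```

For this data, the main cluster advances at every check, while the backup cluster advances only at the last one.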

Error messages


Reproducing the issue

1.	Deploy two Pulsar clusters.
2.	Create the relevant topics.
3.	Configure geo-replication between the clusters.
4.	Enable subscription replication on the client.
5.	Start continuously producing delayed messages to the topic, with delivery delays of up to 10 minutes.
6.	On the primary cluster, consume messages selectively (based on delivery time).
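Steps 4 to 6 can be approximated with the Python client (`pulsar-client`). This is a rough sketch and not the code used in our tests: the service URL, topic, and subscription names are placeholders passed in by the caller, `pick_delay` and `run_repro` are made-up helper names, and it sends a fixed number of messages instead of producing continuously. The consumer simply acknowledges each message individually as the broker delivers it.

```python
import random
from datetime import timedelta

MAX_DELAY = timedelta(minutes=10)


def pick_delay(rng, max_delay=MAX_DELAY):
    """Pick a random delivery delay between 1 second and max_delay (inclusive)."""
    seconds = rng.randint(1, int(max_delay.total_seconds()))
    return timedelta(seconds=seconds)


def run_repro(service_url, topic, subscription, count=1000):
    """Produce `count` delayed messages and ack them one by one on a Shared subscription."""
    import pulsar  # third-party dependency: pip install pulsar-client

    client = pulsar.Client(service_url)
    try:
        # Subscribe first so the subscription exists before any message is published.
        consumer = client.subscribe(
            topic,
            subscription,
            consumer_type=pulsar.ConsumerType.Shared,
            replicate_subscription_state_enabled=True,  # step 4
        )
        producer = client.create_producer(topic)
        rng = random.Random()
        for i in range(count):
            # Step 5: each message becomes deliverable after its own delay.
            producer.send(f"msg-{i}".encode(), deliver_after=pick_delay(rng))
        for _ in range(count):
            # Step 6: messages arrive in delivery-time order, not publish order.
            msg = consumer.receive()
            consumer.acknowledge(msg)
    finally:
        client.close()
```

Calling `run_repro` against the primary cluster's service URL while polling stats-internal on both clusters should show whether the backup's markDeletePosition follows.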

Expected behavior:
The markDeletePosition should advance on both the primary and the backup clusters.

Actual behavior:
The markDeletePosition advances only on the primary cluster.
On the backup cluster, a backlog accumulates and markDeletePosition remains stuck for a long time.

Additional information

Discussion started here: https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1748598297819549

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels: type/bug