[segment replication] Add cluster setting for retry timeout of publish checkpoint tx action #17749

guojialiang92
Contributor

@guojialiang92 guojialiang92 commented Apr 1, 2025

Description

Added a test. Currently, if the primary shard fails to publish a checkpoint, the replica shard stops synchronizing with the primary shard.
TransportReplicationAction now supports specifying a retryTimeout.
PublishCheckpointAction now uses a never-give-up retry strategy.
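
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of a retry loop with a configurable timeout. It is not the code from this PR and does not use any OpenSearch classes. The class name `RetryTimeoutSketch`, the `retry` method, and the convention that a negative timeout means "never give up" are all assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; not the OpenSearch implementation.
class RetryTimeoutSketch {

    /**
     * Runs the action until it reports success or the timeout elapses.
     * Assumed convention: a negative timeoutMillis means "never give up".
     * Returns true on success, false on timeout or interruption.
     */
    static boolean retry(BooleanSupplier action, long timeoutMillis, long pauseMillis) {
        final boolean neverGiveUp = timeoutMillis < 0;
        final long startNanos = System.nanoTime();
        final long budgetNanos = neverGiveUp ? 0L : timeoutMillis * 1_000_000L;
        while (true) {
            if (action.getAsBoolean()) {
                return true; // the publish attempt succeeded
            }
            if (!neverGiveUp && System.nanoTime() - startNanos >= budgetNanos) {
                return false; // retry budget exhausted
            }
            try {
                Thread.sleep(pauseMillis); // fixed pause between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated publish that fails twice before succeeding.
        boolean published = retry(() -> ++attempts[0] >= 3, -1, 10);
        System.out.println("published=" + published + ", attempts=" + attempts[0]);
    }
}
```

With a bounded timeout, a checkpoint that keeps failing is eventually dropped, which is the situation the description says leaves the replica out of sync. An unbounded retry trades that for continued attempts until the publish succeeds.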

Related Issues

Resolves #17595

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Apr 1, 2025

❌ Gradle check result for 1edc0ca: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…eckpointAction use the never give up strategy.

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/PublishCheckpointAction_use_never_give_up_retry_strategy branch from 1edc0ca to e49aa81 Compare April 1, 2025 11:09
Contributor

github-actions bot commented Apr 1, 2025

✅ Gradle check result for e49aa81: SUCCESS

…Action_use_never_give_up_retry_strategy

# Conflicts:
#	CHANGELOG.md
Contributor

❕ Gradle check result for d333d0a: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

❌ Gradle check result for 2726e01: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92 guojialiang92 force-pushed the dev/PublishCheckpointAction_use_never_give_up_retry_strategy branch from 2726e01 to b744f4b Compare April 14, 2025 15:16
Member

@ashking94 ashking94 left a comment

lgtm

Contributor

❌ Gradle check result for b744f4b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/PublishCheckpointAction_use_never_give_up_retry_strategy branch from b744f4b to 68a5e9d Compare April 14, 2025 16:16
Contributor

❌ Gradle check result for 68a5e9d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…Action_use_never_give_up_retry_strategy

# Conflicts:
#	CHANGELOG.md
Contributor

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94
Member

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Restarted the pr build.

Contributor

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for a3a23a7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
Contributor

✅ Gradle check result for e7b926a: SUCCESS

@ashking94 ashking94 merged commit c44d230 into opensearch-project:main Apr 15, 2025
31 checks passed
Labels
bug: Something isn't working
Indexing:Replication: Issues and PRs related to core replication framework eg segrep
Development

Successfully merging this pull request may close these issues.

[BUG] segment replication stops when publish checkpoint fails
2 participants