domain: add retry on pd failure for InfoSyncer (fix #61132) #61236

Open
wants to merge 4 commits into
base: master

Conversation

NaturezzZ
Contributor

@NaturezzZ NaturezzZ commented May 21, 2025

What problem does this PR solve?

Issue Number: close #61132

Problem Summary:

DROP DATABASE fails when PD returns a transient "unavailable" response.

create database test_1;
use test_1;
create table t (a int);
insert into t values (1);
drop database test_1;

If PD fails on the RPC with the call stack below (frames separated by `;`), drop database does not work:

runtime.goexit:1223;net/http.(*conn).serve:2092;net/http.serverHandler.ServeHTTP:3210;go.etcd.io/etcd/server/v3/embed.(*accessController).ServeHTTP:458;net/http.(*ServeMux).ServeHTTP:2747;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.(*Recovery).ServeHTTP:193;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/urfave/negroni.(*Negroni).UseHandler.Wrap.func1:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*runtimeServiceValidator).ServeHTTP:48;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*redirector).ServeHTTP:187;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.NewHandler.Wrap.func21:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;net/http.HandlerFunc.ServeHTTP:2220;github.com/tikv/pd/server/api.clusterMiddleware.middleware-fm.clusterMiddleware.middleware.func1:101;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*requestInfoMiddleware).ServeHTTP:78;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*auditMiddleware).ServeHTTP:152;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*rateLimitMiddleware).ServeHTTP:184;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.(*serviceMiddlewareBuilder).createHandler.WrapFunc.func1:56;github.com/tikv/pd/server/api.(*regionLabelHandler).PatchRegionLabelRules:71;github.com/tikv/pd/pkg/schedule/labeler.(*RegionLabeler).Patch:302;github.com/tikv/pd/pkg/storage/endpoint.RunBatchOpInTxn:94;github.com/tikv/pd/pkg/storage/kv.(*etcdKVBase).RunInTxn:215;github.com/tikv/pd/pkg/storage/kv.(*etcdTxn).commit:287;github.com/tikv/pd/pkg/storage/kv.(*SlowLogTxn).Commit:171;go.etcd.io/etcd/client/v3.(*txn).Commit:145;go.etcd.io/etcd/client/v3.(*retryKVClient).Txn:117;go.etcd.io/etcd/api/v3/etcdserverpb.(*kVClient).Txn:6487

What changed and how does it work?

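Add a retry policy (a backoffer) to pdHTTPCli so that requests failing on a transient PD error are retried instead of failing immediately.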

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

This PR has been tested manually by injecting an RPC failure between TiDB and PD, and the result shows that this PR fixes #61132. We did this by instrumenting the RPC framework and using differential testing. The tooling code base is fairly large and may not be easy to integrate with the existing TiDB test code. We plan to open-source our testing tool in a few months.
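For readers who want to reproduce this kind of check without the authors' tool, one possible approach is to inject failures at the HTTP transport layer. The sketch below is an assumption-laden illustration, not the method used in this PR: `flakyTransport`, `healthyPD`, `getWithRetry`, `runDemo`, and the URL are all hypothetical, and the "PD" here is a canned in-process response so that no network is involved.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc adapts a plain function to http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// healthyPD is a stand-in for a working PD endpoint: it always answers 200 "ok".
var healthyPD = roundTripFunc(func(r *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader("ok")),
		Request:    r,
	}, nil
})

// flakyTransport fails the first `failures` requests with an injected error
// and forwards the rest to next. Not safe for concurrent use; demo only.
type flakyTransport struct {
	next     http.RoundTripper
	failures int
	calls    int
}

func (t *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.calls++
	if t.calls <= t.failures {
		return nil, errors.New("injected transient failure")
	}
	return t.next.RoundTrip(req)
}

// getWithRetry issues GET requests until one succeeds or maxAttempts is reached.
func getWithRetry(c *http.Client, url string, maxAttempts int) (string, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		resp, err := c.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return string(body), nil
	}
	return "", lastErr
}

// runDemo returns the response body, the number of transport calls, and the final error.
func runDemo(failures, maxAttempts int) (string, int, error) {
	ft := &flakyTransport{next: healthyPD, failures: failures}
	client := &http.Client{Transport: ft}
	body, err := getWithRetry(client, "http://pd.invalid/hypothetical-endpoint", maxAttempts)
	return body, ft.calls, err
}

func main() {
	fmt.Println(runDemo(2, 1)) // a single attempt hits the injected failure
	fmt.Println(runDemo(2, 5)) // with retries, the third attempt succeeds
}
```

Comparing the run without retries against the run with retries is a small-scale analogue of the before/after comparison described above.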

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix the issue that a transient PD error causes `DROP DATABASE` to fail in TiDB.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels May 21, 2025
ti-chi-bot bot commented May 21, 2025

Hi @NaturezzZ. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tiprow bot commented May 21, 2025

Hi @NaturezzZ. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lance6716
Contributor

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels May 22, 2025
Contributor

@lance6716 lance6716 left a comment

@ti-chi-bot ti-chi-bot bot requested review from JmPotato and okJiang May 22, 2025 14:08
@NaturezzZ
Contributor Author

/retest

codecov bot commented May 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.0323%. Comparing base (00efb5f) to head (cc41c62).
Report is 33 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #61236        +/-   ##
================================================
+ Coverage   73.1411%   75.0323%   +1.8911%     
================================================
  Files          1726       1748        +22     
  Lines        478040     489977     +11937     
================================================
+ Hits         349644     367641     +17997     
+ Misses       106943      99412      -7531     
- Partials      21453      22924      +1471     
Flag Coverage Δ
integration 48.6955% <0.0000%> (?)
unit 72.4401% <100.0000%> (+0.0601%) ⬆️

Components Coverage Δ
dumpling 52.7804% <ø> (ø)
parser ∅ <ø> (∅)
br 61.8304% <ø> (+14.2825%) ⬆️
@ti-chi-bot ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 22, 2025
@NaturezzZ
Contributor Author

/retest

@NaturezzZ
Contributor Author

/retest

@NaturezzZ
Contributor Author

Hi @lance6716, I removed the dependency on the pdutil package. Is it okay now?

ti-chi-bot bot commented May 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lance6716
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot commented May 24, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-05-24 08:02:18.545892495 +0000 UTC m=+83268.917679956: ☑️ agreed by lance6716.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 24, 2025
@NaturezzZ
Contributor Author

Hi @D3Hunter, would you mind taking a look? Thank you very much!

Contributor

@D3Hunter D3Hunter left a comment

Have you tested it manually, and does it work for the case in the issue? I don't see a UT in your PR, so I unchecked it in the description.

@@ -241,6 +242,11 @@ type infoschemaMinTS interface {
 	GetAndResetRecentInfoSchemaTS(now uint64) uint64
 }
 
+const (
+	// InfoSyncerRetryTime is retry time limit for InfoSyncer.
+	InfoSyncerRetryTime = 120

Suggested change
-	InfoSyncerRetryTime = 120
+	InfoSyncerRetryTime = 30*time.Second

2 minutes is too long; I prefer to make it fail faster.

@@ -255,7 +261,8 @@ func GlobalInfoSyncerInit(
 	if pdHTTPCli != nil {
 		pdHTTPCli = pdHTTPCli.
 			WithCallerID("tidb-info-syncer").
-			WithRespHandler(pdResponseHandler)
+			WithRespHandler(pdResponseHandler).
+			WithBackoffer(retry.InitialBackoffer(time.Second, time.Second, InfoSyncerRetryTime*time.Second))

Suggested change
-			WithBackoffer(retry.InitialBackoffer(time.Second, time.Second, InfoSyncerRetryTime*time.Second))
+			WithBackoffer(retry.InitialBackoffer(100*time.Millisecond, time.Second, InfoSyncerRetryTime*time.Second))

@NaturezzZ
Contributor Author

Hi @D3Hunter, I have tested this manually by injecting an RPC failure between TiDB and PD, and confirmed that this PR fixes the issue. We did this by instrumenting the RPC framework and using differential testing. The tooling code base is fairly large and may not be easy to integrate with the TiDB test code. We plan to open-source our testing tool in a few months.

@D3Hunter
Contributor

> Hi @D3Hunter, I have tested manually by injecting a RPC failure between tidb and pd, and confirmed that this issue is fixed by this PR. We did this by instrumenting the RPC framework and differential testing. The code base is a little bit large and maybe not very easy to integrate with tidb testing codes. We plan to open-source our testing tool months later.

Can you add this to the description under "Manual test (add detailed scripts or steps below)", and also check that box?

Labels
needs-1-more-lgtm Indicates a PR needs 1 more LGTM. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

TiDB does not handle transient error raised by PD
3 participants