domain: add retry on pd failure for InfoSyncer (fix #61132) #61236
base: master
Conversation
Hi @NaturezzZ. Thanks for your PR. I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @NaturezzZ. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with /ok-to-test. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/retest
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #61236 +/- ##
================================================
+ Coverage 73.1411% 75.0323% +1.8911%
================================================
Files 1726 1748 +22
Lines 478040 489977 +11937
================================================
+ Hits 349644 367641 +17997
+ Misses 106943 99412 -7531
- Partials 21453 22924 +1471
Flags with carried forward coverage won't be shown.
/retest
/retest
Hi @lance6716, I removed the dependency on the pdutil package. Is it okay now?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: lance6716. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
[LGTM Timeline notifier] Timeline:
Hi @D3Hunter, would you mind taking a look? Thank you very much!
Have you tested it manually? Does it work for the case in the issue? I don't see a UT in your PR, so I un-checked it in the description.
@@ -241,6 +242,11 @@ type infoschemaMinTS interface {
 	GetAndResetRecentInfoSchemaTS(now uint64) uint64
 }
 
+const (
+	// InfoSyncerRetryTime is retry time limit for InfoSyncer.
+	InfoSyncerRetryTime = 120
Suggested change:
- InfoSyncerRetryTime = 120
+ InfoSyncerRetryTime = 30*time.Second

2 minutes is too long; I prefer to make it fail faster.
@@ -255,7 +261,8 @@ func GlobalInfoSyncerInit(
 	if pdHTTPCli != nil {
 		pdHTTPCli = pdHTTPCli.
 			WithCallerID("tidb-info-syncer").
-			WithRespHandler(pdResponseHandler)
+			WithRespHandler(pdResponseHandler).
+			WithBackoffer(retry.InitialBackoffer(time.Second, time.Second, InfoSyncerRetryTime*time.Second))
Suggested change:
- WithBackoffer(retry.InitialBackoffer(time.Second, time.Second, InfoSyncerRetryTime*time.Second))
+ WithBackoffer(retry.InitialBackoffer(100*time.Millisecond, time.Second, InfoSyncerRetryTime*time.Second))
Hi @D3Hunter, I have tested it manually by injecting an RPC failure between TiDB and PD, and confirmed that this issue is fixed by this PR. We did this by instrumenting the RPC framework and differential testing. The code base is fairly large and may not be easy to integrate with the TiDB testing code. We plan to open-source our testing tool in a few months.
Can you update it in the description under the Tests section?
What problem does this PR solve?
Issue Number: close #61132
Problem Summary:
DROP DATABASE fails when PD returns a transient unavailable response.
If PD fails on the following RPC:
runtime.goexit:1223;net/http.(*conn).serve:2092;net/http.serverHandler.ServeHTTP:3210;go.etcd.io/etcd/server/v3/embed.(*accessController).ServeHTTP:458;net/http.(*ServeMux).ServeHTTP:2747;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.(*Recovery).ServeHTTP:193;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/urfave/negroni.(*Negroni).UseHandler.Wrap.func1:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*runtimeServiceValidator).ServeHTTP:48;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*redirector).ServeHTTP:187;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.NewHandler.Wrap.func21:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;net/http.HandlerFunc.ServeHTTP:2220;github.com/tikv/pd/server/api.clusterMiddleware.middleware-fm.clusterMiddleware.middleware.func1:101;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*requestInfoMiddleware).ServeHTTP:78;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*auditMiddleware).ServeHTTP:152;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*rateLimitMiddleware).ServeHTTP:184;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.(*serviceMiddlewareBuilder).createHandler.WrapFunc.func1:56;github.com/tikv/pd/server/api.(*regionLabelHandler).PatchRegionLabelRules:71;github.com/tikv/pd/pkg/schedule/labeler.(*RegionLabeler).Patch:302;github.com/tikv/pd/pkg/storage/endpoint.RunBatchOpInTxn:94;github.com/tikv/pd/pkg/storage/kv.(*etcdKVBase).RunInTxn:215;github.com/tikv/pd/pkg/storage/kv.(*etcdTxn).commit:287;github.com/tikv/pd/pkg/storage/kv.(*SlowLogTxn).Commit:171;go.etcd.io/etcd/client/v3.(*txn).Commit:145;go.etcd.io/etcd/client/v3.(*retryKVClient).Txn:117;go.etcd.io/etcd/api/v3/etcdserverpb.(*kVClient).Txn:6487
then DROP DATABASE does not work.
What changed and how does it work?
Add a retry policy (backoffer) to the pdHTTPCli used by the InfoSyncer.
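Below is a hedged sketch of how the retry policy is wired onto the PD HTTP client, condensed from the diff in this PR. pdhttp.NewClient, the PD address, and the omission of WithRespHandler are assumptions for illustration; WithCallerID, WithBackoffer, and retry.InitialBackoffer come from the change itself.

```go
// Sketch only: how the InfoSyncer's PD HTTP client gains a retry policy.
// pdhttp.NewClient and the address below are assumptions for illustration;
// the With* chain mirrors the diff in this PR.
package main

import (
	"time"

	pdhttp "github.com/tikv/pd/client/http"
	"github.com/tikv/pd/client/retry"
)

const infoSyncerRetryTime = 120 // seconds, mirroring InfoSyncerRetryTime in this PR

func main() {
	// Hypothetical client pointed at a local PD endpoint.
	cli := pdhttp.NewClient("tidb-info-syncer", []string{"http://127.0.0.1:2379"})

	// The change: wrap the client with a backoffer so transient PD failures
	// are retried (1s initial interval, 1s cap, 120s total budget).
	cli = cli.
		WithCallerID("tidb-info-syncer").
		WithBackoffer(retry.InitialBackoffer(time.Second, time.Second, infoSyncerRetryTime*time.Second))

	_ = cli // subsequent PD HTTP calls made through cli now retry on failure
}
```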
Check List
Tests
This PR has been tested manually by injecting an RPC failure between TiDB and PD; the result shows that this PR fixes #61132. We did this by instrumenting the RPC framework and differential testing. The code base is fairly large and may not be easy to integrate with the existing TiDB testing code. We plan to open-source our testing tool in a few months. A hypothetical sketch of the test idea follows.
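The following is a hypothetical, self-contained sketch of the manual-test idea: a PD-like HTTP endpoint that fails transiently, and a caller that keeps retrying until the endpoint recovers. It is not the actual instrumentation or differential-testing harness described above; names, timings, and the retry loop are illustrative.

```go
// Hypothetical sketch: simulate a transient PD outage and retry until recovery.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Fail the first two requests to simulate a transient PD outage.
	failures := 2
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if failures > 0 {
			failures--
			http.Error(w, "transient failure", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// Retry on failure, standing in for the backoffer this PR adds to pdHTTPCli.
	deadline := time.Now().Add(5 * time.Second)
	for {
		resp, err := http.Get(srv.URL)
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			fmt.Println("request eventually succeeded")
			return
		}
		if resp != nil {
			resp.Body.Close()
		}
		if time.Now().After(deadline) {
			fmt.Println("gave up: PD-like endpoint never recovered")
			return
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```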
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.