Skip to content

Transient failure causes end-to-end failure in SQL execution #9326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
NaturezzZ opened this issue May 15, 2025 · 4 comments
Open

Transient failure causes end-to-end failure in SQL execution #9326

NaturezzZ opened this issue May 15, 2025 · 4 comments
Labels
type/bug The issue is confirmed as a bug.

Comments

@NaturezzZ
Copy link

Bug Report

A transient failure (service unavailable) in PD finally causes the failure of end-to-end SQL execution. The failure is not handled in pd nor tidb.
When the following RPC fails with unavailable:

runtime.goexit:1223;net/http.(*conn).serve:2092;net/http.serverHandler.ServeHTTP:3210;go.etcd.io/etcd/server/v3/embed.(*accessController).ServeHTTP:458;net/http.(*ServeMux).ServeHTTP:2747;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.(*Recovery).ServeHTTP:193;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/urfave/negroni.(*Negroni).UseHandler.Wrap.func1:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/gorilla/mux.(*Router).ServeHTTP:210;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*runtimeServiceValidator).ServeHTTP:48;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/pkg/utils/apiutil/serverapi.(*redirector).ServeHTTP:187;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.NewHandler.Wrap.func21:46;github.com/gorilla/mux.(*Router).ServeHTTP:210;net/http.HandlerFunc.ServeHTTP:2220;github.com/tikv/pd/server/api.clusterMiddleware.middleware-fm.clusterMiddleware.middleware.func1:101;github.com/urfave/negroni.(*Negroni).ServeHTTP:96;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*requestInfoMiddleware).ServeHTTP:78;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*auditMiddleware).ServeHTTP:152;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/tikv/pd/server/api.(*rateLimitMiddleware).ServeHTTP:184;github.com/urfave/negroni.middleware.ServeHTTP:38;github.com/urfave/negroni.HandlerFunc.ServeHTTP:29;github.com/tikv/pd/server/api.(*serviceMiddlewareBuilder).createHandler.WrapFunc.func1:56;github.com/tikv/pd/server/api.(*regionLabelHandler).PatchRegionLabelRules:71;github.com/tikv/pd/pkg/schedule/labeler.(*RegionLabeler).Patch:302;github.com/tikv/pd/pkg/storage/endpoint.RunBatchOpInTxn:94;github.com/tikv/pd/pkg/storage/kv.(*etcdKVBase).RunInTxn:215;github.com/tikv/pd/pkg/storage/kv.(*etcdTxn).commit:287;github.com/tikv/pd/pkg/storage/kv.(*SlowLogTxn).Commit:171;go.etcd.io/etcd/client/v3.(*txn).Commit:145;go.etcd.io/etcd/client/v3.(*retryKVClient).Txn:117;go.etcd.io/etcd/api/v3/etcdserverpb.(*kVClient).Txn:6487

And try to delete database

mysql> drop database access_path_selection;
ERROR 8243 (HY000): "rpc error: code = Unavailable desc = transient error"

The RPC from tidb to pd is as below

[2025/05/15 06:43:36.622 +00:00] [INFO] [audit.go:128] ["audit log"] [service-info="{ServiceLabel:PatchRegionLabelRules, Method:HTTP/1.1/PATCH:/pd/api/v1/config/region-label/rules, CallerID:tidb-info-syncer, IP:10.244.1.79, Port:60058, StartTime:2025-05-15 06:43:36 +0000 UTC, URLParam:{}, BodyParam:{\"sets\":[],\"deletes\":[\"schema/access_path_selection/access_path_selection\"]}}"]

What did you do?

Inject a transient error between pd and tikv(?) and try to delete the database.

What did you expect to see?

PD handles this bug with a retry.

What did you see instead?

PD raises this error to tidb and tidb does not handle it as well.
The log of tidb is shown below:

[2025/05/15 06:43:36.470 +00:00] [INFO] [session.go:4111] ["CRUCIAL OPERATION"] [conn=3078628316] [schemaVersion=23107] [cur_db=] [sql="drop database access_path_selection"] [user=root@%]
[2025/05/15 06:43:36.552 +00:00] [INFO] [executor.go:6613] ["DDL job submitted"] [category=ddl] [job="ID:22826, Type:drop schema, State:queueing, SchemaState:public, SchemaID:22819, TableID:0, RowCount:0, ArgLen:1, start time: 2025-05-15 06:43:36.444 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"] [query="drop database access_path_selection"]
[2025/05/15 06:43:36.570 +00:00] [INFO] [job_submitter.go:139] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:22826, Type:drop schema, State:queueing, SchemaState:public, SchemaID:22819, TableID:0, RowCount:0, ArgLen:1, start time: 2025-05-15 06:43:36.444 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2; "] [table=true] [fast_create=true]
[2025/05/15 06:43:36.578 +00:00] [INFO] [job_worker.go:784] ["run DDL job"] [category=ddl] [jobID=22826] [conn=3078628316] [job="ID:22826, Type:drop schema, State:queueing, SchemaState:public, SchemaID:22819, TableID:0, RowCount:0, ArgLen:0, start time: 2025-05-15 06:43:36.444 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"]
[2025/05/15 06:43:36.656 +00:00] [WARN] [info.go:523] ["response not 200"] [method=PATCH] [host=basic-pd-0.basic-pd-peer.default.svc:2379] [url=/pd/api/v1/config/region-label/rules] ["http status"=500]
[2025/05/15 06:43:36.659 +00:00] [INFO] [job_worker.go:717] ["DDL job is cancelled normally"] [category=ddl] [jobID=22826] [conn=3078628316] [error="[domain:8243]\"rpc error: code = Unavailable desc = outbound unavailable injected by RPCTest\"\n"]
[2025/05/15 06:43:36.661 +00:00] [INFO] [job_worker.go:390] ["finish DDL job"] [category=ddl] [jobID=22826] [conn=3078628316] [job="ID:22826, Type:drop schema, State:cancelled, SchemaState:public, SchemaID:22819, TableID:0, RowCount:0, ArgLen:1, start time: 2025-05-15 06:43:36.444 +0000 UTC, Err:[domain:8243]\"rpc error: code = Unavailable desc = outbound unavailable injected by RPCTest\"\n, ErrCount:1, SnapshotVersion:0, Version: v2"]
[2025/05/15 06:43:36.697 +00:00] [INFO] [executor.go:6738] ["DDL job is failed"] [category=ddl] [jobID=22826]
[2025/05/15 06:43:36.698 +00:00] [INFO] [tidb.go:278] ["rollbackTxn called due to ddl/autocommit failure"]
[2025/05/15 06:43:36.699 +00:00] [WARN] [session.go:2162] ["run statement failed"] [conn=3078628316] [session_alias=] [schemaVersion=23107] [error="[domain:8243]\"rpc error: code = Unavailable desc = outbound unavailable injected by RPCTest\"\n"] [session="{\n  \"currDBName\": \"\",\n  \"id\": 3078628316,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"root\",\n    \"Hostname\": \"127.0.0.1\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"root\",\n    \"AuthHostname\": \"%\",\n    \"AuthPlugin\": \"mysql_native_password\"\n  }\n}"]
[2025/05/15 06:43:36.704 +00:00] [INFO] [conn.go:1184] ["command dispatched failed"] [conn=3078628316] [session_alias=] [connInfo="id:3078628316, addr:127.0.0.1:59450 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="drop database access_path_selection"] [txn_mode=PESSIMISTIC] [timestamp=458041961072295938] [err="[domain:8243]\"rpc error: code = Unavailable desc = outbound unavailable injected by RPCTest\"\n\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:178\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb/pkg/ddl.(*executor).DoDDLJobWrapper\n\t/workspace/source/tidb/pkg/ddl/executor.go:6739\ngithub.com/pingcap/tidb/pkg/ddl.(*executor).doDDLJob2\n\t/workspace/source/tidb/pkg/ddl/executor.go:6555\ngithub.com/pingcap/tidb/pkg/ddl.(*executor).DropSchema\n\t/workspace/source/tidb/pkg/ddl/executor.go:772\ngithub.com/pingcap/tidb/pkg/executor.(*DDLExec).executeDropDatabase\n\t/workspace/source/tidb/pkg/executor/ddl.go:338\ngithub.com/pingcap/tidb/pkg/executor.(*DDLExec).Next\n\t/workspace/source/tidb/pkg/executor/ddl.go:169\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:456\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).next\n\t/workspace/source/tidb/pkg/executor/adapter.go:1266\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).handleNoDelayExecutor\n\t/workspace/source/tidb/pkg/executor/adapter.go:1015\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).handleNoDelay\n\t/workspace/source/tidb/pkg/executor/adapter.go:848\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).Exec\n\t/workspace/source/tidb/pkg/executor/adapter.go:611\ngithub.com/pingcap/tidb/pkg/session.runStmt\n\t/workspace/source/tidb/pkg/session/session.go:2288\ngithub.com/pingcap/tidb/pkg/session.(*session).ExecuteStmt\n\t/workspace/source/tidb/pkg/session/session.go:2150\ngithub.com/pingcap/tidb/pkg/server.(*TiDBContext).ExecuteStmt\n\t/workspace/source/tidb/pkg/server/driver_tidb.go:291\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/workspace/source/tidb/pkg/server/conn.go:2026\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\n\t/workspace/source/tidb/pkg/server/conn.go:1779\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/workspace/source/tidb/pkg/server/conn.go:1378\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/workspace/source/tidb/pkg/server/conn.go:1147\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/workspace/source/tidb/pkg/server/server.go:741\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1223"]

What version of PD are you using (pd-server -V)?

v8.5.0

@lance6716
Copy link
Contributor

ptal @okJiang

@qts0312
Copy link

qts0312 commented May 16, 2025

This bug can be reproduced in this way

create database test_1;
use test_1;
create table t (a int);
insert into t values (1);
drop database test_1;

Last SQL will fail for transient failure on the RPC mentioned above

@okJiang
Copy link
Member

okJiang commented May 19, 2025

@NaturezzZ
Copy link
Author

Fixed by pingcap/tidb#61236

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

4 participants