Lightning failed when killing the PD leader #9142

Open
Lily2025 opened this issue Mar 13, 2025 · 2 comments

Comments

@Lily2025

Bug Report

What did you do?

1. Run a Lightning import.
2. Kill the PD leader.

What did you expect to see?

The Lightning import should succeed.

What did you see instead?

The Lightning import failed:
[2025/03/09 23:34:46.447 +00:00] [ERROR] [service_discovery.go:581] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379,http://tc-pd-1.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379,http://tc-pd-2.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE"] [stack="github.com/tikv/pd/client/servicediscovery.(*serviceDiscovery).updateServiceModeLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/[email protected]/servicediscovery/service_discovery.go:581"]
[2025/03/09 23:34:46.447 +00:00] [INFO] [service_discovery.go:889] ["[pd] cannot update member from this url"] [url=http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE"]
[2025/03/09 23:34:46.448 +00:00] [INFO] [client.go:210] ["[tso] switch the tso leader serving url"] [new-url=http://tc-pd-2.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379]
[2025/03/09 23:34:46.448 +00:00] [INFO] [service_discovery.go:986] ["[pd] switch leader"] [new-leader=http://tc-pd-2.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379] [old-leader=http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379]
[2025/03/09 23:34:46.448 +00:00] [INFO] [service_discovery.go:889] ["[pd] cannot update member from this url"] [url=http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.233.76.189:2379: connect: connection refused\" target:tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379 status:TRANSIENT_FAILURE"]
[2025/03/09 23:36:37.833 +00:00] [INFO] [import.go:1134] [progress] [total=65.7%] [tables="11/12 (91.7%)"] [chunks="319/578 (55.2%)"] [engines="22/24 (91.7%)"] [restore-bytes=15.92GiB/39.97GiB] [restore-rows=257940402/647796346(estimated)] [import-bytes=63.63GiB/159.8GiB(estimated)] ["encode speed(MiB/s)"=18.109545168332534] [state=writing] [remaining=7m49s]
[2025/03/09 23:37:38.992 +00:00] [INFO] [table_import.go:1410] ["analyze completed"] [table=location.MonsterSource] [takeTime=4m45.311751634s] []
[2025/03/09 23:37:38.993 +00:00] [ERROR] [import.go:1413] ["restore all tables data failed"] [takeTime=16m1.172135185s] [error="[Lightning:Restore:ErrRestoreTable]restore table location.GooglePoiSource failed: rpc error: code = Unavailable desc = error reading from server: EOF"]
[2025/03/09 23:37:38.993 +00:00] [INFO] [import.go:1008] ["everything imported, stopping periodic actions"]
[2025/03/09 23:37:38.995 +00:00] [ERROR] [import.go:577] ["run failed"] [step=4] [error="[Lightning:Restore:ErrRestoreTable]restore table location.GooglePoiSource failed: rpc error: code = Unavailable desc = error reading from server: EOF"]
[2025/03/09 23:37:38.995 +00:00] [ERROR] [import.go:587] ["the whole procedure failed"] [takeTime=16m4.489249975s] [error="[Lightning:Restore:ErrRestoreTable]restore table location.GooglePoiSource failed: rpc error: code = Unavailable desc = error reading from server: EOF"]
[2025/03/09 23:37:39.027 +00:00] [INFO] [service_discovery.go:544] ["[pd] exit member loop due to context canceled"]

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v9.0.0-alpha-70-g5e82f16
Edition: Community
Git Commit Hash: 5e82f16
Git Branch: HEAD
UTC Build Time: 2025-03-06 10:18:40
2025-03-10T07:21:32.242+0800

@Lily2025
Author

/type bug
/severity major
/assign okJiang

@okJiang
Member

okJiang commented Mar 17, 2025

The root cause of this error occurred at 2025/03/09 23:22:33.979:

[2025/03/09 23:22:33.992 +00:00] [ERROR] [import.go:1452] ["failed to import table"] [table=location.GooglePoiSource] [error="rpc error: code = Unavailable desc = error reading from server: EOF"] [errorVerbose="rpc error: code = Unavailable desc = error reading from server: EOF\ngithub.com/tikv/pd/client/clients/tso.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/[email protected]/clients/tso/stream.go:427\nruntime.goexit\n\t/root/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1700\ngithub.com/tikv/pd/client/clients/tso.(*Request).waitCtx\n\t/root/go/pkg/mod/github.com/tikv/pd/[email protected]/clients/tso/request.go:86\ngithub.com/tikv/pd/client/clients/tso.(*Request).Wait\n\t/root/go/pkg/mod/github.com/tikv/pd/[email protected]/clients/tso/request.go:73\ngithub.com/tikv/pd/client.(*client).GetTS\n\t/root/go/pkg/mod/github.com/tikv/pd/[email protected]/client.go:515\ngithub.com/pingcap/tidb/pkg/lightning/backend/local.(*Backend).GetTS\n\t/workspace/source/tidb/pkg/lightning/backend/local/local.go:1786\ngithub.com/pingcap/tidb/pkg/lightning/backend/local.(*engineManager).allocateTSIfNotExists\n\t/workspace/source/tidb/pkg/lightning/backend/local/engine_mgr.go:459\ngithub.com/pingcap/tidb/pkg/lightning/backend/local.(*engineManager).openEngine\n\t/workspace/source/tidb/pkg/lightning/backend/local/engine_mgr.go:280\ngithub.com/pingcap/tidb/pkg/lightning/backend/local.(*Backend).OpenEngine\n\t/workspace/source/tidb/pkg/lightning/backend/local/local.go:817\ngithub.com/pingcap/tidb/pkg/lightning/backend.EngineManager.OpenEngine\n\t/workspace/source/tidb/pkg/lightning/backend/backend.go:266\ngithub.com/pingcap/tidb/lightning/pkg/importer.(*TableImporter).importEngines\n\t/workspace/source/tidb/lightning/pkg/importer/table_import.go:476\ngithub.com/pingcap/tidb/lightning/pkg/importer.(*TableImporter).importTable\n\t/workspace/source/tidb/lightning/pkg/importer/table_import.go:265\ngithub.com/pingcap/tidb/lightning/pkg/importer.(*Controller).importTables.func6\n\t/workspace/source/tidb/lightning/pkg/importer/import.go:1450\nruntime.goexit\n\t/root/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1700"]

When Lightning called OpenEngine, which in turn calls GetTS(), the PD leader was down, so the request returned rpc error: code = Unavailable desc = error reading from server: EOF

According to PD's log, the new leader was elected at 07:22:45 (+08:00), which is 23:22:45 UTC, about 11 seconds after the error above:

[2025/03/10 07:22:45.138 +08:00] [INFO] [server.go:1618] ["start to watch pd leader"] [pd-leader="name:"tc-pd-0" member_id:2809911138913259268 peer_urls:"http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2380" client_urls:"http://tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7783261-1-181.svc:2379" "]

The PD client does not appear to retry a request that hits an error; it returns the error to the caller immediately. Only afterward does it inspect the error to determine whether the leader has changed and, if so, switch the stream to the new leader. Until that switch completes, the PD client may be unable to serve any requests and keeps returning errors.

If so, the client may need an option to keep retrying during this window instead of returning the error immediately.

Considering this analysis, perhaps this PR does not address the current issue.
