Fix stale region cache with no leader #445

Open
wants to merge 2 commits into
base: master
from

Conversation

yongman
Contributor

@yongman yongman commented Mar 11, 2024

When a client is created during tikv-server startup, before a region has elected a leader, the region cache in the client may hold a stale entry for that region with no leader.

Accesses to that region then keep returning a no-leader error until the region's id_ver changes.
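
To illustrate the idea of refreshing a leaderless cache entry, here is a minimal, self-contained sketch. It is not the code in this PR: `CachedRegion`, `RegionCache`, `get_with_leader`, and the `fetch` callback (standing in for a read through PD) are all hypothetical names made up for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a cached region entry.
#[derive(Clone, Debug, PartialEq)]
struct CachedRegion {
    id: u64,
    ver: u64,
    /// Store id of the leader; `None` if no leader was known when cached.
    leader: Option<u64>,
}

/// Toy region cache keyed by region id (assumption: the real cache is richer).
#[derive(Default)]
struct RegionCache {
    regions: HashMap<u64, CachedRegion>,
}

impl RegionCache {
    fn insert(&mut self, region: CachedRegion) {
        self.regions.insert(region.id, region);
    }

    /// Returns the cached region if it has a leader. Otherwise evicts the
    /// entry and reloads it through `fetch`, caching the result only when
    /// it carries a leader, so a leaderless entry never stays in the cache.
    fn get_with_leader<F>(&mut self, id: u64, fetch: F) -> Option<CachedRegion>
    where
        F: FnOnce(u64) -> Option<CachedRegion>,
    {
        if let Some(region) = self.regions.get(&id) {
            if region.leader.is_some() {
                return Some(region.clone());
            }
        }
        // Cache miss or leaderless entry: drop it and read through.
        self.regions.remove(&id);
        let fresh = fetch(id)?;
        if fresh.leader.is_some() {
            self.regions.insert(id, fresh.clone());
        }
        Some(fresh)
    }
}
```

With this shape, a leaderless entry is replaced on the next access instead of waiting for the region's id_ver to change. If the reloaded region still has no leader, the caller sees that and the entry is not cached.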

@yongman yongman force-pushed the ym/fix-region-cache branch from 21cbeca to 04c77d2 Compare March 11, 2024 06:23
Signed-off-by: yongman <[email protected]>
@yongman yongman force-pushed the ym/fix-region-cache branch from 04c77d2 to 45b5fae Compare March 11, 2024 06:46
@pingyu
Collaborator

pingyu commented Mar 11, 2024

It seems that if the region still has no leader when we read through to the PD server, we would get the no-leader error all the same.

How about handling this situation uniformly in handle_region_error? Then this error can be retried with backoff, which avoids putting too much pressure on the PD servers.

(It's likely that some related code needs to change too, as this error is raised at apply_shard. Maybe we can pass the region_store to single_shard_handler and handle the no-leader condition there.)
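
A rough sketch of what retrying this error with backoff could look like, independent of the client's actual types: `RegionError`, `Backoff`, and `retry_no_leader` below are hypothetical and written only for this example. The sketch is also synchronous for brevity, whereas the client code is async.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type; only the no-leader case is retried.
#[derive(Debug, PartialEq)]
enum RegionError {
    NoLeader,
    Other(String),
}

/// Hypothetical exponential backoff with a delay cap and an attempt limit.
struct Backoff {
    next_ms: u64,
    max_ms: u64,
    attempts_left: u32,
}

impl Backoff {
    fn new(base_ms: u64, max_ms: u64, max_attempts: u32) -> Self {
        Backoff { next_ms: base_ms, max_ms, attempts_left: max_attempts }
    }

    /// Returns the next delay, or `None` once the retry budget is spent.
    fn next_delay(&mut self) -> Option<Duration> {
        if self.attempts_left == 0 {
            return None;
        }
        self.attempts_left -= 1;
        let delay = Duration::from_millis(self.next_ms);
        // Double the delay for the next round, capped at `max_ms`.
        self.next_ms = self.next_ms.saturating_mul(2).min(self.max_ms);
        Some(delay)
    }
}

/// Runs `op`, sleeping and retrying while it reports `NoLeader`.
/// Any other result (success or a different error) is returned as-is.
fn retry_no_leader<T, F>(mut op: F, mut backoff: Backoff) -> Result<T, RegionError>
where
    F: FnMut() -> Result<T, RegionError>,
{
    loop {
        match op() {
            Err(RegionError::NoLeader) => match backoff.next_delay() {
                Some(delay) => thread::sleep(delay),
                None => return Err(RegionError::NoLeader),
            },
            other => return other,
        }
    }
}
```

The attempt limit bounds how long a caller waits for an election, and the growing delay spaces out the repeated lookups that would otherwise hit PD.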

@yongman
Contributor Author

yongman commented Mar 12, 2024

@pingyu Thanks for your advice. Handling the NotLeader error in single_shard_handler alone is not enough: Shardable::shards, store_stream_for_keys, store_stream_for_range, store_stream_for_ranges and resolve_locks also raise this error.

This seems to require a lot of modifications, which could take a lot of time and introduce more risk. Moreover, the application logic should already be able to retry and back off when handling this error, so just refreshing the region cache seems reasonable.

2 participants