Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: start services after first heartbeat response processed #2424

Merged

Conversation

WenyXu
Copy link
Member

@WenyXu WenyXu commented Sep 17, 2023

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

Start services after the first heartbeat response is processed

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.

Refer to a related PR or issue link (optional)

@codecov
Copy link

codecov bot commented Sep 17, 2023

Codecov Report

Merging #2424 (f5d4a2d) into develop (4a82926) will decrease coverage by 0.03%.
Report is 1 commits behind head on develop.
The diff coverage is 17.39%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2424      +/-   ##
===========================================
- Coverage    77.72%   77.70%   -0.03%     
===========================================
  Files          724      724              
  Lines       114092   114126      +34     
===========================================
- Hits         88679    88676       -3     
- Misses       25413    25450      +37     

@WenyXu WenyXu force-pushed the feat/start-service-after-hb-received branch from 17faeef to 0470e91 Compare September 17, 2023 14:38
@fengjiachun
Copy link
Collaborator

fengjiachun commented Sep 17, 2023

Will it encounter a readonly region error if we start services directly?
But blocking datanode start does not seem to be a good idea, consider providing a parameter such as start(blocking: bool), we can call start(true) on testing.

@WenyXu WenyXu force-pushed the feat/start-service-after-hb-received branch from 0470e91 to 6eea7e9 Compare September 17, 2023 14:53
@fengjiachun
Copy link
Collaborator

I am afraid that may not be effective to blocking datanode from starting, because the route information of metasrv will guide the Frontend to access the datanode, and the Frontend will still encounter errors, but not readonly, but errors such as no service.

@WenyXu
Copy link
Member Author

WenyXu commented Sep 17, 2023

I am afraid that may not be effective to blocking datanode from starting, because the route information of metasrv will guide the Frontend to access the datanode, and the Frontend will still encounter errors, but not readonly, but errors such as no service.

  1. Waits for the first heartbeat response should be very fast. The first thing (without any wait) the datanode will do is send a heartbeat to metasrv.
  2. No Service means the datanode is still under coordination, and the service is currently unavailable. Then, the gateway or other component can know it. If we skip the coordination, No components will know this except the Enduesr (Then they may scream, why my table is readonly🤣).
  3. Retrying writes on the "read-only" table is problematic. It's better to let upper service know the datanode is currently unavailable

@killme2008
Copy link
Contributor

The datanode must try its best to start and provide services properly at startup. If it can't for some reason (such as metasrv availability or network issues etc.), it can retry in the background.

@killme2008
Copy link
Contributor

But my question is do we need to provide an explicit synchronous RPC for such requests or just wait for the first heartbeat response asynchronously in an implicit way?

I think the former is better for understanding and debugging.

@WenyXu WenyXu marked this pull request as draft September 18, 2023 03:39
@MichaelScofield
Copy link
Collaborator

I suggest going the full async way, and combining region's open/close state with table route. Like this:

  • Datanode starts with heartbeat enabled, but not waiting on the "first heartbeat" to decide which regions to open;
  • Metasrv decides which regions a Datanode can open, and carry this infomation in heartbeat's response;
  • When regions are opened in a Datanode, it's reflected in the Datanode's heartbeats. Metasrv knows it, then modify the table route to let the regions visible to Frontend(clients).

The essential of this idea is fully reverse the region open/close management from "Datanode requests regions from Metasrv" to "regions are assigned to Datanode by Metasrv, dynamically". Also combining region open/close state with table route makes clients happy.

@WenyXu WenyXu marked this pull request as ready for review September 18, 2023 04:54
@WenyXu WenyXu force-pushed the feat/start-service-after-hb-received branch from 23497d8 to 7b1c3ea Compare September 18, 2023 04:55
@WenyXu WenyXu force-pushed the feat/start-service-after-hb-received branch from 7b1c3ea to 7ca6aeb Compare September 18, 2023 05:01
@WenyXu
Copy link
Member Author

WenyXu commented Sep 18, 2023

But my question is do we need to provide an explicit synchronous RPC for such requests or just wait for the first heartbeat response asynchronously in an implicit way?

I think the former is better for understanding and debugging.

The synchronous RPC is too heavy.

src/datanode/src/datanode.rs Outdated Show resolved Hide resolved
@fengjiachun fengjiachun added this pull request to the merge queue Sep 18, 2023
Merged via the queue into GreptimeTeam:develop with commit 2c3ff90 Sep 18, 2023
17 checks passed
paomian pushed a commit to paomian/greptimedb that referenced this pull request Oct 19, 2023
…meTeam#2424)

* feat: start services after first heartbeat response processed

* refactor: watch changes in RegionAliveKeeper

* feat: add coordination to DatanodeOptions

* chore: apply suggestions from CR

* chore: apply suggestions from CR

* chore: enable coordination in sqlness
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants