Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add persistent ephemeral node support to metadata store #772

Closed
wants to merge 3 commits into from

Conversation

bryanlb
Copy link
Contributor

@bryanlb bryanlb commented Feb 14, 2024

Summary

This adds an optional opt-in mode to run the KaldbMetadataStore using a combination of PersistentNode and PersistentWatcher for ephemeral nodes. This system property flag will be removed once the logic is validated, at which point this will no longer be able to be opt-out.

The goal of this PR is to support using ephemeral nodes across ZK session disconnects (#644 (comment)), and uses the Curator PersistentNode in order to recreate nodes when they disappear, and PersistentWatcher to subscribe to updates to ephemeral nodes by other instances (ie - manager updating cache/recovery ephemerals).

https://curator.apache.org/docs/recipes-persistent-node
https://curator.apache.org/docs/recipes-persistent-watcher

This PR also includes:

  • Passing the meter registry into the metadata stores - in part for testing of this PR, but it will also enable us to start adding meters where appropriate to this class.
  • Fixes to both unclosed metadata stores when they should have been closed (mostly tests), and fixes to closing metadata stores when they shouldn't have been (ie, passed as a param)
  • Deduplication of events on the recovery service. This should probably have a further refactor to match the approach used by the manager, with an eventAggregationWindow given that ZK is eventually consistent.

@bryanlb bryanlb force-pushed the bburkholder/zk-reconnect branch 13 times, most recently from f6f6e79 to 6bcba22 Compare February 15, 2024 22:39
@bryanlb bryanlb force-pushed the bburkholder/zk-reconnect branch 2 times, most recently from b6af00f to d3ca51f Compare February 16, 2024 20:56
@bryanlb bryanlb marked this pull request as ready for review February 16, 2024 20:58
@bryanlb bryanlb marked this pull request as draft February 19, 2024 18:42
@@ -15,12 +16,14 @@ public class CacheSlotMetadataStore extends KaldbPartitioningMetadataStore<Cache
* Initializes a cache slot metadata store at the CACHE_SLOT_ZK_PATH. This should be used to
* create/update the cache slots, and for listening to all cache slot events.
*/
public CacheSlotMetadataStore(AsyncCuratorFramework curatorFramework) throws Exception {
public CacheSlotMetadataStore(AsyncCuratorFramework curatorFramework, MeterRegistry meterRegistry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large part of this diff if adding meterRegistry parameter. It would be far more easier to review the diff if meterRegistry change is done in a separate PR.

Copy link

github-actions bot commented May 9, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label May 9, 2024
@github-actions github-actions bot closed this May 12, 2024
@bryanlb bryanlb deleted the bburkholder/zk-reconnect branch August 26, 2024 17:35
@bryanlb bryanlb restored the bburkholder/zk-reconnect branch August 26, 2024 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants