opt: optimize cluster identification #3309

Open · wants to merge 8 commits into main from tmp/opt/optimize-cluster-identification

Conversation

@tediou5 (Contributor) commented Dec 11, 2024

Second attempt to close #2900

The first commit is purely mechanical; it simply renames cache_id to piece_cache_id.

For the cache, everything is straightforward; it's just a matter of recording the corresponding relationships in the controller. However, things are a little more complicated for the farmer. First, we check the identify message to see whether the farmer is newly discovered and whether its fingerprint has changed. Based on the result, we decide whether to use the stream to fetch details.
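Roughly, the decision logic looks like this (a minimal sketch; `FarmerIdentify`, `KnownFarmer` and `on_identify` are hypothetical stand-ins for illustration, not the actual types in this PR):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical identify message; field names are illustrative only
struct FarmerIdentify {
    farmer_id: u64,
    fingerprint: u64,
}

struct KnownFarmer {
    fingerprint: u64,
    last_identification: Instant,
}

/// Returns `true` when full details should be (re)fetched via a stream request
fn on_identify(known: &mut HashMap<u64, KnownFarmer>, msg: &FarmerIdentify) -> bool {
    match known.entry(msg.farmer_id) {
        // Newly discovered farmer: details must be fetched
        Entry::Vacant(entry) => {
            entry.insert(KnownFarmer {
                fingerprint: msg.fingerprint,
                last_identification: Instant::now(),
            });
            true
        }
        Entry::Occupied(mut entry) => {
            let farmer = entry.get_mut();
            // A fingerprint change means farms were added or removed, so
            // details need refreshing; otherwise only bump the last-seen time
            let changed = farmer.fingerprint != msg.fingerprint;
            farmer.fingerprint = msg.fingerprint;
            farmer.last_identification = Instant::now();
            changed
        }
    }
}
```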

The final commit ensures compatibility with previous approaches.

(I'm really sorry; I actually finished this a long time ago, but I forgot about it and left it sitting in a corner.)

Code contributor checklist:

@tediou5 changed the title from "Tmp/opt/optimize cluster identification" to "opt: optimize cluster identification" on Dec 11, 2024
@tediou5 (Contributor, Author) commented Dec 16, 2024

@nazar-pc Additionally, during actual development I found that this part of the code is not easy to test, and some scenarios are hard to cover (like a farm FingerprintUpdated). At the very least, I need to start three components (NATS, controller, and farmer/cache) to do so. I was thinking maybe I could first submit a PR that extracts the update logic for caches and farms and covers it with enough test cases?
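For instance, once the logic is extracted into a pure state machine like the hypothetical `on_identify` helper sketched above, a unit test could run without starting NATS, the controller, or the farmer at all:

```rust
#[test]
fn refetches_details_only_when_new_or_fingerprint_changed() {
    let mut known = HashMap::new();

    // Newly discovered farmer triggers a details fetch
    assert!(on_identify(&mut known, &FarmerIdentify { farmer_id: 1, fingerprint: 10 }));
    // Identification with the same fingerprint: no refetch
    assert!(!on_identify(&mut known, &FarmerIdentify { farmer_id: 1, fingerprint: 10 }));
    // The FingerprintUpdated scenario: details are refetched
    assert!(on_identify(&mut known, &FarmerIdentify { farmer_id: 1, fingerprint: 11 }));
}
```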

@nazar-pc (Member) left a comment

Thanks for the contribution, and sorry it took this long to read into it. This is certainly the right direction, but it will cause issues for the way maintenance of caches and farms is done (it prevents the loop with select! from actually looping quickly, which was carefully avoided before).

I only left comments on the cache side, but similar comments apply to the farmer side as well.

I also don't fully understand why the thing we're trying to address here was sort of added back at the end; I'm confused.

And please rebase after further updates (if any) and squash changes to the same part of the codebase; it'll be easier to review that way.

crates/subspace-farmer/src/cluster/cache.rs (outdated)
```rust
pub struct ClusterCacheDetailsRequest;

impl GenericStreamRequest for ClusterCacheDetailsRequest {
    const SUBJECT: &'static str = "subspace.cache.*.details";
```
@nazar-pc (Member):

Please add a comment just like in other places

Suggested change:

```diff
-    const SUBJECT: &'static str = "subspace.cache.*.details";
+    /// `*` here stands for cache ID
+    const SUBJECT: &'static str = "subspace.cache.*.details";
```

@tediou5 (Contributor, Author):

Since the subject instance for the farmer's request can be either a cluster farmer ID or a single farm ID, I have added comments to the farmer's subject as well.

crates/subspace-farmer/src/cluster/cache.rs (outdated)
Comment on lines +55 to 58
```rust
pub enum CacheId {
    /// Cache ID
    Ulid(Ulid),
}
```
@nazar-pc (Member):

This doesn't have to use Ulid (though it doesn't hurt, it just makes implementing encoding/decoding more verbose), and I'd call it ClusterCacheInstance (and similarly for the farmer). In that case there is no need to rename things in many places, and it will prevent confusion between PieceCacheId and CacheId.

@tediou5 (Contributor, Author):

Just to confirm, do you mean something like using first_cache_id.to_string() as the ClusterCacheInstance? Or perhaps ClusterCacheInstance(CacheId)?

Comment on lines 200 to 215
```diff
-} = identify_message;
-if known_caches.update(cache_id, max_num_elements, nats_client) {
-    info!(
-        %cache_id,
-        "New cache discovered, scheduling reinitialization"
-    );
-    scheduled_reinitialization_for.replace(
-        Instant::now() + SCHEDULE_REINITIALIZATION_DELAY,
-    );
-} else {
-    trace!(
-        %cache_id,
-        "Received identification for already known cache"
-    );
-}
+known_caches.update_cache(
+    cache_id,
+    max_num_elements,
+    &mut scheduled_reinitialization_for,
+    nats_client,
+    async {
+        nats_client
+            .stream_request(
+                &ClusterCacheDetailsRequest,
+                Some(&cache_id.to_string()),
+            )
+            .await
+            .inspect_err(|error| warn!(
+                %error,
+                %cache_id,
+                "Failed to request farmer farm details"
+            ))
+            .ok()
+    },
+).await
```
@nazar-pc (Member):

There are a couple of issues with this:

  1. You can't .await here because it'll prevent other branches of select! from making progress in the meantime, which will likely cause other caches to disconnect, causing further issues
  2. It is a confusing API to create a future here that is actually awaited inside, with more work done internally

KnownCaches is designed as a simple state machine; it shouldn't do any async work, only update its state based on inputs.

What should happen is a background task like farms_to_add_remove + farm_add_remove_in_progress that manages sequential addition of farms (it doesn't actually need to be globally sequential, just per individual farm, but it was easier to implement that way; parallelizing it might be a good improvement though). See the sketch below.
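A minimal sketch of that pattern, assuming tokio; all names here are illustrative stand-ins, not the actual items in this codebase, and an mpsc channel stands in for the NATS identify subscription:

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;

type CacheId = u64; // stand-in for the real cache ID type

/// Hypothetical stand-in for the stream request that collects cache details
async fn fetch_cache_details(_cache_id: CacheId) {}

async fn event_loop(mut identify_rx: tokio::sync::mpsc::Receiver<CacheId>) {
    let mut caches_to_add: VecDeque<CacheId> = VecDeque::new();
    let mut cache_add_in_progress: Option<Pin<Box<dyn Future<Output = ()> + Send>>> = None;

    loop {
        // Start the next queued addition if none is currently in flight
        if cache_add_in_progress.is_none() {
            if let Some(cache_id) = caches_to_add.pop_front() {
                cache_add_in_progress = Some(Box::pin(fetch_cache_details(cache_id)));
            }
        }

        tokio::select! {
            maybe_cache_id = identify_rx.recv() => {
                let Some(cache_id) = maybe_cache_id else { break };
                // Pure state update: nothing is awaited in this branch, so
                // select! keeps looping quickly; deduplicating the queue means
                // repeated identify messages don't schedule duplicate work
                if !caches_to_add.contains(&cache_id) {
                    caches_to_add.push_back(cache_id);
                }
            }
            // A disabled branch is never polled, so the unwrap() inside the
            // async block cannot panic while the guard is false
            _ = async { cache_add_in_progress.as_mut().unwrap().await },
                if cache_add_in_progress.is_some() =>
            {
                cache_add_in_progress = None;
            }
        }
    }
}
```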

@tediou5 (Contributor, Author):

Sorry, I hadn't paid much attention to these details before. Yes, this would block the select. I think I need to carefully adjust the logic here (including for the farmer). Also, do you think I should add an extra flag indicating that a ClusterCache (or ClusterFarmer) is currently being updated, to prevent duplicate update tasks from being queued? Given the current identification interval, such a situation shouldn't occur, but I'd like your input.

@nazar-pc (Member):

The farmer right now doesn't need any flags because it queues things and processes them sequentially. As mentioned, that is not the most efficient approach, but it is one of the simplest to implement. I think it'll be fine to do the same here.

On the farmer side, since we already have such infrastructure, we might be able to reuse it and parallelize processing of multiple farms belonging to the same farmer (since the add/remove unit is now the farmer, not the farm); see the sketch below.

My suggestion would be not to change the cache and farmer at the same time. Start with the cache; once that is done and you have experience with how it works, proceed with the farmer, which is a bit more involved but conceptually similar.
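A rough sketch of that per-farmer parallelization, assuming the futures crate; `fetch_farm_details` and `add_farmer_farms` are hypothetical helpers, not the actual functions in this codebase:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

/// Hypothetical per-farm details fetch; in the real code this would be a
/// stream request to the farmer
async fn fetch_farm_details(farm_id: u64) -> u64 {
    farm_id
}

/// Add all farms belonging to a single farmer, fetching their details
/// concurrently, while farmers themselves can still be processed one at a
/// time by the outer queue
async fn add_farmer_farms(farm_ids: Vec<u64>) {
    let mut additions: FuturesUnordered<_> =
        farm_ids.into_iter().map(fetch_farm_details).collect();

    while let Some(farm_id) = additions.next().await {
        // Insert each farm into the known state as soon as its fetch completes
        println!("farm {farm_id} added");
    }
}
```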

crates/subspace-farmer/src/cluster/controller/caches.rs (outdated)
@nazar-pc requested review from teor2345 and removed the request for shamil-gadelshin on January 14, 2025 02:07
@tediou5 (Contributor, Author) left a comment

Yes, let me adjust the code. Once I've completed my modifications, I'll take care of those annoying merges.

@teor2345 (Member) left a comment

This looks good to me, once Nazar's comments have been addressed

@tediou5 force-pushed the tmp/opt/optimize-cluster-identification branch from b633d33 to beb798c on January 15, 2025 09:18
@tediou5 (Contributor, Author) commented Jan 15, 2025

No code changes were made; I just squashed the changes via rebase. Additionally, I removed ClusterCacheIdentifyPieceCacheBroadcast (and the farmer equivalent): they had been reintroduced in a separate commit, and I simply dropped that commit. Nazar's comments will be addressed in subsequent commits.

@tediou5 force-pushed the tmp/opt/optimize-cluster-identification branch 2 times, most recently from d368b04 to 4246b58 on January 20, 2025 05:33
@tediou5 force-pushed the tmp/opt/optimize-cluster-identification branch from 4246b58 to cda20e5 on January 20, 2025 05:49
@tediou5 (Contributor, Author) commented Jan 20, 2025

I’ve rearranged the commit order to make squashing easier later.

@nazar-pc I finished the cache implementation (it’s relatively straightforward), so you can review it for any potential issues.

91b3dd4: When a new cache appears, the system will collect the stream in the background and update KnownCaches once it’s done.

Before making changes to the farmer, perhaps I could submit a separate PR to add or remove farms in parallel? It doesn't look too complex right now (and might even simplify the implementation).

@tediou5 force-pushed the tmp/opt/optimize-cluster-identification branch from cda20e5 to 75b0755 on January 20, 2025 10:13
@tediou5 (Contributor, Author) commented Jan 20, 2025

The farmer's work turned out to be simpler than I imagined, and it's also done.

1704ed5 is pure refactoring and code movement, with no functional changes.

75ddfe6 is the actual modification, but the logic hasn't changed much after the refactoring; it's just split into two parts, with no other changes.

@tediou5 requested a review from teor2345 on January 20, 2025 10:18
Development

Successfully merging this pull request may close these issues: Optimize farm and cache identification (#2900)