
Sampling backend implementation #709

Merged — 9 commits merged into master on Sep 3, 2024

Conversation

@holisticode (Contributor) commented on Aug 28, 2024:

This PR adds the backend implementation to the sampling service.

@holisticode holisticode marked this pull request as ready for review August 29, 2024 21:04
@bacv (Member) commented on Aug 30, 2024:

Are you planning to add network functionality to the sampler backend in this PR?

@danielSanchezQ (Collaborator) replied:

> Are you planning to add network functionality to the sampler backend in this PR?

There is no networking in the backend. Everything is piped from the service itself.

}
Some(sampling_message) = sampling_message_stream.next() => {
Self::handle_sampling_message(sampling_message, &mut sampler).await;
// cleanup not on time samples
sampler.prune();
@bacv (Member) commented:

As I understand it, pruning is now done only after some interaction with the service. Could there be a situation in which sampling was triggered well before old_blobs_check_duration, for some reason no other interactions happen with the service, and the block producer requests an outdated validated blob (because pruning only happens after a message is handled)? Maybe we should have a ticker for pruning?

@danielSanchezQ (Collaborator) replied:

A ticker is probably a good idea. Let's do this: delegate ticker creation to the backend (we need to add the method to the trait), since it is the one that should know about timings; then in the service we prune on tick. Nice catch @bacv!

@holisticode (Contributor, author) replied:

Yes, that was the initial idea with the thread in the backend, which we then moved out to the service. A combination of the two sounds like the best approach: if I understand correctly, a ticker in the backend, but pruning executed from the service.

@danielSanchezQ (Collaborator) replied on Aug 30, 2024:

The backend is just responsible for building the ticker (as it holds the proper configuration) and for pruning. The service will call prune when the ticker ticks, as it owns the main loop.
Specifically, the backend needs to return an Interval.
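The split discussed here can be sketched as follows. This is an illustrative sketch, not the PR's actual trait or type names: the backend owns the timing configuration and the prune logic, while the service owns the main loop. In the real service the returned duration would seed a tokio ticker (e.g. `tokio::time::interval(backend.prune_interval())`), and prune would run on every tick; the sketch below uses only std types so it stands alone.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical trait: the backend exposes its timing config and prune logic.
trait SamplingBackend {
    /// Period the service should use when building its prune ticker.
    fn prune_interval(&self) -> Duration;
    /// Drop pending samplings older than the configured cutoff.
    fn prune(&mut self);
}

struct Sampler {
    old_blobs_check_duration: Duration,
    // blob_id -> time sampling started (u64 blob ids for brevity)
    pending_sampling_blobs: HashMap<u64, Instant>,
}

impl SamplingBackend for Sampler {
    fn prune_interval(&self) -> Duration {
        self.old_blobs_check_duration
    }

    fn prune(&mut self) {
        let cutoff = self.old_blobs_check_duration;
        // Keep only samplings that started within the cutoff window.
        self.pending_sampling_blobs
            .retain(|_, started| started.elapsed() < cutoff);
    }
}

fn main() {
    // With a zero cutoff, every pending blob counts as stale and is pruned.
    let mut stale = Sampler {
        old_blobs_check_duration: Duration::ZERO,
        pending_sampling_blobs: HashMap::from([(1u64, Instant::now())]),
    };
    stale.prune();
    assert!(stale.pending_sampling_blobs.is_empty());

    // With a generous cutoff, freshly started samplings survive pruning.
    let mut fresh = Sampler {
        old_blobs_check_duration: Duration::from_secs(3600),
        pending_sampling_blobs: HashMap::from([(1u64, Instant::now())]),
    };
    fresh.prune();
    assert_eq!(fresh.pending_sampling_blobs.len(), 1);
}
```

The design point is that only the service's select loop ever calls `prune`, so the backend stays free of any async or networking concerns.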

@bacv (Member) left a review:

Looks good! The tests help to follow the flow.

@@ -23,6 +23,7 @@ tracing = "0.1"
thiserror = "1.0.63"
rand = "0.8.5"
rand_chacha = "0.3.1"
chrono = "0.4.38"
@bacv (Member) commented:

Nit: this is not needed anymore.

@danielSanchezQ (Collaborator) left a review:

Looking good overall.

Comment on lines 88 to 89
println!("{}", self.settings.num_samples);
println!("{}", ctx.subnets.len());
@danielSanchezQ (Collaborator) commented:

Leftover?
You can instrument the function if you want to debug it.

Comment on lines 51 to 52
// TODO there is no logging at all here. Should we not do some logging?
// Or can we assume that it is done on an upper level by the clients?
@danielSanchezQ (Collaborator) replied:

We can add whatever logging we need here; it will then be wrapped in the service context.

@holisticode (Contributor, author) replied:

Added just two tracing::info calls.

}
}

async fn next_prune_interval(&self) -> Interval {
@danielSanchezQ (Collaborator) commented:

Nitpick: prune_interval

Comment on lines 90 to 102
match ctx.subnets.len() {
// sampling of this blob_id terminated successfully
len if len == self.settings.num_samples as usize => {
self.validated_blobs.insert(blob_id);
// cleanup from pending samplings
self.pending_sampling_blobs.remove(&blob_id);
}
len if len > self.settings.num_samples as usize => {
unreachable!("{}", "more subnets than expected after sampling success!");
}
// do nothing if smaller
_ => {}
}
@danielSanchezQ (Collaborator) commented:

Suggested change (replace the match with a single if):

// sampling of this blob_id terminated successfully
if ctx.subnets.len() == self.settings.num_samples as usize {
    self.validated_blobs.insert(blob_id);
    // cleanup from pending samplings
    self.pending_sampling_blobs.remove(&blob_id);
}

@holisticode (Contributor, author) replied:

I am not much of a fan of hiding away situations which should not happen but which, if they do happen, are critical if not fatal. Why would we do that? Yes, the code is nicer. Is there any place in the rest of the codebase that makes this error impossible?

@danielSanchezQ (Collaborator) replied:

As we discussed privately: it is not possible for it to be more than the expected threshold. We increment one by one (ensured by &mut self); when we have enough, the queue is empty and the blob is considered validated. Something else may happen, but not that.

Comment on lines 116 to 118
// TODO: This also would be an error which should never happen, but what if a client starts
// init_sampling of a blob which is already pending? Or worse, which already is validated?
// Should we not therefore return an error here?
@danielSanchezQ (Collaborator) commented:

Imo this object should be idempotent. Why? Because we cannot ensure that only a single service will try to start sampling or check how things are going (even if for now that is the case). So maybe here we could add a check and only start sampling if it wasn't pending already. wdyt?

@holisticode (Contributor, author) replied:

Currently, if already sampling, this would result in the network adapter resending SampleRequest messages. Out of band you suggested an enum return signaling the state, so that's what I did here.
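The idempotent, state-signaling version discussed here could look something like the sketch below. The enum and method names are illustrative guesses, not the PR's actual API: the key point is that calling init_sampling twice for the same blob reports "already tracking" instead of kicking off duplicate SampleRequest messages.

```rust
use std::collections::HashSet;

// Hypothetical state returned by an idempotent init_sampling.
#[derive(Debug, PartialEq)]
enum SamplingState {
    Init,       // sampling newly started for this blob
    Tracking,   // already pending: do not resend SampleRequests
    Terminated, // already validated: nothing left to do
}

struct Sampler {
    pending_sampling_blobs: HashSet<u64>, // u64 blob ids for brevity
    validated_blobs: HashSet<u64>,
}

impl Sampler {
    fn init_sampling(&mut self, blob_id: u64) -> SamplingState {
        if self.validated_blobs.contains(&blob_id) {
            return SamplingState::Terminated;
        }
        // `insert` returns false when the id was already present,
        // which is exactly the "already sampling" case.
        if !self.pending_sampling_blobs.insert(blob_id) {
            return SamplingState::Tracking;
        }
        SamplingState::Init
    }
}

fn main() {
    let mut sampler = Sampler {
        pending_sampling_blobs: HashSet::new(),
        validated_blobs: HashSet::from([9u64]),
    };
    assert_eq!(sampler.init_sampling(1), SamplingState::Init);
    assert_eq!(sampler.init_sampling(1), SamplingState::Tracking);
    assert_eq!(sampler.init_sampling(9), SamplingState::Terminated);
}
```

The caller (the service's network adapter) would then only send SampleRequests when it sees Init, making repeated calls from multiple services harmless.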

Comment on lines 121 to 125
// TODO: in most of these error cases we can't get the blob_id from the error
// Shouldn't the error contain that?
// We can of course stop tracking that blob_id in the backend via timeout,
// which we want to have anyways, but could it be nicer to remove it here too,
// by calling the handler_sampling_error method?
@danielSanchezQ (Collaborator) commented:

This is a specific error. We can try to get the blob_id from the error; if it has one, we short-circuit the sampling (calling the error handling method with the blob_id). Otherwise we wait until the time expires. It can be done here or in a different PR.
Steps would be:

  • Create a blob_id method for the sampling error which returns an Option<BlobId>
  • Call the method here and, if it is Some(blob_id), call the error handling method
  • Log it otherwise
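The steps above can be sketched as follows. The error variants and BlobId type are hypothetical stand-ins (the PR's actual SamplingError will differ); what matters is the Option<BlobId> accessor and the branch between short-circuiting and merely logging.

```rust
// Hypothetical blob id type for illustration.
type BlobId = u64;

// Hypothetical error type: some variants carry the blob id, some don't.
#[derive(Debug)]
enum SamplingError {
    Deserialize { blob_id: Option<BlobId> },
    Protocol { blob_id: BlobId, reason: String },
    Io(String),
}

impl SamplingError {
    /// Step 1: return the blob id if this error variant carries one.
    fn blob_id(&self) -> Option<BlobId> {
        match self {
            SamplingError::Deserialize { blob_id } => *blob_id,
            SamplingError::Protocol { blob_id, .. } => Some(*blob_id),
            SamplingError::Io(_) => None,
        }
    }
}

/// Steps 2–3: short-circuit when the error identifies the blob, otherwise
/// just log it and let the prune timeout clean the entry up later.
/// `short_circuited` is a stand-in for calling handle_sampling_error(id).
fn handle_error(err: &SamplingError, short_circuited: &mut Vec<BlobId>) {
    match err.blob_id() {
        Some(id) => short_circuited.push(id),
        None => eprintln!("sampling error without blob_id: {err:?}"),
    }
}

fn main() {
    let proto = SamplingError::Protocol { blob_id: 42, reason: "subnet unreachable".into() };
    assert_eq!(proto.blob_id(), Some(42));
    assert_eq!(SamplingError::Io("timeout".into()).blob_id(), None);

    let mut short_circuited = Vec::new();
    handle_error(&proto, &mut short_circuited);
    assert_eq!(short_circuited, vec![42]);
}
```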

@holisticode (Contributor, author) replied:

Added in this PR.

@danielSanchezQ (Collaborator) left a review:

Made a tiny refactor; I'm merging this after CI is 🟢.
Good job! Thanks!

@danielSanchezQ danielSanchezQ merged commit efff80d into master Sep 3, 2024
11 checks passed
@danielSanchezQ danielSanchezQ deleted the sampling-service branch September 3, 2024 08:25