Nomos node tree overlay #415
Some of the overlay config will have to go away eventually, as it must be calculated from the chain instead. Overall looks good. I just do not know whether we mixed a couple of branches in this PR.
mixnet/client/src/config.rs
Outdated
#[serde(default = "MixnetClientConfig::default_max_retries")]
pub max_retries: usize,
#[serde(default = "MixnetClientConfig::default_retry_delay")]
Is this from another branch?
No, the other branch had defaults for MixnetNode; I've added these for MixnetClient config parsing.
nodes/nomos-node/config.yaml
Outdated
da:
  da_protocol:
    num_attestations: 1
  backend:
    max_capacity: 10
    evicting_period:
      secs: 3600
      nanos: 0
Same here: this is not from this PR, or is it?
@zeegomo - After changing FlatOverlay to TreeOverlay in nomos node and running the integration tests, the ten-node happy-path network had some panicking nodes and failed to progress through views. @danielSanchezQ helped debug it, and error handling for this case was added. Please review. I've also increased the unhappy case timeout, since the tree overlay added more latency.
tracing::debug!("Failed to gather initial votes");
return Event::None;
@zeegomo This should be OK, right? I have been working with Gusto, and in reality, if nodes fail to gather the initial votes, they should catch up anyway.
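For readers following the thread, the pattern in the hunk above can be sketched roughly as below. This is a minimal, std-only illustration: `Event::Approve`, `VoteStreamClosed`, `gather_initial_votes`, and `on_initial_votes` are hypothetical stand-ins, not the actual nomos-node types or functions, and `eprintln!` replaces the `tracing::debug!` call so the sketch has no dependencies.

```rust
// Hypothetical, simplified types for illustration only;
// these are NOT the actual nomos-node definitions.
#[derive(Debug, PartialEq)]
enum Event {
    Approve { votes: usize },
    None,
}

// Hypothetical error: the vote stream ended before enough votes arrived
// (for example because a timeout closed it).
#[derive(Debug)]
struct VoteStreamClosed;

// Stand-in for the vote-gathering step. `received` is `None` when the
// stream was closed early, `Some(n)` when `n` votes were collected.
fn gather_initial_votes(
    received: Option<usize>,
    threshold: usize,
) -> Result<usize, VoteStreamClosed> {
    match received {
        Some(n) if n >= threshold => Ok(n),
        _ => Err(VoteStreamClosed),
    }
}

// Instead of unwrapping (and panicking) on failure, log the failure and
// emit a no-op event so the node keeps running.
fn on_initial_votes(received: Option<usize>, threshold: usize) -> Event {
    match gather_initial_votes(received, threshold) {
        Ok(votes) => Event::Approve { votes },
        Err(_) => {
            // The hunk under review logs with tracing::debug! here.
            eprintln!("Failed to gather initial votes");
            Event::None
        }
    }
}

fn main() {
    assert_eq!(on_initial_votes(Some(7), 7), Event::Approve { votes: 7 });
    assert_eq!(on_initial_votes(None, 7), Event::None);
}
```

The sketch only shows the "do not panic" part; whether a no-op event is enough for the node to catch up is exactly what the thread is discussing.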
Failing to gather initial votes means we go to the timeout routine, why does this happen now?
I'll review it after our meeting 👍🏼
d076749 to eabdf6f
What were the panics you saw in tests? I suspect they were caused by a timeout triggering the closing of the vote stream.
Those changes seem OK (and we were looking to add them), but since the tree overlay with 1 committee is exactly like a flat one, I'd like to investigate why we need the additional time.
"I've also increased the unhappy case timeout, tree overlay added more latency."
As in, this shouldn't be the case for a 1-committee tree overlay.
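To illustrate why a single-committee tree should behave like a flat overlay, here is a deliberately simplified sketch. It assumes committees are laid out as a binary tree stored in an array and that nodes are assigned round-robin; the `TreeOverlay` type and its methods below are hypothetical and do not mirror the actual nomos-node overlay API or its node assignment.

```rust
// Hypothetical, simplified overlay model; NOT the nomos-node API.
type NodeId = u32;

struct TreeOverlay {
    // Committees stored in breadth-first order; index 0 is the root.
    committees: Vec<Vec<NodeId>>,
}

impl TreeOverlay {
    // Assumption: round-robin assignment, chosen only to keep the sketch short.
    fn new(nodes: &[NodeId], number_of_committees: usize) -> Self {
        assert!(number_of_committees > 0);
        let mut committees = vec![Vec::new(); number_of_committees];
        for (i, node) in nodes.iter().enumerate() {
            committees[i % number_of_committees].push(*node);
        }
        Self { committees }
    }

    fn committee_of(&self, node: NodeId) -> Option<usize> {
        self.committees.iter().position(|c| c.contains(&node))
    }

    fn parent_committee(&self, idx: usize) -> Option<usize> {
        if idx == 0 { None } else { Some((idx - 1) / 2) }
    }

    fn child_committees(&self, idx: usize) -> Vec<usize> {
        vec![2 * idx + 1, 2 * idx + 2]
            .into_iter()
            .filter(|&c| c < self.committees.len())
            .collect()
    }
}

fn main() {
    let nodes: Vec<NodeId> = (0..10).collect();

    // One committee: every node sits in the root, which has no parent and
    // no children, so votes never travel between committees.
    let flat_like = TreeOverlay::new(&nodes, 1);
    assert_eq!(flat_like.committees[0], nodes);
    assert_eq!(flat_like.committee_of(9), Some(0));
    assert_eq!(flat_like.parent_committee(0), None);
    assert!(flat_like.child_committees(0).is_empty());

    // Three committees: the root now has two children, so votes need an
    // extra hop from child committees up to the root.
    let tree = TreeOverlay::new(&nodes, 3);
    assert_eq!(tree.child_committees(0), vec![1, 2]);
    assert_eq!(tree.parent_committee(2), Some(0));
}
```

Under this model, extra latency would only be expected once there is more than one committee, which is the point being raised here.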
tracing::debug!("Failed to gather initial votes");
return Event::None;
"Failing to gather initial votes means we go to the timeout routine, why does this happen now?"
Yes, the issue needs to be investigated. Apparently, beyond a certain number of nodes, some become unresponsive 🤔, starting with the first leader.
I think it's worth investigating this issue now, while the playground is still small; debugging something like this at greater scales is almost impossible.
The panic was caused by executing It's most likely caused by the tree overlay. The increase in the time it takes to complete the test cases is probably related to the added error handling that triggers the unhappy path?
Agreed, I'll ping you about how we could organize this.
eabdf6f to 8e78539
Moved code unrelated to tree overlay to #423
20082e2 to f4f7d62
* report unhappy blocks in the happy path test
I merged all fixes to this PR. If CI passes, I think we can consider merging this.
Added tree overlay to nomos-node. Some configuration parameters were missing in the example config.yaml, so the file was updated.