feat(resharding): resharding archival node #12578
Conversation
Looks good to me, but I'd like someone else to take a look too
Also, maybe we should trigger GC in the tests before merging? Not a strong opinion.
-    fn with_clients(num_clients: u64) -> Self {
+    fn with_validators(num_validators: u64) -> Self {
It would be nice to be able to configure the number of archival, RPC, etc. nodes. Not in this PR though.
Yeah, I just wanted to have at least one archival and one RPC node so that they are tested. Not sure what a bigger number of these would give us, though.
Added TODO
@@ -1036,7 +1080,7 @@ fn test_resharding_v3_base(params: TestReshardingParameters) {
     params
         .loop_actions
         .iter()
-        .for_each(|action| action(&node_datas, test_loop_data, client_handles[0].clone()));
+        .for_each(|action| action(&env.datas, test_loop_data, client_handles[0].clone()));
I think if you get an RPC client here instead of the client at index 0, the rpc_id param should work.
Added TODO, won't do it in this PR.
// and it is not accessible through a regular, RPC node.
assert!(account_exist_in_view_client(
    &mut env,
    num_clients - 1,
So num_clients - 1 is archival and num_clients - 2 is RPC? For clarity, can you assign them to variables and assert that they are truly archival and RPC nodes (if there's a way)?
LGTM, nice!
core/store/src/adapter/trie_store.rs
pub fn get_shard_uid_mapping(store: &Store, child_shard_uid: ShardUId) -> ShardUId {
    store
        .get_ser::<ShardUId>(DBCol::StateShardUIdMapping, &child_shard_uid.to_bytes())
        .expect("get_shard_uid_mapping() failed")
please add the child shard uid to the error message
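A possible way to address this, shown only as a sketch of the requested change (the exact wording in the follow-up commit may differ), is to format the child shard uid into the panic message:
// Sketch: include the child shard uid in the failure message.
pub fn get_shard_uid_mapping(store: &Store, child_shard_uid: ShardUId) -> ShardUId {
    store
        .get_ser::<ShardUId>(DBCol::StateShardUIdMapping, &child_shard_uid.to_bytes())
        .unwrap_or_else(|err| {
            panic!("get_shard_uid_mapping() failed for {child_shard_uid:?}: {err:?}")
        })
        .unwrap_or(child_shard_uid)
}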
-        .get_ser::<ShardUId>(DBCol::StateShardUIdMapping, &shard_uid.to_bytes())
-        .expect("get_key_from_shard_uid_and_hash() failed")
-        .unwrap_or(shard_uid);
+    let mapped_shard_uid = get_shard_uid_mapping(store, shard_uid);
nice
@@ -57,8 +58,7 @@ struct BatchTransaction {
 }

 /// Updates provided cold database from provided hot store with information about block at `height`.
-/// Returns if the block was copied (false only if height is not present in `hot_store`).
-/// Block as `height` has to be final.
+/// Block at `height` has to be final and present in `hot_store`.
If it is not too hard can you assert that this is the case?
If it is not present, the function will return an error.
Regarding finality, it was asserted in cold_store_copy():
if next_height > hot_final_head_height {
    return Err(ColdStoreError::SkippedBlocksBetweenColdHeadAndNextHeightError {
        cold_head_height,
        next_height,
        hot_final_head_height,
    });
}
     num_threads: usize,
-) -> io::Result<bool> {
+) -> io::Result<()> {
What happened to the return value? Was it redundant?
It is redundant after the refactor. Before, it was used to return whether the height exists in the hot store. Now we return an error if it does not exist.
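In other words, the shape of the API changed roughly as follows (a simplified sketch, not the literal nearcore code; the boolean parameter only stands in for the real hot-store lookup):
use std::io;

// Before (sketch): callers had to branch on the returned bool to learn whether
// the block at `height` existed in the hot store.
fn update_cold_db_before(height_exists_in_hot_store: bool) -> io::Result<bool> {
    if !height_exists_in_hot_store {
        return Ok(false);
    }
    // ...copy the block to cold storage...
    Ok(true)
}

// After (sketch): the caller resolves a valid height first, so a missing block
// becomes an outright error instead of an Ok(false) return value.
fn update_cold_db_after(height_exists_in_hot_store: bool) -> io::Result<()> {
    if !height_exists_in_hot_store {
        return Err(io::Error::new(io::ErrorKind::NotFound, "height not found in hot store"));
    }
    // ...copy the block to cold storage...
    Ok(())
}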
for op in trie_changes.insertions() {
    // TODO(reshardingV3) Handle shard_uid not mapped there
    let key = join_two_keys(&shard_uid_key, op.hash().as_bytes());
    // TODO(reshardingV3) Test it properly. Currently this path is not triggered in testloop.
nit: TODO(resharding) is the convention
I was using TODO(reshardingV3) for a few months to differentiate it from old TODOs for the previous resharding V2. Will replace all of them with TODO(resharding) now.
    QueryRequest::ViewAccount { account_id },
);
let result = near_async::messaging::Handler::handle(view_client, msg);
result.is_ok()
Is this sufficient to check that it returned ok?
Yes, the result is Result<QueryResponse, QueryError>, and QueryResponse contains the data of a successful response.
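If a stricter assertion were ever wanted, the check could also match on the response kind instead of only is_ok(); a minimal sketch, assuming the QueryResponse and QueryResponseKind types from near_primitives::views:
use near_primitives::views::{QueryResponse, QueryResponseKind};

// Sketch: accept only a successful ViewAccount response, not just any Ok(_).
fn is_view_account_ok<E>(result: Result<QueryResponse, E>) -> bool {
    matches!(result, Ok(QueryResponse { kind: QueryResponseKind::ViewAccount(_), .. }))
}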
}

let rpc_id = params.rpc_clients[0].clone();
let temporary_account =
- please add a comment explaining what this is for
- do you want this to happen in all tests, or should it be part of the TestParams?
Added a comment. I would prefer it to happen in all tests, because it currently fails in a test where I did not expect it to, so I think it can catch nontrivial bugs.
-    let TestLoopEnv { mut test_loop, datas: node_datas, tempdir } = builder
+    let mut env = builder
Why not unpack? It was a bit nicer when unpacked.
I also do not like it, but it is required by create_account / delete_account.
let receiver_account: AccountId = "account4".parse().unwrap();
let account_1_in_stable_shard: AccountId = "account1".parse().unwrap();
let account_2_in_stable_shard: AccountId = "account2".parse().unwrap();
let rpc_id = params.rpc_clients[0].clone();
let params = TestReshardingParameters::new()
Please avoid creating params twice; instead reuse the one that you have above.
Same in other tests.
nit: The following pattern may be a bit nicer:
let rpc_id = ...;
let params = TestReshardingParameters::new()
    .rpc_id(rpc_id)
    ...
Mistakenly called new() twice in two tests, fixed it now.
But in general, I need:
let params = TestReshardingParameters::new();
let rpc_id = params.rpc_clients[0].clone();
let params = params
    .add_loop_action(call_burn_gas_contract(
        ...
        rpc_id,
    ))
because rpc_id is calculated in TestReshardingParameters::new().
Are you ok with that?
nearcore/src/cold_storage.rs
// Here it should be sufficient to just read from hot storage.
// Because BlockHeight is never garbage collectable and is not even copied to cold.
match hot_store.get_ser::<CryptoHash>(DBCol::BlockHeight, &next_height.to_le_bytes())? {
    Some(next_height_block_hash) => break next_height_block_hash,
That's fancy, this pattern is not commonly used. Would it be any more readable in a dedicated function?
I do not see how to make it more readable with a dedicated function.
Would this be good enough?
let next_height_block_hash =
    hot_store.get_ser::<CryptoHash>(DBCol::BlockHeight, &next_height.to_le_bytes())?;
if let Some(next_height_block_hash) = next_height_block_hash {
    break next_height_block_hash;
}
next_height = next_height + 1;
Don't worry about it then, it's fine.
// TODO(resharding) Test chunk validators too (would need to change the lines below).
base_epoch_config.num_block_producer_seats = params.num_validators;
base_epoch_config.num_chunk_producer_seats = params.num_validators;
base_epoch_config.num_chunk_validator_seats = params.num_validators;
Without this, the shard shuffling test failed because the default number of seats was 100, and it treated the RPC and archival nodes as validators.
Indeed, I've run into the same issue while adding other node types.
Could you please check my comment in nearcore/src/cold_storage.rs before merging?
if !col.is_cold() {
    return false;
}
if col == &DBCol::StateShardUIdMapping && !is_last_block_in_epoch {
Not a big fan of this is_last_block_in_epoch check here. Is it possible to just remove this and copy DBCol::StateShardUIdMapping at every height? I don't suppose it would add too much inefficiency, would it?
Yeah, my preference is to copy it at each height too, for simplicity. It is just that during the team discussion we decided to copy it in the last block of the epoch. @wacban @Trisfald @Longarithm wdyt?
Yeah, personally, since you made the effort to implement the logic to copy only at the last block, I'm ok with keeping it as is. Not a strong preference.
Oh sorry, I didn't know it was already discussed. We can keep the majority preference.
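Put together, the filter discussed here amounts to roughly the following (a sketch reconstructing the truncated hunk above; the closure name and exact placement are illustrative):
// Sketch: copy only cold columns, and copy StateShardUIdMapping only when the
// block being copied is the last block of its epoch.
let should_copy_column = |col: &DBCol, is_last_block_in_epoch: bool| -> bool {
    if !col.is_cold() {
        return false;
    }
    if col == &DBCol::StateShardUIdMapping && !is_last_block_in_epoch {
        return false;
    }
    true
};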
@@ -173,6 +173,17 @@ impl<'a> TrieStoreUpdateAdapter<'a> {
     }
 }

+/// Get the `ShardUId` mapping for child_shard_uid. If the mapping does not exist, map the shard to itself.
+/// Used by Resharding V3 for State mapping.
+pub fn get_shard_uid_mapping(store: &Store, child_shard_uid: ShardUId) -> ShardUId {
Could we push this pub function to TrieStoreAdapter? In cold_storage.rs we can call hot_store.trie_store().get_shard_uid_mapping(...).
I would like to, but it is problematic; please see #12232 (comment).
Aah okay! Yes, we don't want to be cloning store I guess. Could you add a quick comment just so that next time we have context for why it's where it is?
Added here: f46714f
let shard_layout = epoch_manager.get_shard_layout(&epoch_id)?;
let is_last_block_in_epoch =
    epoch_manager.is_next_block_epoch_start(&next_height_block_hash)?;
update_cold_db(
Wait, this isn't correct, right? Shouldn't we be calling the update_cold_db function for all heights between cold_head_height + 1 and hot_final_head_height? It should be within the loop, right?
The function "copies data for the next available produced block after current cold store head". The loop exists in order to find the first available block in the hot store.
No, but in the original code, update_cold_db is being called in the loop for each height, right? That has side effects?
The function first checked whether the block is available and returned false if it wasn't:
https://github.com/near/nearcore/pull/12578/files#diff-33e110068d4d20b9833e59ebb06676f0fcbdc3e79b11241c50a969727d5642deL88-L91.
Instead, we now find the correct height first, then call update_cold_db() with the correct height and the is_last_block_in_epoch info.
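A minimal, self-contained sketch of that control flow (names and signatures here are illustrative, not the real nearcore ones; lookup_block_hash stands in for the hot-store read of DBCol::BlockHeight):
// Sketch: find the first height above the cold head that has a block in the hot
// store; the caller then computes is_last_block_in_epoch for that block and calls
// update_cold_db() exactly once.
fn find_next_cold_copy_height(
    cold_head_height: u64,
    hot_final_head_height: u64,
    lookup_block_hash: impl Fn(u64) -> Option<[u8; 32]>,
) -> Result<(u64, [u8; 32]), String> {
    let mut next_height = cold_head_height + 1;
    loop {
        if next_height > hot_final_head_height {
            // Corresponds to SkippedBlocksBetweenColdHeadAndNextHeightError above.
            return Err(format!("no block found between cold head and final head ({next_height})"));
        }
        // BlockHeight is never garbage collected, so reading the hot store suffices.
        if let Some(hash) = lookup_block_hash(next_height) {
            return Ok((next_height, hash));
        }
        next_height += 1;
    }
}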
Ohhh..... Originally poorly written 😢
Apologies for the panic!()
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12578      +/-   ##
==========================================
- Coverage   70.46%   70.45%   -0.01%
==========================================
  Files         845      845
  Lines      172261   172287      +26
==========================================
+ Hits       121377   121387      +10
- Misses      45759    45789      +30
+ Partials     5125     5111      -14
The mapping from child to parent shard is stored in DBCol::StateShardUIdMapping, which is marked as a cold column. By default, it would be copied from hot to cold storage at each block.
In this PR, we add an is_last_block_in_epoch parameter to update_cold_db() so that the mapping is only copied once per epoch. That required a slight refactor of cold_store_copy(). If skips at the epoch boundary are possible, then cold_store_copy() had a bug where the previous epoch's shard layout could be used to copy data for a block from the new shard layout.
Added an archival node and an RPC node to the resharding V3 testloop.
In the testloop, we create a temporary account before resharding and remove it after resharding. After the GC period, the account is available at the archival node and not available at the RPC node.
However, this works without actually exercising the cold_store_copy path, because neither cold_store_copy nor garbage collection is called for the archival node in the testloop. So the testloop needs fixing, plus follow-up testing.
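The end-of-test check then looks roughly like this (a sketch based on the account_exist_in_view_client helper quoted earlier; the argument list and the assumption that the archival node is the last client and the RPC node the second-to-last are taken from the review discussion above):
// Sketch: after the GC period, the deleted temporary account should still be
// readable through the archival node's view client, but not through the RPC node's.
let archival_idx = num_clients - 1;
let rpc_idx = num_clients - 2;
assert!(account_exist_in_view_client(&mut env, archival_idx, temporary_account.clone()));
assert!(!account_exist_in_view_client(&mut env, rpc_idx, temporary_account));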