
Manual round restart fails, verifier uploads response during the round reset #304

Closed
kellpossible opened this issue Jul 13, 2021 · 13 comments · Fixed by #315

@kellpossible

kellpossible commented Jul 13, 2021

Currently the verifiers first upload their verifications to /v1/upload/challenge/... and then call /v1/verifier/try_verify/... to perform verification. Verifiers appear to be able to upload verifications even during the round reset (#288) process, even though they have technically already been removed from the round (though seemingly not in the cached state held by the operator in aleo-setup-coordinator). This is probably a serious problem, and it seems to be causing problems in the restarted round when these files are unexpectedly already present. The same behaviour may also be present for contributors, which could also explain some of the problems encountered while implementing the replacement contributors.

This behaviour can be seen on line 50570 in the log:
coordinator.log

Jul 11 17:02:28.956 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.unverified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.unverified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.958 TRACE aleo_setup_coordinator::api::upload::challenge: Response uploaded from: aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier to transcript/development/round_2/chunk_26/contribution_0.verified
Jul 11 17:02:28.958 TRACE aleo_setup_coordinator::api::upload::challenge: Contribution file signature uploaded from: aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier to transcript/development/round_2/chunk_26/contribution_0.verified.signature
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_2.unverified
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_2.unverified

Another upload then fails because the participant has now been removed. I'm not sure why there isn't a try_verify between these two; perhaps one of these log lines is from try_verify and the other is from the other verifier. It warrants more investigation.

Jul 11 17:02:29.096 TRACE phase1_coordinator::storage::disk: Fetched ./transcript/development/round_0/state.json
Jul 11 17:02:29.096 ERROR aleo_setup_coordinator::operator: Participant aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier is not authorized
Jul 11 17:02:29.096 ERROR phase1_coordinator::coordinator: ParticipantUnauthorized
Jul 11 17:02:29.096 ERROR phase1_coordinator::coordinator: ParticipantUnauthorized
Jul 11 17:02:29.096 ERROR aleo_setup_coordinator::api::upload::challenge: Error uploading next challenge file ParticipantUnauthorized
Jul 11 17:02:29.657 TRACE aleo_setup_coordinator::authentication::authentication_middleware: Authentication for address aleo15fc08f4pcky20f63ldqa97guc4gr554w7zduxmt7yuf3dv7dacxqpflffg message is: "post /v1/verifier/try_lock"
kellpossible added the bug label Jul 13, 2021
@kellpossible

Perhaps this has something to do with the caching of the state...!

@ibaryshnikov

Right now the task is identified by chunk_id and contribution_id. If you reset the round and reassign the tasks, the old tasks may overlap with the new tasks, and there's a chance verifiers/contributors would be able to upload the old work after the round reset. Therefore it's not sound.
Assigning a random number to each task upon creation solves the problem. When a contributor/verifier uploads work, not only should the chunk_id/contribution_id match, but also an additional identifier. On round reset all tasks will receive a new identifier, making it impossible to upload old work.
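
A minimal sketch of that idea, assuming a Task type whose equality includes a random token; the field and type names here are illustrative, not the actual phase1-coordinator definitions:

```rust
use rand::Rng;

/// Hypothetical task identity: equality now includes a random token,
/// so a task issued before a round reset can never match one issued after it.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Task {
    pub chunk_id: u64,
    pub contribution_id: u64,
    /// Random value regenerated whenever the task is (re)created,
    /// e.g. during a round reset.
    pub token: u64,
}

impl Task {
    pub fn new(chunk_id: u64, contribution_id: u64) -> Self {
        Self {
            chunk_id,
            contribution_id,
            token: rand::thread_rng().gen(),
        }
    }
}

/// An upload is only accepted if chunk_id, contribution_id *and* token match.
fn upload_matches(assigned: &Task, uploaded: &Task) -> bool {
    assigned == uploaded
}
```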

kellpossible self-assigned this Jul 17, 2021
@kellpossible

> there's a chance verifiers/contributors would be able to upload the old work after the round reset.

But with the coordinator state write lock obtained during the round reset, and the same lock obtained for reading during contribution upload, it seems like that shouldn't happen. Perhaps these aren't actually the same lock at the moment, due to the state being duplicated in the cache; removing the cache might also fix this problem without a serious performance impact.

@kellpossible

kellpossible commented Jul 17, 2021

Okay, I had a play with eliminating the cache. It doesn't appear to have a noticeable effect on performance (there are no doubt some extra clones in there, but compared to the time the coordinator spends waiting during a round they are small). Whether or not it fixes the problem entirely I don't know; the locks on the coordinator state are still not held for the duration of the operation. At least without the cache it is less likely to be using outdated state (which was previously only updated maybe once every 10 seconds). Perhaps I can modify the operator to lock the coordinator state for the duration of the operation. I'll keep exploring this solution space, but @ibaryshnikov's suggestion also deserves attention.

@kellpossible

kellpossible commented Jul 17, 2021

Implementing a unique id for tasks is no small task itself! It alters task equality, serialization/deserialization, and the basic definition of a task. It is starting to seem like the correct solution, but I'll probably break a lot of stuff with this!

@kellpossible

kellpossible commented Jul 17, 2021

@ibaryshnikov what do you think? We could name contribution files contribution_{contribution_id}_{id}.(verified|unverified)(.signature), where id is a randomly generated uuid created when the task starts, or when tasks are duplicated. One tricky part will be ensuring that every situation where tasks are duplicated for the restarted round (instead of just cloned) receives a new id.
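
A minimal sketch of that naming scheme, assuming the uuid crate; the function and the way the id is threaded through are assumptions, not existing coordinator code:

```rust
use uuid::Uuid;

/// Hypothetical locator name for a contribution file. The per-task uuid
/// (e.g. Uuid::new_v4(), generated when the task is created or duplicated)
/// means files from a pre-reset task can never collide with files for the
/// task reissued after a reset.
/// Example output: contribution_0_550e8400-e29b-41d4-a716-446655440000.verified.signature
fn contribution_file_name(contribution_id: u64, task_id: &Uuid, verified: bool, signature: bool) -> String {
    let state = if verified { "verified" } else { "unverified" };
    let suffix = if signature { ".signature" } else { "" };
    format!("contribution_{}_{}.{}{}", contribution_id, task_id, state, suffix)
}
```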

@kellpossible

kellpossible commented Jul 17, 2021

It really is a big change; everywhere there is the assumption that a task can be reconstructed using only the contribution id and chunk id.

Edit: this is a huge change, it touches practically every part of the coordinator, and in a non-trivial way. E.g. in a Round method, how do we obtain the unique id for a task in the previous round? Currently it can construct locators for tasks from the previous round without worry. This is just one example; there are hundreds.

See the number of build errors on the fix/304-round-restart-fail-task-id branch; for many of them I somehow need to source the task id from somewhere...

Perhaps there is another, less disruptive way to approach this problem.

@kellpossible

kellpossible commented Jul 17, 2021

I'm starting to think that essentially rendering the coordinator server single threaded in the worst case, with a lock over the coordinator state held for the entire duration of each operation, would be easier and probably less disruptive. The aleo-setup-coordinator Operator methods could be updated to require a lock (in effect a lock on the coordinator/round state) that can be shared across multiple methods to achieve an atomic operation. We could add read/write differentiation for the lock so that parallel read-only operations at least don't block each other.

The system requires fundamental changes to make it stable for completely parallel asynchronous operations, so I'm thinking it's probably better to work around this with a lock. I'm hoping it will still be possible to allow contributions to be uploaded in parallel with this solution, but the act of deciding whether to save a contribution to a file or return an error will be blocking/synchronous.
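
A minimal sketch of that shape, assuming tokio's RwLock; the Operator methods and state fields here are illustrative, not the existing aleo-setup-coordinator API:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct CoordinatorState { /* round state, task assignments, ... */ }

struct Operator {
    state: Arc<RwLock<CoordinatorState>>,
}

impl Operator {
    /// Hypothetical upload handler: the write guard is taken once and passed
    /// to every step, so the authorization check and the decision to accept
    /// the file happen atomically with respect to a round reset.
    async fn upload_challenge(&self, participant: &str) -> Result<(), String> {
        let mut state = self.state.write().await;
        self.check_authorized(&state, participant)?;
        self.record_upload(&mut state, participant);
        Ok(())
    }

    fn check_authorized(&self, _state: &CoordinatorState, _participant: &str) -> Result<(), String> {
        // ... consult the (locked) state
        Ok(())
    }

    fn record_upload(&self, _state: &mut CoordinatorState, _participant: &str) {
        // ... mutate the (locked) state
    }
}
```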

@kellpossible

@AleoHQ/aleo-setup what do we think about this?

@kellpossible

kellpossible commented Jul 18, 2021

At the moment we've got locks within locks: Operator contains an RwLock on Coordinator, and Coordinator contains RwLocks on Storage and CoordinatorState. It's difficult for the user of Coordinator to reason about the mutability effects of calling its methods. I think this issue demonstrates that those concerns can affect the user, and it may be worthwhile exposing the mutability concerns of Coordinator and removing the internal locks. This makes Coordinator a little bit trickier to use, but we already have it wrapped in an RwLock, so I don't think it will be much different, and it will simplify/clarify the situation.
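
A minimal sketch of that direction, with the interior locks removed so that mutability shows up in the method signatures; the types and methods here are illustrative, not the actual definitions in the crates:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct Storage { /* ... */ }
struct CoordinatorState { /* ... */ }

/// Hypothetical Coordinator with no internal locks: anything that mutates
/// state takes `&mut self`, so callers can see which calls need exclusive access.
struct Coordinator {
    storage: Storage,
    state: CoordinatorState,
}

impl Coordinator {
    fn try_contribute(&mut self, _chunk_id: u64) { /* mutates storage and state */ }
    fn current_round_height(&self) -> u64 { 0 } // read-only
}

/// The single outer lock lives with the caller (e.g. the Operator).
struct Operator {
    coordinator: Arc<RwLock<Coordinator>>,
}
```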

@kellpossible

I've found another problem: files are created during Round::try_lock_chunk via Storage::initialize, and these files are probably not removed during the round reset. The round reset probably needs to be updated to try removing these files for any locked chunks. Alternatively, we could alter the code to initialize these locators when the contributions actually occur, rather than creating empty files.
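
A minimal sketch of the first option, assuming the reset knows which chunks were locked; the paths and contribution id here are illustrative only:

```rust
use std::fs;
use std::path::Path;

/// Sketch: during a round reset, also remove the locator files that were
/// created when chunks were locked (Round::try_lock_chunk / Storage::initialize),
/// so the restarted round doesn't find stale, empty files.
fn remove_lock_initialized_files(base: &Path, round: u64, locked_chunks: &[u64]) {
    for &chunk_id in locked_chunks {
        let dir = base.join(format!("round_{}/chunk_{}", round, chunk_id));
        for name in ["contribution_1.unverified", "contribution_1.unverified.signature"] {
            // Ignore "not found" errors: the file may never have been created.
            let _ = fs::remove_file(dir.join(name));
        }
    }
}
```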

@kellpossible

As a workaround I tried disabling the storage initialization in Storage::try_lock_chunk. It appears that the locator is expected to be initialized in storage before it can be read during Coordinator::try_contribute: even though the file exists on disk after being uploaded, it has not been registered in the storage manifest, so the read fails. Performing the initialization in try_contribute would conflict with the file that is already on disk. Perhaps we could add a new initialization method that adopts a file which already exists.
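
A minimal sketch of that idea, with a hypothetical initialize_existing alongside the normal initialize; the manifest shape and method names are assumptions, not the actual Storage API:

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Very simplified stand-in for the storage manifest.
struct Storage {
    manifest: HashSet<PathBuf>,
}

impl Storage {
    /// Normal initialization: register the locator and create an empty file.
    fn initialize(&mut self, locator: &Path) -> std::io::Result<()> {
        std::fs::File::create(locator)?;
        self.manifest.insert(locator.to_path_buf());
        Ok(())
    }

    /// Hypothetical variant: adopt a file that was already written to disk
    /// (e.g. by an upload) by registering it in the manifest without touching it.
    fn initialize_existing(&mut self, locator: &Path) -> std::io::Result<()> {
        if !locator.exists() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::NotFound,
                "cannot adopt a locator that does not exist on disk",
            ));
        }
        self.manifest.insert(locator.to_path_buf());
        Ok(())
    }
}
```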

kellpossible added a commit that referenced this issue Jul 24, 2021
+ Remove contributions/verification files that were created when locks were
  taken and the locators initialized.
+ Some more tracing instrumentation.
+ Remove some unnecessary inlines.
kellpossible linked a pull request Jul 24, 2021 that will close this issue
@kellpossible

In #315 I ended up taking the approach of deleting, during the reset, the files that were created when the lock was taken.

apruden2008 added a commit that referenced this issue Jul 26, 2021