
Manual round restart fails, verifier uploads response during the round reset #304

Closed
kellpossible opened this issue Jul 13, 2021 · 13 comments · Fixed by #315

@kellpossible

kellpossible commented Jul 13, 2021

Currently the verifiers first upload their verifications to /v1/upload/challenge/... and then call /v1/verifier/try_verify/... to perform verification. Verifiers appear to be able to upload verifications even during the round reset (#288) process, even though they have technically already been removed from the round (though seemingly not in the cached state held by the operator in aleo-setup-coordinator). This is probably a serious problem, and it seems to be causing problems in the restarted round when these files are unexpectedly already present. The same behaviour may also be present for contributors, which could also explain some of the problems encountered while implementing the replacement contributors.

This behaviour can be seen on line 50570 in the log:
coordinator.log

Jul 11 17:02:28.956 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.unverified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.unverified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.unverified.signature
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.957 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.958 TRACE aleo_setup_coordinator::api::upload::challenge: Response uploaded from: aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier to transcript/development/round_2/chunk_26/contribution_0.verified
Jul 11 17:02:28.958 TRACE aleo_setup_coordinator::api::upload::challenge: Contribution file signature uploaded from: aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier to transcript/development/round_2/chunk_26/contribution_0.verified.signature
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.verified
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.958 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed ./transcript/development/round_1/chunk_36/contribution_1.verified.signature
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removing file ./transcript/development/round_1/chunk_36/contribution_2.unverified
Jul 11 17:02:28.959 TRACE reset_round{round=1}: phase1_coordinator::storage::disk: Removed file ./transcript/development/round_1/chunk_36/contribution_2.unverified

Another upload then fails because the participant has now been removed. I'm not sure why there isn't a try_verify between these two; perhaps one of these log lines is from try_verify and the other is from the other verifier. It warrants more investigation.

Jul 11 17:02:29.096 TRACE phase1_coordinator::storage::disk: Fetched ./transcript/development/round_0/state.json
Jul 11 17:02:29.096 ERROR aleo_setup_coordinator::operator: Participant aleo1em2sc2e24f6kg6ksr690etmz6tnp7h4tjekedwnhukvqy2snkuxs4lw67p.verifier is not authorized
Jul 11 17:02:29.096 ERROR phase1_coordinator::coordinator: ParticipantUnauthorized
Jul 11 17:02:29.096 ERROR phase1_coordinator::coordinator: ParticipantUnauthorized
Jul 11 17:02:29.096 ERROR aleo_setup_coordinator::api::upload::challenge: Error uploading next challenge file ParticipantUnauthorized
Jul 11 17:02:29.657 TRACE aleo_setup_coordinator::authentication::authentication_middleware: Authentication for address aleo15fc08f4pcky20f63ldqa97guc4gr554w7zduxmt7yuf3dv7dacxqpflffg message is: "post /v1/verifier/try_lock"
kellpossible added the bug label Jul 13, 2021
@kellpossible

Perhaps this has something to do with the caching of the state...!

@ibaryshnikov

Right now the task is identified by chunk_id and contribution_id. If you reset the round and reassign the tasks, the old tasks may overlap with the new tasks, and there's a chance verifiers/contributors would be able to upload the old work after the round reset. Therefore it's not sound.
Assigning a random number to each task upon creation solves the problem. When a contributor/verifier uploads work, not only should the chunk_id/contribution_id match, but also an additional identifier. On round reset all tasks will receive a new identifier, making it impossible to upload old work.
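
A minimal sketch of that idea, assuming a Task type whose equality includes a random token; the field and type names here are illustrative, not the actual phase1-coordinator definitions:

```rust
use rand::Rng;

/// Hypothetical task identity: equality now includes a random token,
/// so a task issued before a round reset can never match one issued after it.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Task {
    pub chunk_id: u64,
    pub contribution_id: u64,
    /// Random value regenerated whenever the task is (re)created,
    /// e.g. during a round reset.
    pub token: u64,
}

impl Task {
    pub fn new(chunk_id: u64, contribution_id: u64) -> Self {
        Self {
            chunk_id,
            contribution_id,
            token: rand::thread_rng().gen(),
        }
    }
}

/// An upload is only accepted if chunk_id, contribution_id *and* token match.
fn upload_matches(assigned: &Task, uploaded: &Task) -> bool {
    assigned == uploaded
}
```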

kellpossible self-assigned this Jul 17, 2021
@kellpossible

> there's a chance verifiers/contributors would be able to upload the old work after the round reset.

But with the coordinator state write lock obtained during the round reset, and the same lock obtained for reading during contribution upload, it seems like that shouldn't happen. Perhaps these aren't actually the same lock at the moment, due to the state being duplicated in the cache; removing the cache might also fix this problem without a serious performance impact.

@kellpossible

kellpossible commented Jul 17, 2021

Okay, I had a play with eliminating the cache. It doesn't appear to have a noticeable effect on performance (there are no doubt some extra clones in there, but compared to the time the coordinator spends waiting during a round they are small). Whether or not it fixes the problem entirely I don't know; the locks on the coordinator state are still not held for the duration of the operation. At least without the cache it is less likely to be using outdated state (which was previously only updated maybe once every 10 seconds). Perhaps I can modify the operator to lock the coordinator state for the duration of the operation. I'll keep exploring this solution space, but @ibaryshnikov's suggestion also deserves attention.

@kellpossible

kellpossible commented Jul 17, 2021

Implementing a unique id for tasks is no small task itself! It alters task equality, serialization/deserialization, and the basic definition of a task. It is starting to seem like the correct solution, but I'll probably break a lot of stuff with this!

@kellpossible

kellpossible commented Jul 17, 2021

@ibaryshnikov what do you think? We could name contribution files contribution_{contribution_id}_{id}.(verified|unverified)(.signature), where id is a randomly generated uuid created when the task starts, or when tasks are duplicated. One tricky part will be ensuring that every situation where tasks are duplicated for the restarted round (instead of just cloned) receives a new id.
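
A minimal sketch of that naming scheme, assuming the uuid crate; the function and the way the id is threaded through are assumptions, not existing coordinator code:

```rust
use uuid::Uuid;

/// Hypothetical locator name for a contribution file. The per-task uuid
/// (e.g. Uuid::new_v4(), generated when the task is created or duplicated)
/// means files from a pre-reset task can never collide with files for the
/// task reissued after a reset.
/// Example output: contribution_0_550e8400-e29b-41d4-a716-446655440000.verified.signature
fn contribution_file_name(contribution_id: u64, task_id: &Uuid, verified: bool, signature: bool) -> String {
    let state = if verified { "verified" } else { "unverified" };
    let suffix = if signature { ".signature" } else { "" };
    format!("contribution_{}_{}.{}{}", contribution_id, task_id, state, suffix)
}
```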

@kellpossible

kellpossible commented Jul 17, 2021

It really is a big change; everywhere there is the assumption that a task can be reconstructed using only the contribution id and chunk id.

Edit: this is a huge change, it touches practically every part of the coordinator, and in a non-trivial way. E.g. in a Round method, how do we obtain the unique id for a task in the previous round? Currently it can construct locators for tasks from the previous round without worry. This is just one example; there are hundreds.

See the number of build errors on the fix/304-round-restart-fail-task-id branch; for many of them I somehow need to source the task id from somewhere...

Perhaps there is another, less disruptive way to approach this problem.

@kellpossible

kellpossible commented Jul 17, 2021

I'm starting to think that essentially rendering the coordinator server single threaded in the worst case, with a lock over the coordinator state held for the entire duration of each operation, would be easier and probably less disruptive. The aleo-setup-coordinator Operator methods could be updated to require a lock (in effect a lock on the coordinator/round state) that can be shared across multiple methods to achieve an atomic operation. We could add read/write differentiation for the lock so that parallel read-only operations at least don't block each other.

The system requires fundamental changes to make it stable for completely parallel asynchronous operations, so I'm thinking it's probably better to work around this with a lock. I'm hoping it will still be possible to allow contributions to be uploaded in parallel with this solution, but the act of deciding whether to save a contribution to a file or return an error will be blocking/synchronous.
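
A minimal sketch of that shape, assuming tokio's RwLock; the Operator methods and state fields here are illustrative, not the existing aleo-setup-coordinator API:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct CoordinatorState { /* round state, task assignments, ... */ }

struct Operator {
    state: Arc<RwLock<CoordinatorState>>,
}

impl Operator {
    /// Hypothetical upload handler: the write guard is taken once and passed
    /// to every step, so the authorization check and the decision to accept
    /// the file happen atomically with respect to a round reset.
    async fn upload_challenge(&self, participant: &str) -> Result<(), String> {
        let mut state = self.state.write().await;
        self.check_authorized(&state, participant)?;
        self.record_upload(&mut state, participant);
        Ok(())
    }

    fn check_authorized(&self, _state: &CoordinatorState, _participant: &str) -> Result<(), String> {
        // ... consult the (locked) state
        Ok(())
    }

    fn record_upload(&self, _state: &mut CoordinatorState, _participant: &str) {
        // ... mutate the (locked) state
    }
}
```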

@kellpossible

@AleoHQ/aleo-setup what do we think about this?

@kellpossible

kellpossible commented Jul 18, 2021

At the moment we've got locks within locks: Operator contains an RwLock on Coordinator, and Coordinator contains RwLocks on Storage and CoordinatorState. It's difficult for the user of Coordinator to reason about the mutability effects of calling its methods. I think this issue demonstrates that those concerns can affect the user, and it may be worthwhile exposing the mutability concerns of Coordinator and removing the internal locks. This makes Coordinator a little bit trickier to use, but we already have it wrapped in an RwLock, so I don't think it will be much different, and it will simplify/clarify the situation.
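
A minimal sketch of that direction, with the interior locks removed so that mutability shows up in the method signatures; the types and methods here are illustrative, not the actual definitions in the crates:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct Storage { /* ... */ }
struct CoordinatorState { /* ... */ }

/// Hypothetical Coordinator with no internal locks: anything that mutates
/// state takes `&mut self`, so callers can see which calls need exclusive access.
struct Coordinator {
    storage: Storage,
    state: CoordinatorState,
}

impl Coordinator {
    fn try_contribute(&mut self, _chunk_id: u64) { /* mutates storage and state */ }
    fn current_round_height(&self) -> u64 { 0 } // read-only
}

/// The single outer lock lives with the caller (e.g. the Operator).
struct Operator {
    coordinator: Arc<RwLock<Coordinator>>,
}
```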

@kellpossible

I've found another problem: files are created during Round::try_lock_chunk via Storage::initialize, and these files are probably not removed during the round reset. The round reset probably needs to be updated to try removing these files for any locked chunks. Alternatively, we could alter the code to initialize these locators when the contributions actually occur, rather than creating empty files.
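
A minimal sketch of the first option, assuming the reset knows which chunks were locked; the paths and contribution id here are illustrative only:

```rust
use std::fs;
use std::path::Path;

/// Sketch: during a round reset, also remove the locator files that were
/// created when chunks were locked (Round::try_lock_chunk / Storage::initialize),
/// so the restarted round doesn't find stale, empty files.
fn remove_lock_initialized_files(base: &Path, round: u64, locked_chunks: &[u64]) {
    for &chunk_id in locked_chunks {
        let dir = base.join(format!("round_{}/chunk_{}", round, chunk_id));
        for name in ["contribution_1.unverified", "contribution_1.unverified.signature"] {
            // Ignore "not found" errors: the file may never have been created.
            let _ = fs::remove_file(dir.join(name));
        }
    }
}
```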

@kellpossible

As a workaround I tried disabling the storage initialization in Storage::try_lock_chunk. It appears that the locator is expected to be initialized in storage before it can be read during Coordinator::try_contribute: even though the file exists on disk after being uploaded, it has not been registered in the storage manifest, so the read fails. Performing the initialization in try_contribute would conflict with the file that is already on disk. Perhaps we could add a new initialization method that adopts a file which already exists.
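
A minimal sketch of that idea, with a hypothetical initialize_existing alongside the normal initialize; the manifest shape and method names are assumptions, not the actual Storage API:

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Very simplified stand-in for the storage manifest.
struct Storage {
    manifest: HashSet<PathBuf>,
}

impl Storage {
    /// Normal initialization: register the locator and create an empty file.
    fn initialize(&mut self, locator: &Path) -> std::io::Result<()> {
        std::fs::File::create(locator)?;
        self.manifest.insert(locator.to_path_buf());
        Ok(())
    }

    /// Hypothetical variant: adopt a file that was already written to disk
    /// (e.g. by an upload) by registering it in the manifest without touching it.
    fn initialize_existing(&mut self, locator: &Path) -> std::io::Result<()> {
        if !locator.exists() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::NotFound,
                "cannot adopt a locator that does not exist on disk",
            ));
        }
        self.manifest.insert(locator.to_path_buf());
        Ok(())
    }
}
```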

kellpossible added a commit that referenced this issue Jul 24, 2021
+ Remove contributions/verification files that were created when locks were
  taken and the locators initialized.
+ Some more tracing instrumentation.
+ Remove some unnecessary inlines.
kellpossible linked a pull request Jul 24, 2021 that will close this issue
@kellpossible

In #315 I ended up taking the approach of deleting, during the reset, the files that were created when the lock was taken.

apruden2008 added a commit that referenced this issue Jul 26, 2021