Manual round restart fails, verifier uploads response during the round reset #304
Comments
Perhaps this has something to do with the caching of the state...!
Right now the task is identified by chunk_id and contribution_id. If you reset the round and reassign the tasks, the old tasks may overlap with the new tasks, and there's a chance verifiers/contributors would be able to upload the old work after the round reset. Therefore it's not sound.
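A minimal sketch of why this is unsound, assuming the task identity really is just those two fields (the struct below is illustrative, not the coordinator's actual `Task` type): a task issued before the reset and one issued after it are indistinguishable.

```rust
/// Illustrative only: a task identified solely by chunk and contribution.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Task {
    chunk_id: u64,
    contribution_id: u64,
}

fn main() {
    // Task assigned before the round reset.
    let before_reset = Task { chunk_id: 3, contribution_id: 1 };
    // The same coordinates reassigned after the reset.
    let after_reset = Task { chunk_id: 3, contribution_id: 1 };

    // The coordinator cannot tell these apart, so a stale upload for the
    // old task looks identical to an upload for the new one.
    assert_eq!(before_reset, after_reset);
}
```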
But with the coordinator state write lock obtained during the round reset, and a read lock on the same state obtained during contribution upload, it seems like that shouldn't happen. Perhaps these aren't currently the same lock because the state is duplicated in the cache, and removing the cache would also fix this problem without a serious performance impact.
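To illustrate the suspicion, here is a rough sketch (all of these types are stand-ins, not the real operator/coordinator types) of how a periodically refreshed copy of the state can accept an upload that the authoritative state, behind the lock the reset takes, would reject:

```rust
use std::sync::{Arc, RwLock};

// Stand-in for the coordinator state; not the real type.
#[derive(Clone)]
struct State {
    round_verifiers: Vec<String>,
}

fn main() {
    let authoritative = Arc::new(RwLock::new(State {
        round_verifiers: vec!["verifier-1".to_string()],
    }));

    // A cached snapshot, refreshed only every ~10 seconds in the operator.
    let cached = authoritative.read().unwrap().clone();

    // Round reset: the write lock is taken and the verifier is removed.
    authoritative.write().unwrap().round_verifiers.clear();

    // An upload arriving now, checked against the stale cache, is accepted
    // even though the authoritative state no longer contains the verifier.
    let allowed = cached.round_verifiers.iter().any(|v| v == "verifier-1");
    assert!(allowed); // stale decision
    assert!(authoritative.read().unwrap().round_verifiers.is_empty());
}
```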
Okay, I had a play with eliminating the cache. It doesn't appear to have a noticeable effect on performance (there are no doubt some extra clones in there, but compared to the time the coordinator spends waiting during a round they seem small); whether or not it fixes the problem entirely I don't know. The locks on the coordinator state are still not held for the duration of the operation, but at least without the cache it is less likely to be using outdated state (which was previously only updated about once every 10 seconds). Perhaps I can modify the operator to lock the coordinator state for the duration of the operation. I'll keep exploring this solution space, but @ibaryshnikov 's suggestion also deserves attention.
Implementing a unique id for tasks is no small task itself! It alters the equality of tasks, their serialization/deserialization, and the basic definition of a task. It is starting to seem like the correct solution, but I'll probably break a lot of stuff with this!
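For the sake of discussion, a task with a unique identity might look roughly like this (field names are hypothetical, and it could equally be a dedicated id rather than the round height), which is exactly why equality and serde both change:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical sketch: a task that carries enough identity to distinguish
/// a task issued before a reset from one issued after it. Not the
/// coordinator's actual type.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
struct Task {
    /// The round this task was issued in; a reset changes this (it could
    /// instead be a reset counter or a generated unique id), so stale
    /// uploads no longer compare equal to freshly assigned tasks.
    round_height: u64,
    chunk_id: u64,
    contribution_id: u64,
}
```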
@ibaryshnikov what do you think we could name contribution files?
It really is a big change; everywhere there is the assumption that a task can be reconstructed using only the contribution id and chunk id. Edit: this is a huge change, it touches practically every part of the coordinator, and in a non-trivial way. E.g. in … see the number of build errors there are with the fix/304-round-restart-fail-task-id branch; for many of them I somehow need to source the task id from somewhere... Perhaps there is another, less disruptive way to approach this problem.
I'm starting to think that essentially rendering the coordinator server single-threaded in the worst case, with a lock over the coordinator state held for the entire duration of each operation, would be easier, and probably less disruptive. The system requires fundamental changes to make it stable for completely parallel asynchronous operations, so it's probably better to work around this with a lock. I'm hoping it will still be possible to allow contributions to be uploaded in parallel with this solution, but the act of deciding whether to save it to a file or return an error will be blocking/synchronous.
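Very roughly, the shape I have in mind looks something like this (all names here are placeholders, not the actual coordinator API): request bodies can still be received in parallel, but the decision to accept or reject is made while a single lock over the state is held, so a round reset cannot interleave with it.

```rust
use std::sync::{Arc, RwLock};

// Placeholder stand-ins for the coordinator's state and an upload request.
struct CoordinatorState { /* ... */ }
struct Upload { chunk_id: u64, contribution_id: u64, body: Vec<u8> }

impl CoordinatorState {
    // Placeholder check: is this upload still expected in the current round?
    fn accepts(&self, _chunk_id: u64, _contribution_id: u64) -> bool { true }
    fn record_upload(&mut self, _chunk_id: u64, _contribution_id: u64) {}
}

fn handle_upload(state: Arc<RwLock<CoordinatorState>>, upload: Upload) -> Result<(), String> {
    // Receiving/parsing the body can happen concurrently across requests...
    let body = upload.body;

    // ...but deciding whether to accept it, and updating the state, happens
    // while the write lock is held for the whole decision.
    let mut guard = state.write().expect("coordinator state lock poisoned");
    if !guard.accepts(upload.chunk_id, upload.contribution_id) {
        return Err("upload rejected: task no longer assigned".to_string());
    }
    guard.record_upload(upload.chunk_id, upload.contribution_id);
    drop(guard);

    // Persisting the accepted file to disk would follow here.
    let _ = body;
    Ok(())
}

fn main() {
    let state = Arc::new(RwLock::new(CoordinatorState {}));
    let upload = Upload { chunk_id: 0, contribution_id: 1, body: vec![] };
    handle_upload(state, upload).unwrap();
}
```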
@AleoHQ/aleo-setup what do we think about this? |
At the moment we've got locks within locks: Operator contains an RwLock on Coordinator, and Coordinator contains RwLocks on Storage and CoordinatorState. It's difficult for the user of Coordinator to reason about the mutability effects of calling its methods. I think this issue demonstrates that those concerns can affect the user, and it may be worthwhile exposing the mutability concerns of Coordinator and removing the internal locks. This makes Coordinator a little bit trickier to use, but we already have it wrapped in an RwLock, so it won't be much different I don't think, and it will simplify/clarify the situation.
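As a sketch of the two shapes being compared (placeholder types, not the real ones): today the locks are nested inside Coordinator; the alternative is to expose mutability through &self/&mut self and let the single outer RwLock in Operator do the locking.

```rust
use std::sync::RwLock;

// Placeholder types standing in for the real ones.
struct Storage;
struct CoordinatorState;

// Current shape (roughly): locks nested inside the Coordinator, so callers
// cannot see from a method signature which state it mutates.
#[allow(dead_code)]
struct CoordinatorWithInternalLocks {
    storage: RwLock<Storage>,
    state: RwLock<CoordinatorState>,
}

// Proposed shape: no internal locks; mutability is visible in the method
// signatures, and the caller wraps the whole Coordinator in one RwLock.
#[allow(dead_code)]
struct Coordinator {
    storage: Storage,
    state: CoordinatorState,
}

impl Coordinator {
    // `&mut self` makes it explicit that this operation mutates state.
    fn reset_round(&mut self) { /* ... */ }
    // `&self` makes it explicit that this one does not.
    fn current_round_height(&self) -> u64 { 0 }
}

struct Operator {
    coordinator: RwLock<Coordinator>,
}

fn main() {
    let operator = Operator {
        coordinator: RwLock::new(Coordinator { storage: Storage, state: CoordinatorState }),
    };
    // The caller now chooses explicitly: write lock for mutating operations...
    operator.coordinator.write().unwrap().reset_round();
    // ...read lock for read-only ones.
    let _height = operator.coordinator.read().unwrap().current_round_height();
}
```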
I've found another problem: files are created during …
As a workaround I tried disabling the storage initialization in …
In #315 I ended up taking the approach of deleting the files that get created, inside the lock during the reset.
Currently the verifiers first upload their verifications
/v1/upload/challenge/...
and then try_verify
/v1/verifier/try_verify/...
to perform verification. Verifiers seem to be able to upload verifications even during the round reset (#288) process, even though they have technically already been removed from the round (but not, seemingly, in the cached state in the operator in aleo-setup-coordinator). This is probably a serious problem, and seems to be causing problems in the restarted round when these files are unexpectedly already present. This behaviour may also be present with contributors, and could also explain some of the problems encountered with implementing the replacement contributors.

This behaviour can be seen on line 50570 in the log:
coordinator.log
Shortly after that there is another upload that now fails because the participant has now been removed. I'm not sure why there isn't a try_verify between these two, or perhaps one of these log lines is from try_verify and the other is from the other verifier... It warrants more investigation.
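For reference, the verifier-side sequence described above, sketched with the reqwest blocking client; the host, the full paths (elided above with ...), and the payload are placeholders, not the real endpoints.

```rust
// Rough sketch of the two-step verifier flow: upload the verification file,
// then ask the coordinator to run try_verify for that task.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let base = "http://localhost:9000"; // placeholder host

    // Step 1: upload the verification (challenge file) for the assigned task.
    client
        .post(format!("{}/v1/upload/challenge/0/1", base)) // path is illustrative
        .body(vec![0u8; 16]) // placeholder file contents
        .send()?
        .error_for_status()?;

    // Step 2: request verification of that task.
    client
        .post(format!("{}/v1/verifier/try_verify/0/1", base)) // path is illustrative
        .send()?
        .error_for_status()?;

    // If a round reset removes this verifier before or between these calls,
    // step 1 can still leave a file behind, which is the problem reported here.
    Ok(())
}
```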