-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Refactor phase1-coordinator's coordinator.rs, coordinator_state.rs #267
Comments
Thanks for this @kellpossible. Very clearly stated. Agree with your proposed course of action, I and I think we should also discuss in more detail during our setup standup call. @raychu86 and @howardwu who were authors of the original codebase may also have comments about your analysis or the original motivation for some of the earlier architectural decisions. Would be great to get your take as well. |
I just want to add that this is not a criticism of the work from @raychu86 and @howardwu. It's entirely possible that the current implementation is this way simply due to inherent complexities of the problem domain that I have not yet completely understood, and that an improvement will be very difficult to come by. It's also much easier to make comments about a design in hindsight than it is to undertake one at the beginning of a project. I still feel like we are on a fine balance between it being worthwhile to undertake this refactor vs just hacking our way to the finish line. It's not my call. I do however think it would be a good idea for us to continue to allocate effort to document the existing design and requirements of the software so we can make a better assessment, and even if we do not choose to make the change this time, the documentation will be there for next time, and for any future contributors to understand how the project works and why it is the way it is. |
I have previous experience undertaking refactors of similar scale or larger in my last job, and when we properly analysed and documented the existing design and requirements, it was easy to make a choice that everyone was happy with, and the implementation went fairly smoothly. In some cases we rejected making any major changes, and the choice was logical and easier to be content with. |
Aleo Setup Refactoring Discussion
Discussion about proposed changes in #267 Proposal: Refactor phase1-coordinator's coordinator.rs, coordinator_state.rs. Agenda
MotivationsSituation
What we like with the current design.
Time Available
Proposed RefactorsMultiple Task Queues
Chunk Locking
Reduce Task Decision Complexity in Clients
Side Effects and Global State
Storage Layer
|
Currently we are stalling on #256, which appears on the surface to be a bug fix, but in reality is continuing the implementation of the replacement contributors (#260) now that we are attempting to run the feature through integration testing. During the course of this work, some problems with the current software design were highlighted. I've held off on stating this kind of feedback for a while, everyone has something to complain about when working on a legacy project, but now that we are quite a few months into this, which was supposed to have been implemented in a few weeks, and we have a better understanding of the scope of the project, perhaps it is the correct time to highlight some technical debt which may be impeding progress:
Side Effects and Global State
There are many unexpected side-effects hidden in methods, making it difficult to reason about changes and causes to bugs. Prime among these is the global state of the
Storage
layer, which can be modified at any point incoordinator
, and which is also modified by the web server hosting the coordinator. Modification of the task queues in different places is also a great source of complexity and potential problems.What modification of state is required to drop a contributor? The answer to this question is spread out among at least 3 files. The round state is handed separately and in different places to the storage layer, which allows them to easily get out of sync, causing confusing problems. A more functional style approach separating business logic calculations from the performing of important actions with effects would make the code a lot easier to reason about. All modification of important state should be concentrated to one place where it is easy to test, and easier to reason about.
Multiple Task Queues
Currently we have multiple queues for tasks, per participant, and assigned tasks get moved between them at different points during the ceremony, it's possible for tasks to go missing or get duplicated between queues. Each queue actually represents the state of a given task. This is a source of confusion and bugs while working on the
coordinator_state.rs
module.We could perhaps instead store the tasks in a single master list, however we still need to consider concurrent read/write access and iteration/cache coherence performance implications. There is currently a maximum of about 40000 tasks (universal setup, 4096 chunks * 5 participants * 2 (contribution + verification)). At the moment there is no concurrent read/write access, as the entire coordinator state is behind an
RwLock
. If traversing the entire task list is a problem then tasks could be maintained in per-participant buckets, perhaps using something likedashmap
to allow bucket independent mutability, but still eliminating the separate queues for task state.Chunk Locking
Currently, from my understanding, the external API for the locking of chunks does not appear to be inherently needed to perform the requirements of the of the coordinator. It appears to be an artefact of the implementation, which introduces unnecessary complexity. If the coordinator is deciding which tasks that contributors and verifiers should perform at a given moment in time, there should be no need to expose this in the frontend. Internally there may be some use to locking chunks for verifiers, but perhaps there may be other simpler or more reliable ways to design this.
"Unsafe" Methods
There are a lot of concerning TODO's and concerning "unsafe" (not in the Rust sense) labelled methods, which also might be an indication of attempts to work around fundamental inadequacies of the design for what we are trying to do. My feeling is that we should not be feeling the need to label methods as unsafe if we structure the design of the code such that they are only called when it is perfectly sensible to do so (again, concentrating important changes to global state to a single place).
Storage Layer
The storage layer appears to be built around an abstraction that allows the use of some third party storage layer which is not a file system (some kind of file upload service, or database). This abstraction was never put to use, however it's probably too core to the system to remove, and it might still serve its purpose in the future. The important thing to note is that the web server has access to this layer, and is expected to place contribution files there, this could be a source of problems, changing global state under the feet of the coordinator logic. It could be better to have the contribution files passed directly in via the contribution API methods, and have the coordinator assign them to storage.
There is currently no performance benefit to having some third party service place the files in storage because the coordinator needs to download the files anyway to perform aggregation. The same with obtaining files to serve them, this could also be performed via the API, rather than some backdoor file system access.
The one problem I can see with such an approach is that if we want to distribute aggregation (a bit of a major bottleneck for universal setup), there is no longer a requirement to having the actual files stored with the coordinator. So as an alternative, at least we could have the locator to the files passed in via these methods rather than assuming that they have been put in the correct place. This would require less change to the current design. However having files not stored on the coordinator might make it more complicated to perform a backup.
What to do?
For context, in total there is currently about 5000 total lines of code in
coordinator.rs
andcoordinator_state.rs
, including unit tests. I have started work on a detailed analysis of the requirements for this code in #266 which should also help estimate how long we think a major refactor might take. My personal take is that the problem at hand is not super complicated, in some ways the project has been over-engineered, I'm going to guess maybe 1 month with 1 developer to rewrite these modules.Such changes always take longer than expected, and we should probably increase our worst case estimates. We should also be careful not to suffer from the sunk-cost fallacy. The time spent so far with the project was not a waste, because it gained us valuable insight into the requirements and the problem domain, enabling the production of a much better design than if we had decided to make major changes when @ibaryshnikov and I were first brought on to the project. We also now have a much more comprehensive set of tests, including integration tests.
I'm still going to spend the next couple of days attempting to complete #256, however if it continues to highlight problems, this proposition would be my vote. Our lack of confidence in the code-base still might come back to bite us if we get problems in production or need to implement more features.
The text was updated successfully, but these errors were encountered: