Proposal: Refactor phase1-coordinator's coordinator.rs, coordinator_state.rs #267

Open
kellpossible opened this issue May 16, 2021 · 4 comments
Labels: Epic, refactor (Changes which affect the maintainability of the code)

Comments

kellpossible commented May 16, 2021

Currently we are stalling on #256, which appears on the surface to be a bug fix, but in reality is a continuation of the replacement contributors implementation (#260) now that we are attempting to run the feature through integration testing. During the course of this work, some problems with the current software design were highlighted. I've held off on giving this kind of feedback for a while (everyone has something to complain about when working on a legacy project), but now that we are quite a few months into a project that was supposed to take a few weeks, and we have a better understanding of its scope, perhaps it is the right time to highlight some technical debt which may be impeding progress:

Side Effects and Global State

There are many unexpected side effects hidden in methods, making it difficult to reason about changes and about the causes of bugs. Prime among these is the global state of the Storage layer, which can be modified at any point in the coordinator, and which is also modified by the web server hosting the coordinator. Modification of the task queues in different places is another significant source of complexity and potential problems.

What modification of state is required to drop a contributor? The answer to this question is spread out among at least 3 files. The round state is handled separately, and in different places, from the storage layer, which allows the two to easily get out of sync, causing confusing problems. A more functional style, separating business logic calculations from the performing of important actions with effects, would make the code a lot easier to reason about. All modification of important state should be concentrated in one place where it is easy to test and easier to reason about.
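
As a rough illustration of the functional split described above, here is a minimal sketch. The `Effect`, `CeremonyState` and `Storage` types and the function names are hypothetical, not taken from the current code: a pure function computes the effects a contributor drop requires, and a single apply step performs them.

```rust
/// Hypothetical effect type: a description of a state change, not the change itself.
#[derive(Debug, PartialEq)]
enum Effect {
    RemoveFromRound { participant_id: String },
    ReassignTask { chunk_id: u64, to: String },
    DeleteContributionFile { chunk_id: u64 },
}

/// Placeholder for round state, task assignments, replacement contributors, etc.
struct CeremonyState {}

/// Placeholder storage abstraction.
trait Storage {
    fn remove(&mut self, key: &str);
}

/// Pure "business logic": inspects state only, performs no I/O and takes no
/// locks, so it can be unit tested exhaustively.
fn plan_drop_contributor(state: &CeremonyState, participant_id: &str) -> Vec<Effect> {
    let _ = state;
    // ... compute reassignments and file deletions from `state` alone
    vec![Effect::RemoveFromRound {
        participant_id: participant_id.to_string(),
    }]
}

/// The single place where round state and storage are actually mutated.
fn apply_effects(state: &mut CeremonyState, storage: &mut dyn Storage, effects: Vec<Effect>) {
    let _ = state;
    for effect in effects {
        match effect {
            Effect::RemoveFromRound { .. } => { /* mutate round state here */ }
            Effect::ReassignTask { .. } => { /* mutate task assignments here */ }
            Effect::DeleteContributionFile { chunk_id } => {
                storage.remove(&format!("chunk_{}", chunk_id));
            }
        }
    }
}
```

With a split like this, a unit test for dropping a contributor only needs to assert on the returned list of effects, without touching storage or the web server.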

Multiple Task Queues

Currently we have multiple queues for tasks per participant, and assigned tasks get moved between them at different points during the ceremony; it's possible for tasks to go missing or to be duplicated between queues. Each queue actually represents the state of a given task. This is a source of confusion and bugs while working on the coordinator_state.rs module.

We could perhaps instead store the tasks in a single master list; however, we would still need to consider concurrent read/write access and iteration/cache coherence performance implications. There is currently a maximum of about 40,000 tasks (universal setup: 4096 chunks * 5 participants * 2 (contribution + verification)). At the moment there is no concurrent read/write access, as the entire coordinator state is behind an RwLock. If traversing the entire task list becomes a problem, tasks could be maintained in per-participant buckets, perhaps using something like dashmap to allow independent mutability per bucket, while still eliminating the separate queues for task state.
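
For illustration only, here is a minimal sketch of what a single task table could look like, with a status field replacing the separate queues. The type and field names are made up for this example and are not the existing types in coordinator_state.rs.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct TaskId {
    chunk_id: u64,
    contribution_id: u64,
}

#[derive(Clone, Copy, PartialEq)]
enum TaskStatus {
    Pending,
    Assigned,
    Completed,
    Disposed,
}

#[derive(Clone)]
struct Task {
    assignee: String,
    status: TaskStatus,
}

/// One master list: "moving a task between queues" becomes a status update,
/// so a task can never be missing from every queue or present in two.
struct TaskTable {
    tasks: HashMap<TaskId, Task>,
}

impl TaskTable {
    /// Returns `None` if the task does not exist, rather than silently losing it.
    fn set_status(&mut self, id: TaskId, status: TaskStatus) -> Option<()> {
        self.tasks.get_mut(&id).map(|task| task.status = status)
    }

    /// Per-participant views are derived on demand instead of stored separately.
    fn assigned_to<'a>(
        &'a self,
        participant: &'a str,
    ) -> impl Iterator<Item = (&'a TaskId, &'a Task)> {
        self.tasks
            .iter()
            .filter(move |(_, t)| t.assignee == participant && t.status == TaskStatus::Assigned)
    }
}
```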

Chunk Locking

Currently, from my understanding, the external API for locking chunks does not appear to be inherently needed to fulfil the requirements of the coordinator. It appears to be an artefact of the implementation, which introduces unnecessary complexity. If the coordinator decides which tasks contributors and verifiers should perform at a given moment in time, there should be no need to expose this in the frontend. Internally there may be some use in locking chunks for verifiers, but there may be simpler or more reliable ways to design this.
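
As a sketch of the kind of interface this could shrink to (all names here are hypothetical, not the current API), the coordinator could expose only "give me my next task" and "here is the result", keeping any locking as an internal detail:

```rust
/// Hypothetical task handed out by the coordinator.
struct Task {
    chunk_id: u64,
    contribution_id: u64,
}

enum CoordinatorError {
    UnknownParticipant,
}

/// Placeholder for the internal task table, round state, etc.
struct Coordinator {}

impl Coordinator {
    /// Replaces external `lock_chunk`-style calls: only the coordinator
    /// decides what a participant works on next.
    fn next_task(&mut self, participant_id: &str) -> Result<Option<Task>, CoordinatorError> {
        let _ = participant_id;
        // ... pick from the internal task table; `None` means "nothing to do yet"
        Ok(None)
    }

    /// The participant reports finished work; task state and storage are
    /// updated in one place, with no lock/unlock round trips from clients.
    fn submit_task(
        &mut self,
        participant_id: &str,
        task: Task,
        contribution: Vec<u8>,
    ) -> Result<(), CoordinatorError> {
        let _ = (participant_id, task, contribution);
        Ok(())
    }
}
```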

"Unsafe" Methods

There are a lot of concerning TODOs and methods labelled "unsafe" (not in the Rust sense), which may be an indication of attempts to work around fundamental inadequacies of the design for what we are trying to do. My feeling is that we should not need to label methods as unsafe if we structure the code such that they are only called when it is perfectly sensible to do so (again, by concentrating important changes to global state in a single place).

Storage Layer

The storage layer appears to be built around an abstraction that would allow the use of some third-party storage backend which is not a file system (some kind of file upload service, or a database). This abstraction was never put to use; however, it's probably too central to the system to remove, and it might still serve its purpose in the future. The important thing to note is that the web server has access to this layer and is expected to place contribution files there. This could be a source of problems, changing global state under the feet of the coordinator logic. It could be better to have the contribution files passed in directly via the contribution API methods, and have the coordinator assign them to storage.

There is currently no performance benefit to having some third-party service place the files in storage, because the coordinator needs to download the files anyway to perform aggregation. The same goes for obtaining files to serve them: this could also be performed via the API, rather than through some backdoor file system access.

The one problem I can see with such an approach is that if we want to distribute aggregation (a major bottleneck for the universal setup), there is no longer a requirement to have the actual files stored with the coordinator. So as an alternative, we could at least have a locator for the files passed in via these methods rather than assuming that they have been put in the correct place. This would require less change to the current design. However, having files not stored on the coordinator might make it more complicated to perform a backup.
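
To make the two options concrete, here is a minimal sketch, assuming hypothetical `ContributionData` and `Storage` types, of a contribution API that either receives the file bytes directly or receives a locator. Neither shape exists in the current code; this is only an illustration of the idea above.

```rust
/// Hypothetical payload for a contribution API call.
enum ContributionData {
    /// Option 1: the raw file is passed in and the coordinator decides
    /// where it lives in storage.
    Bytes(Vec<u8>),
    /// Option 2: only a locator is passed in (e.g. a URL or object key),
    /// which keeps the door open for distributed aggregation.
    Locator(String),
}

/// Placeholder storage abstraction.
trait Storage {
    fn write(&mut self, key: &str, data: &[u8]) -> Result<(), String>;
}

struct Coordinator<S: Storage> {
    storage: S,
}

impl<S: Storage> Coordinator<S> {
    /// The only path by which contribution files enter storage: the
    /// coordinator assigns the location itself, so its view of storage
    /// cannot be changed underneath it by the web server.
    fn add_contribution(&mut self, chunk_id: u64, data: ContributionData) -> Result<(), String> {
        match data {
            ContributionData::Bytes(bytes) => {
                let key = format!("round/chunk_{}/contribution", chunk_id);
                self.storage.write(&key, &bytes)
            }
            ContributionData::Locator(locator) => {
                // Record the locator; fetch lazily when aggregation needs the file.
                let _ = locator;
                Ok(())
            }
        }
    }
}
```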

What to do?

For context, there are currently about 5000 lines of code in coordinator.rs and coordinator_state.rs, including unit tests. I have started work on a detailed analysis of the requirements for this code in #266, which should also help estimate how long a major refactor might take. My personal take is that the problem at hand is not super complicated; in some ways the project has been over-engineered. I'm going to guess maybe 1 month with 1 developer to rewrite these modules.

Such changes always take longer than expected, and we should probably increase our worst-case estimates. We should also be careful not to suffer from the sunk-cost fallacy. The time spent so far on the project was not a waste, because it gained us valuable insight into the requirements and the problem domain, enabling a much better design than if we had decided to make major changes when @ibaryshnikov and I were first brought onto the project. We also now have a much more comprehensive set of tests, including integration tests.

I'm still going to spend the next couple of days attempting to complete #256; however, if it continues to highlight problems, this proposal would have my vote. Our lack of confidence in the code-base might still come back to bite us if we hit problems in production or need to implement more features.

@kellpossible added the refactor label on May 16, 2021
@apruden2008

Thanks for this @kellpossible. Very clearly stated. I agree with your proposed course of action, and I think we should also discuss this in more detail during our setup standup call.

@raychu86 and @howardwu, who were the authors of the original codebase, may also have comments about your analysis or about the original motivation for some of the earlier architectural decisions. Would be great to get your take as well.

kellpossible commented May 18, 2021

I just want to add that this is not a criticism of the work by @raychu86 and @howardwu. It's entirely possible that the current implementation is the way it is simply due to inherent complexities of the problem domain that I have not yet completely understood, and that an improvement will be very difficult to come by. It's also much easier to comment on a design in hindsight than it is to produce one at the beginning of a project.

I still feel like we are on a fine balance between it being worthwhile to undertake this refactor versus just hacking our way to the finish line. It's not my call. I do, however, think it would be a good idea for us to continue to allocate effort to documenting the existing design and requirements of the software so we can make a better assessment. Even if we do not choose to make the change this time, the documentation will be there for next time, and for any future contributors to understand how the project works and why it is the way it is.

@kellpossible

I have previous experience undertaking refactors of a similar or larger scale in my last job, and when we properly analysed and documented the existing design and requirements, it was easy to make a choice that everyone was happy with, and the implementation went fairly smoothly. In some cases we rejected making any major changes, and that choice was logical and easy to be content with.

@kellpossible

Aleo Setup Refactoring Discussion

Discussion about the proposed changes in #267 (Proposal: Refactor phase1-coordinator's coordinator.rs, coordinator_state.rs).

Agenda

  • Add any new proposed refactors not already mentioned.
  • Time/resources available for refactors.
  • Discuss the impact of each proposed refactor.
  • Discuss implementation details of each proposed refactor.
  • Estimate the proposed refactors and rank them.
  • Decide what we would like to implement in the time available.

Motivations

Situation

  • Ilya and Luke were brought onto a project which was said to be almost complete, only requiring some tests and a few bug fixes. We are now 6 months in, still fixing bugs and trying not to break the code.
  • In effect, the drop-contributor/replacement design is a complex solution; in hindsight it would perhaps have been better to implement a simpler solution, had we known the state of the implementation earlier.

What we like about the current design

  • Restarting the coordinator is a great feature.
  • The integration test has been helpful.

Time Available

  • We don't have much time before the first setup, only a couple of weeks, so we want to prioritise the logic for restarting the round.
  • In 2 weeks we will run the universal setup; we will probably run as-is, hopefully with the round restart implemented.
  • In 1 month we will run the outer setup.
  • In about 2.5 months we will run the browser inner setup.

Proposed Refactors

Multiple Task Queues

  • It would be difficult to make this change while maintaining backwards compatibility with the serialized round state. If we decide we want the change once the ceremony is already running, it will be difficult.
  • We could make a branch for the ceremony that is already running so we don't need to make these changes backwards compatible.
  • This is a big change because it's the central data structure: roughly two weeks to implement, plus another two for review, times 2 for the things we didn't expect, so about 8 weeks.

Chunk Locking

  • Highest priority. Cost/Benefit is the highest, and within timeframe.
  • The state of a task is duplicated between the lock and the chunk state (i.e. which queue the task is in).
  • By the definition of initialize_tasks() in tasks.rs, which is run at the start of the round, participants cannot have duplicated tasks and therefore should not be working on the same task.
  • We seem to agree that chunk locks should not be necessary.
  • Maybe 2 weeks of work, mostly just deleting stuff.
  • Chunk locking does contribute to the complexity of the round restart.

Reduce Task Decision Complexity in Clients

  • Make the verifier and contributor simpler: each should only ask the coordinator for the next task to work on (see the sketch after this list).
  • Reliability checks will be a case study for this refactor.
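
As a rough sketch of what a "dumber" contributor could look like (the client trait and helper function below are hypothetical, not the current API), the only scheduling decision left on the client side is to poll the coordinator for the next task, consuming an interface like the one sketched under "Chunk Locking" above:

```rust
/// Hypothetical task as handed out by the coordinator.
struct Task {
    chunk_id: u64,
    contribution_id: u64,
}

/// Hypothetical client-side view of the coordinator API.
trait CoordinatorClient {
    fn next_task(&self) -> Result<Option<Task>, String>;
    fn submit(&self, task: &Task, contribution: Vec<u8>) -> Result<(), String>;
}

/// The contributor holds no scheduling logic of its own: ask, compute, submit.
fn contributor_loop(client: &dyn CoordinatorClient) -> Result<(), String> {
    loop {
        match client.next_task()? {
            Some(task) => {
                let contribution = compute_contribution(&task);
                client.submit(&task, contribution)?;
            }
            // Nothing to do yet: wait and ask again.
            None => std::thread::sleep(std::time::Duration::from_secs(5)),
        }
    }
}

/// Placeholder for the actual contribution computation.
fn compute_contribution(task: &Task) -> Vec<u8> {
    let _ = task;
    Vec::new()
}
```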

Side Effects and Global State

  • This will be helped a lot by reducing the complexity of chunk locking; the task queue changes should also alleviate this problem.

Storage Layer

  • Storage path sanitization in the aleo-setup-coordinator is a critical issue.
  • Think about this issue separately, perhaps in the future.
