Separate Realm from Legion #1781

Open
muraj opened this issue Oct 23, 2024 · 3 comments

@muraj

muraj commented Oct 23, 2024

This issue is meant to track and discuss different ways to structure Realm as a standalone library and to leverage NVIDIA CI for testing, following a discussion at the Legion meeting on 10/23/2024.

Currently, Realm lives within Legion in a sort of mixed mono-repo style system, which has worked well enough for a while since most of Realm's users are also Legion users. As Realm develops, it is gaining more and more direct users. This leads to the conclusion that Realm really needs to stand on its own as its own library.

Additionally, Legion CI's resources are not enough to capture all the configurations and implementation bugs that have arisen as a result of the large amount of engineering efforts on Realm. In order to manage this, several NVIDIA employees contributing to Realm would like to leverage NVIDIA resources to manage CI. In order to do so though, NVIDIA needs to own the code repository and limit users' access to the NVIDIA CI test machines. There are many ways to manage this:

  1. Move active development to an open-source, public GitHub repository under NVIDIA ownership. Contributors are welcome to submit pull requests (PRs), but ultimately an NVIDIA employee needs to approve running CI on the PR and approve merging it.
    1.a) This would be the best case scenario for the majority of active Realm developers, especially if we can whitelist contributors to trigger CI.

  2. Keep the gitlab / github repositories as they are, but move most of the active development to an NVIDIA-owned repository that will periodically make code drops to the gitlab/github mirrors.
    2.a) This is a major pain for devops, as each code drop will require work to reconcile changes from upstream with the code drop.

  3. Keep the gitlab / github repositories as they are, but move most of the active development to an NVIDIA-owned repository that will be used solely for NVIDIA CI, and changes must still go through the gitlab PR approval process.
    3.a) This is a major pain for development, in that it requires a large amount of our development to jump through extra hoops, which will significantly slow down the development process.
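
To make the CI gating in option 1 (with 1.a) concrete, here is a minimal sketch of the rule, not an actual CI configuration. The `PullRequest` type, the `may_run_ci` function, and all user names are hypothetical. It also assumes that PRs authored by NVIDIA employees can trigger CI without a separate approval, which is not stated above.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    author: str
    # Users who have approved running CI on this PR (hypothetical field).
    ci_approvals: set[str] = field(default_factory=set)


def may_run_ci(pr: PullRequest, employees: set[str], allowlist: set[str]) -> bool:
    """Return True if CI may run on this PR under option 1 with 1.a."""
    # Assumption: employees and allowlisted contributors trigger CI directly.
    if pr.author in employees or pr.author in allowlist:
        return True
    # Everyone else needs an explicit approval from an NVIDIA employee.
    return any(approver in employees for approver in pr.ci_approvals)
```

For example, with `employees = {"nv_dev"}` and `allowlist = {"trusted"}`, a PR from `"stranger"` is held until `"nv_dev"` approves it. An approval from `"trusted"` alone is not enough.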

@muraj muraj added Realm Issues pertaining to Realm Build Issues pertaining to build systems CI Issues pertaining to continuous integration labels Oct 23, 2024
@muraj muraj added this to the realm-25.02 milestone Oct 23, 2024
@muraj muraj self-assigned this Oct 23, 2024
@muraj
Author

muraj commented Oct 23, 2024

Mentioning @elliottslaughter @lightsighter @alexaiken for visibility. Feel free to add whoever might be interested.

@lightsighter
Contributor

First, I think there are actually three separate questions here:

  1. Should we separate Realm into its own repository?
  2. If we do separate Realm into its own repository, who owns the "canonical" version of the Realm repository?
  3. Should the day-to-day development of Realm take place in the "canonical" repository or can it occur in a fork?

Having thought about these questions quite a bit now, I'll go ahead and be the first one to put a stake in the ground. I'm going to try to balance several competing concerns in a way that I personally would find reasonable. Others may feel differently and I won't be offended.

With regards to the first question, I think the answer here is probably yes. I think it's time for Realm to step out of Legion's shadow and stand on its own. We do want to attract other users that might just want to use Realm and they shouldn't need to download all of Legion to be able to do that. We've already done the hardest part by keeping the Realm and Legion codes themselves separated. Separating build systems and tests won't be easy, but it's easier than separating code. This may make it more likely for Realm bugs to slip in while Realm is still under-tested, but that should incentivize better Realm testing. This will make our lives a bit harder on the Legion side of the world, but I think I'm ok with that if it gives the Realm team some more autonomy and visibility.

For the second question, it's my preference that the "canonical" Realm repository remain under the Legion github organization. Realm has been developed under the Legion github organization its whole life, and if something happens at NVIDIA I don't want the Legion project to lose control over Realm. In practice, I think for me this means four things:

  1. Legion will always only depend on commits in the Realm repository in the Legion github organization.
  2. We always encourage all users of Realm to download Realm's code only from the Legion github organization.
  3. When users want to file a bug against Realm, they make an issue in the Realm repo in the Legion github organization.
  4. Documentation for Realm will still live in the Legion github organization.

Other than that, I don't think there's anything else that must happen on the Realm repo in the Legion github organization. Planning, CI, development, etc can all happen somewhere else if so desired.

And perhaps you can now guess where that leaves me on the third question. I think we probably do want to allow the Realm team at NVIDIA to make a fork of the "canonical" repo and put it inside an NVIDIA-owned github organization to make use of NVIDIA's CI resources and have the day-to-day development take place there.

At least for me, I have a couple of preferences for how this works, some of which are more of a deal-breaker than others.

  1. I would like this fork to also remain open-source where the development takes place in the open (with the obvious exception for versions of Realm that are being developed for new NVIDIA hardware that hasn't been released). This isn't a deal-breaker but I think it will make it easier for other people to try out new Realm features before they are merged into the canonical repo.
  2. At least once a quarter there should be a concerted effort to commit code back to the canonical Realm repo in the Legion github organization, preferably before a Legion release so that Legion can try to update the version of Realm that it is using. As part of the release testing, we'll test a bunch of different things with the latest Realm code and see if it passes. If it does, Legion will bump its Realm commit that it depends on and that will go out in the release. If bugs are detected and they can't be fixed before the release, then Legion just won't change the Realm version it is using before the release and Realm's changes will slip to the next release. I'm flexible on how often Realm does these merges into the canonical repo, but we'll probably still do the Legion release testing just once a quarter unless there is an express need to do it sooner. We might also consider doing the Realm merges just after a Legion release to give us an entire quarter to find and fix issues, but that means all Realm users will always be at least three months behind Realm development.
  3. When Realm bugs are discovered during release testing, it will be the responsibility of the Realm team to find and fix the bugs. Legion users aren't going to do anything to help other than maybe providing instructions for how to build and run their applications on a particular machine. Realm's testing has to become good enough to warrant it being separated from Legion and Legion users can't bear the responsibility of helping to find and fix bugs if the day-to-day development for Realm is happening outside of the Legion github organization. This one is non-negotiable for me (since it also applies to me too 😇).
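
The release-time rule in item 2 can be sketched as a small decision function. This is only an illustration of the policy described above, not existing tooling. The names `next_realm_pin` and `releases_stuck`, the commit strings, and the test names are all hypothetical.

```python
def next_realm_pin(current_pin: str, candidate: str, test_results: dict[str, bool]) -> str:
    """Pick the Realm commit Legion depends on for the upcoming release.

    test_results maps a release-test name to whether it passed against
    the candidate Realm commit.
    """
    # Bump only if release testing actually ran and everything passed.
    if test_results and all(test_results.values()):
        return candidate
    # Otherwise Realm's changes slip to the next release.
    return current_pin


def releases_stuck(pins: list[str]) -> int:
    """Count how many of the most recent releases kept the previous pin."""
    stuck = 0
    for i in range(len(pins) - 1, 0, -1):
        if pins[i] != pins[i - 1]:
            break
        stuck += 1
    return stuck
```

With quarterly releases, `releases_stuck(["a", "b", "b", "b"])` returns 2, meaning Legion has shipped two further releases on the same Realm commit. An unchanged pin does not by itself prove a bug, since Realm may simply not have merged anything new.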

I'll note that the worst-case scenario here is that a bug is introduced into Realm that continues to manifest only in Legion applications and it goes several quarters without being found and fixed, so Legion ends up getting stuck on a 3, 6, 9, or 12 month old version of Realm. If we see signs of that occurring, we will probably need to revisit the conversation about the third question and whether we need to move day-to-day development of Realm back inside the Legion github organization so Legion users can help Realm test itself more robustly until it ramps up its own testing more.

I think this approach balances several different competing concerns in a practical way. It gives the Realm team both more autonomy and control over how they manage Realm, while at the same time placing more responsibility on them to rigorously test and maintain Realm since they won't have the crutch of Legion constantly finding and reporting all the things that break in Realm while it is under development. It allows us to make use of the NVIDIA CI resources without the Legion organization needing to give up ownership of Realm.

@apryakhin
Contributor

apryakhin commented Oct 24, 2024

Thanks @muraj for initiating the discussion and @lightsighter for the feedback. Seems like all the right questions have already been asked here.

> Should we separate Realm into its own repository?

This is a 'yes' for me, and my main objective is to attract new users. A larger user base generally increases the pressure on the runtime's robustness, which I believe will boost Realm's improvement in multiple areas (quality, features, etc.). One idea that's been discussed is to start organizing Realm's source, CMake/Make files, tests, and documents into a separate directory (which isn't the case today). When the time comes, separating it into its own repo would simply involve taking this directory out.

> If we do separate Realm into its own repository, who owns the "canonical" version of the Realm repository?

The answer to this question, in my opinion, should be considered alongside the next one. How will day-to-day development look if Realm's ownership changes? First, I want to understand whose call this is to make. If it's a collective decision, we should take steps to gather feedback from everyone, as I feel that very few people are aware of this initiative at the moment, or perhaps just too few truly care? Feedback from the retreat will be good, though perhaps we don't have to wait that long. If we don't have enough people at the Legion meeting for this, we should probably consider inviting them or just reaching out offline with a heads-up asking for feedback.

Personally, I have no objections to the ownership remaining with Legion. However, I do have objections about decisions that could complicate the DevOps side of Realm. The "two-repository hybrid approach" introduces some overhead but certainly gives the necessary middle ground. Taking advantage of NVIDIA CI resources would be another core objective here.

Lastly, I think it would be reasonable to document in detail how day-to-day development workflows will operate under the proposed "hybrid" approach (partially already summarized above), and have everyone sign off on it. If this approach ultimately hinders the Realm team's velocity or leads to a somewhat negative trajectory, I don't see why it couldn't be reconsidered in the future.

> Realm's testing has to become good enough to warrant it being separated from Legion and Legion users can't bear the responsibility of helping to find and fix bugs if the day-to-day development for Realm is happening outside of the Legion github organization. This one is non-negotiable for me.

Yes, I keep repeating the same thing in a loop. This is my biggest concern regarding the whole 'standalone Realm' initiative. We've already defined the testing milestones and made a collective effort, but the required velocity just isn't there yet. Root cause analysis, bug fixes, and the 'occasional' higher-priority feature work oftentimes consume significant bandwidth. The unit testing of the core Realm subsystems, which are runtime-dependent, requires refactoring. The integration tests need an audit, and ideally, we should understand what coverage will be lost by losing the Legion/Regent tests. We should probably apply more pressure to get this done.
