Skip to content

Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

MattKotzbauer
Copy link
Contributor

GH Issue: #173

Extends Lighthouse to to support multiple independent quorums on a single server by tagging each gRPC call with a room-id metadata header and feeding requests through a lightweight router that maintains per‐room state with a DashMap<String, Arc<Lighthouse>>. LighthouseClient now accepts an optional room_id argument and automatically injects the corresponding metadata header into each heartbeat and quorum request, while untagged calls continue to use a default namespace. In multi_quorum_test.py, I created 2 clients with distinct room ID's and have them form independent quorums on the same server port.

Open to making any changes to the code or approach 🤙.

Warmly,
Matt

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 5, 2025
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together! This looks very promising and excited to have this in torchft.

I think there's a couple of ways we can make this cleaner/more generic that I've commented on

let room = self.room(&id).await;
<Arc<Lighthouse> as LighthouseService>::heartbeat(&room, req).await
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as is since this is fairly minimal boilerplate per request but I think we can do even better.

By doing this at the Service layer instead of LighthouseService layer we can have it automatically work for all endpoints on the LighthouseService

Can you look into this and see how feasible it is? If it's not any cleaner we can land this as is

Some pointers:

There's also https://github.com/teimuraz/tonic-middleware which might be useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried an initial attempt to do the routing at the Service layer rather than the LighthouseService layer, but have had trouble adapting between the initial tonic message types (tonic::Request/Response) and the Tower message types (http::Request/Response) - tonic::Request/Response wraps the body in tonic::body::BoxBody and carries gRPC-specific extensions, while the Tower stack we’re intercepting expects a bare http::Request/Response<B> where the body implements HttpBody. I haven't yet found a concise way to do this.

If I were to keep at this, I'd see if I could get something working that relies more on tonic-middleware - perhaps there's a way to stay entirely in the tonic domain that keeps the implementation and debugging cleaner?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixing the two is a bit tricky -- we probably need to stay at the tower layer. Why do you need to access the tonic::Request/Response objects? It's all HTTP at the end of the day so seems like we should be able to operate at the tower/http layer and view the metadata as a header?

middleware might work though it may be too high level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, it became easier when I had router.rs operate entirely at the tower layer rather than trying to mix Service and tonic. Most recent commit has router.rs at the tower level, which lets us start the lighthouse server with a call to
Server::builder().add_service(router).serve(addr)

Matt Kotzbauer and others added 2 commits May 7, 2025 15:57
…:builder calls (in src/bin/lighthouser.rs, src/lib.rs) and torchft/multi_quorum_test.py modified to reflect change.
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MattKotzbauer this is looking great and thanks for pushing this through! Just need some small cleanups

@@ -0,0 +1,12 @@
use tonic::{Request, Status, service::Interceptor};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's unintentional, removing in next commit

@@ -654,7 +689,7 @@ impl LighthouseServer {
/// Returns:
/// str: The address of the lighthouse server.
fn address(&self) -> PyResult<String> {
Ok(self.lighthouse.address().to_string())
Ok(self.bind.clone())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this unfortunately isn't sufficient -- bind could be something like "0.0.0.0:0" which will bind to a random port. Address needs to be the routable http address i.e. http://foo.bar:1324

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, perhaps we could use similar calls as the Lighthouse class uses to resolve host IP and address? Will include a version of this in next commit, though am also down to change it

/// gRPC server for a single room (inner state = `Arc<Lighthouse>`).
type GrpcSvc = LighthouseServiceServer<Arc<Lighthouse>>;

#[derive(Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does Router need to be Cloneable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly made Router Cloneable so that calls to tonic's add_service would compile when constructing the LighthouseServer in src/bin/lighthouse.rs and src/lib.rs

src/router.rs Outdated
}

// Build room state once.
let lh = Lighthouse::new(tmpl.clone())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we pass in the id into Lighthouse so we can prepend it to the Lighthouse log messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will include in next commit


import pytest

import torchft._torchft as ext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the torchft.coordination API for this test instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg, including in next commit

@@ -0,0 +1,44 @@
from __future__ import annotations

import datetime as _dt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually just do from datetime import timedelta

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg, including in next commit (moving test into lighthouse_test.py and will use existing imports)

src/router.rs Outdated
rooms: Arc<DashMap<String, Arc<GrpcSvc>>>,
tmpl: LighthouseOpt,
id: &str,
) -> Arc<GrpcSvc> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be typed Arc<LighthouseServiceServer> instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds good - am changing in line with the below thread (returning an Arc and then wrapping with a LighthouseServiceServer once the method returns).

src/router.rs Outdated
.await
.expect("failed to create Lighthouse");

let svc_new = Arc::new(LighthouseServiceServer::new(lh));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be just returning Arc from this method and constructing the LighthouseServiceServer wrapper on demand so we don't need to clone it in the parent method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh true, will include in next commit

@@ -0,0 +1,44 @@
from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could move this test to lighthouse_test.py to keep it with the rest of the lighthouse tests

…to Arc<Lighthouse>, Lighthouse::new now takes id prefix, test relocated to lighthouse_test.py and now uses coordination API, LighthouseServer now resolves host/port from the bound socket to give a routable http://host:port address
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@d4l3k
Copy link
Member

d4l3k commented May 28, 2025

@MattKotzbauer looks like lint and unit tests are failing, not sure if that's related to this PR though

@d4l3k d4l3k force-pushed the HTTP/2_Multi_Room_Lighthouse branch from dc92b42 to 53ec8be Compare May 29, 2025 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants