-
Notifications
You must be signed in to change notification settings - Fork 33
Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189
Conversation
…tiple quorums on a single LighthouseServer (pytorch#173)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together! This looks very promising and excited to have this in torchft.
I think there's a couple of ways we can make this cleaner/more generic that I've commented on
let room = self.room(&id).await; | ||
<Arc<Lighthouse> as LighthouseService>::heartbeat(&room, req).await | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine as is since this is fairly minimal boilerplate per request but I think we can do even better.
By doing this at the Service layer instead of LighthouseService layer we can have it automatically work for all endpoints on the LighthouseService
Can you look into this and see how feasible it is? If it's not any cleaner we can land this as is
Some pointers:
- https://docs.rs/tower-service/0.3.3/tower_service/trait.Service.html
- https://github.com/hyperium/tonic/blob/b303caa52ba8bbe8172310be7165a80b7c2a53f8/examples/src/tower/server.rs#L83-L109
There's also https://github.com/teimuraz/tonic-middleware which might be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried an initial attempt to do the routing at the Service layer rather than the LighthouseService layer, but have had trouble adapting between the initial tonic message types (tonic::Request/Response
) and the Tower message types (http::Request/Response
) - tonic::Request/Response
wraps the body in tonic::body::BoxBody
and carries gRPC-specific extensions, while the Tower stack we’re intercepting expects a bare http::Request/Response<B>
where the body implements HttpBody
. I haven't yet found a concise way to do this.
If I were to keep at this, I'd see if I could get something working that relies more on tonic-middleware
- perhaps there's a way to stay entirely in the tonic
domain that keeps the implementation and debugging cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mixing the two is a bit tricky -- we probably need to stay at the tower layer. Why do you need to access the tonic::Request/Response objects? It's all HTTP at the end of the day so seems like we should be able to operate at the tower/http layer and view the metadata as a header?
middleware might work though it may be too high level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, it became easier when I had router.rs operate entirely at the tower layer rather than trying to mix Service and tonic. Most recent commit has router.rs at the tower level, which lets us start the lighthouse server with a call to
Server::builder().add_service(router).serve(addr)
…ng add_room_header for each RPC call
…:builder calls (in src/bin/lighthouser.rs, src/lib.rs) and torchft/multi_quorum_test.py modified to reflect change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MattKotzbauer this is looking great and thanks for pushing this through! Just need some small cleanups
src/interceptor.rs~
Outdated
@@ -0,0 +1,12 @@ | |||
use tonic::{Request, Status, service::Interceptor}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's unintentional, removing in next commit
@@ -654,7 +689,7 @@ impl LighthouseServer { | |||
/// Returns: | |||
/// str: The address of the lighthouse server. | |||
fn address(&self) -> PyResult<String> { | |||
Ok(self.lighthouse.address().to_string()) | |||
Ok(self.bind.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this unfortunately isn't sufficient -- bind could be something like "0.0.0.0:0" which will bind to a random port. Address needs to be the routable http address i.e. http://foo.bar:1324
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, perhaps we could use similar calls as the Lighthouse class uses to resolve host IP and address? Will include a version of this in next commit, though am also down to change it
/// gRPC server for a single room (inner state = `Arc<Lighthouse>`). | ||
type GrpcSvc = LighthouseServiceServer<Arc<Lighthouse>>; | ||
|
||
#[derive(Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does Router need to be Cloneable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly made Router Cloneable so that calls to tonic's add_service would compile when constructing the LighthouseServer in src/bin/lighthouse.rs and src/lib.rs
src/router.rs
Outdated
} | ||
|
||
// Build room state once. | ||
let lh = Lighthouse::new(tmpl.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pass in the id
into Lighthouse so we can prepend it to the Lighthouse log messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, will include in next commit
torchft/multi_quorum_test.py
Outdated
|
||
import pytest | ||
|
||
import torchft._torchft as ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use the torchft.coordination API for this test instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg, including in next commit
torchft/multi_quorum_test.py
Outdated
@@ -0,0 +1,44 @@ | |||
from __future__ import annotations | |||
|
|||
import datetime as _dt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually just do from datetime import timedelta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg, including in next commit (moving test into lighthouse_test.py and will use existing imports)
src/router.rs
Outdated
rooms: Arc<DashMap<String, Arc<GrpcSvc>>>, | ||
tmpl: LighthouseOpt, | ||
id: &str, | ||
) -> Arc<GrpcSvc> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be typed Arc<LighthouseServiceServer>
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds good - am changing in line with the below thread (returning an Arc and then wrapping with a LighthouseServiceServer once the method returns).
src/router.rs
Outdated
.await | ||
.expect("failed to create Lighthouse"); | ||
|
||
let svc_new = Arc::new(LighthouseServiceServer::new(lh)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be just returning Arc from this method and constructing the LighthouseServiceServer wrapper on demand so we don't need to clone it in the parent method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh true, will include in next commit
torchft/multi_quorum_test.py
Outdated
@@ -0,0 +1,44 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could move this test to lighthouse_test.py
to keep it with the rest of the lighthouse tests
…to Arc<Lighthouse>, Lighthouse::new now takes id prefix, test relocated to lighthouse_test.py and now uses coordination API, LighthouseServer now resolves host/port from the bound socket to give a routable http://host:port address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@MattKotzbauer looks like lint and unit tests are failing, not sure if that's related to this PR though |
dc92b42
to
53ec8be
Compare
GH Issue: #173
Extends Lighthouse to to support multiple independent quorums on a single server by tagging each gRPC call with a
room-id
metadata header and feeding requests through a lightweight router that maintains per‐room state with aDashMap<String, Arc<Lighthouse>>
.LighthouseClient
now accepts an optionalroom_id
argument and automatically injects the corresponding metadata header into eachheartbeat
andquorum
request, while untagged calls continue to use a default namespace. Inmulti_quorum_test.py
, I created 2 clients with distinct room ID's and have them form independent quorums on the same server port.Open to making any changes to the code or approach 🤙.
Warmly,
Matt