Change: remove AsyncSeek trait from snapshot #605

Closed

Conversation

zach-schoenberger
Contributor

@zach-schoenberger commented Nov 15, 2022

Snapshots don't really need to be AsyncSeek. Simplifying the trait bounds for a snapshot.
Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unit test is a friend :)


@zach-schoenberger force-pushed the no_seek branch 2 times, most recently from b649785 to 1b1f046 on November 17, 2022 at 20:00
Member

@drmingdrmer left a comment

The receiving end should not expect that snapshot segments will arrive in order. Thus seek will be required.

@zach-schoenberger
Contributor Author

I'm not sure I follow why. If that's the case then the receiving end might get a message with done == true before the snapshot has finished even in the current implementation. Would you mind explaining the scenario where that is possible?

@drmingdrmer
Member

I'm not sure I follow why. If that's the case then the receiving end might get a message with done == true before the snapshot has finished even in the current implementation. Would you mind explaining the scenario where that is possible?

You are right, the current implementation does not need AsyncSeek.

In the future, several snapshot segments may be transmitted simultaneously and received in arbitrary orders to speed up snapshot transfer.

Since this is a common optimization, for now it would be better not to assume segments are received in order.

make sense? :D

@zach-schoenberger
Contributor Author

If the project's plan is to send chunks of the same snapshot in parallel, then that makes sense. IMO that also seems like the kind of optimization that the user of the API should be making, or at least be given the option to make. Keeping AsyncSeek feels to me like it forces the underlying snapshot to be a single file or something that can fit in memory, which severely limits what users of the lib can do.

@schreter
Collaborator

Just my 2c:

I'm also having my share of thoughts about the snapshot handling, and I don't particularly like the stream-based API, especially the AsyncSeek part :-). Our snapshot is a completely different beast and can't be easily pressed into a stream. What we'll use is probably some kind of out-of-band transport mechanism, just sending a tiny piece of information about the snapshot, which is then handled in an application-specific way.

Speeding up the snapshot transfer in general makes sense - but then probably by using a better protocol. Even a "regular" TCP/IP connection can already transfer >=4GB/s with a single stream (see for example https://doc.tm.uka.de/2019-LCN-100g-tuning-authors-copy.pdf for some pointers). There are also newer protocols better optimized for high-speed transfer, so I wouldn't force using a simple stream, seekable or not.

@zach-schoenberger
Contributor Author

@schreter I'm pretty sure I'm in the same boat as you (RocksDB with large files). I was also looking at doing that approach or something similar, but it always seemed odd to have a massive timeout on sending the snapshot part (I'm still doing it anyway though). My original thought was to just let the user do whatever they want (#600), but it sounds like that exposes too much of the raft logic for the user to worry about/implement.

@drmingdrmer
Member

Looks like there are two options:

  • Option 1: C::SD (the snapshot data type) is the snapshot data itself.
  • Option 2: C::SD is just a manifest of snapshot segments.

For option 1, either the snapshot is treated as a sequence of bytes, as it is now, by making it seekable, or the application must define how snapshot data is organized, e.g., a list of RocksDB SSTables plus the level and key space of every SSTable.

For option 2, the RaftStorage implementation will take some time to download the segments. RaftCore has to block until it finishes installing the snapshot. If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

@drmingdrmer
Member

If the project's plan is to send chunks of the same snapshot in parallel, then that makes sense.

:D

IMO that also seems like the kind of optimization that the user of the API should be making, or at least be given the option to make. Keeping AsyncSeek feels to me like it forces the underlying snapshot to be a single file or something that can fit in memory, which severely limits what users of the lib can do.

IMHO, a trait such as AsyncSeek defines the internal structure of a snapshot. To remove this trait bound, RaftStorage has to provide several additional APIs for sending and receiving segments, e.g.:

read_snapshot_segment<SegmentId, Segment>(snapshot_id: SnapshotId, segment_id: SegmentId) -> Segment;

// Return true if all segments are received.
save_snapshot_segment<SegmentId, Segment>(snapshot_id: SnapshotId, segment_id: SegmentId, s: Segment) -> bool;

For @schreter's need, save_snapshot_segment() will download snapshot data in another tokio task.

Finally, when save_snapshot_segment() returns true, RaftCore checks the vote again and then decides whether to install the snapshot into the state machine.
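
To make that concrete, here is a rough sketch of how those two methods could look as async trait methods. Everything below is illustrative only: the trait name, the SegmentId/Segment associated types, and the SnapshotId/StorageError placeholders are assumptions, not an existing openraft API.

use async_trait::async_trait;

// Placeholders for illustration; openraft's real id/error types would be used instead.
type SnapshotId = String;
type StorageError = std::io::Error;

#[async_trait]
pub trait SegmentedSnapshotStorage {
    type SegmentId: Send;
    type Segment: Send;

    /// Leader side: read one segment of a snapshot to ship in an InstallSnapshot RPC.
    async fn read_snapshot_segment(
        &mut self,
        snapshot_id: &SnapshotId,
        segment_id: Self::SegmentId,
    ) -> Result<Self::Segment, StorageError>;

    /// Follower side: persist one received segment.
    /// Returns true once all segments of the snapshot have been received.
    async fn save_snapshot_segment(
        &mut self,
        snapshot_id: &SnapshotId,
        segment_id: Self::SegmentId,
        segment: Self::Segment,
    ) -> Result<bool, StorageError>;
}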

@zach-schoenberger
Contributor Author

Sorry I keep asking what feels like the same question, but for:

If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

What is the downside to relying on this? Was it that the snapshot streaming would require too much raft state management?

@drmingdrmer
Member

Sorry I keep asking what feels like the same question, but

It's OK :)

If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

What is the downside to relying on this? Was it that the snapshot streaming would require too much raft state management?

RaftStorage is implemented by an application developer. Openraft has no control over how it implements a method.
There is no way to force the implementation to check the vote before installing a snapshot.

Even without checking the vote, the application may still be able to pass all its tests. But there would actually be a bug there.

Thus the safe way, as it is done now, is to let RaftCore check the vote first and then call RaftStorage::install_snapshot() in a blocking mode, so that the vote won't change before install_snapshot() returns.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 20, 2022

I finally see where my confusion and the disconnect on this are! I thought that for option 2 above, the downloading of the segments and building of the local (remote) snapshot would be done before the install_snapshot call to the RaftNode, not in the RaftStorage implementation. Interesting idea about not having to persist the vote.

I like the idea of exposing more API abstractions around the snapshot. Maybe something that would combine RaftSnapshotBuilder and RaftStorage::begin_receiving_snapshot?
With something like this for the sending side (a rough sketch follows at the end of this comment):

  • build: build the snapshot
  • send: start sending the snapshot
  • finish: finalize the snapshot
  • cancel: cancel sending the snapshot (also used for snapshot cleanup)

The receiving side would implement the endpoint RPCs for the sender to send to, and would call RaftNode::install_snapshot on finalize.
With the goals being:

  • RaftCore does all the raft logic around snapshot state management, and can cancel a snapshot that is being sent
  • The receiver node doesn't have to block RaftCore and doesn't have to implement Raft logic, just whatever it needs to consolidate the snapshot in order to call RaftNode::install_snapshot in the finish call.
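
Roughly, the sending side could be a trait along these lines. This is purely a sketch of the proposal above; none of these names exist in openraft, and the SnapshotId/SnapshotError types are placeholders.

use async_trait::async_trait;

type SnapshotId = String; // placeholder
type SnapshotError = std::io::Error; // placeholder

#[async_trait]
pub trait SnapshotSender {
    type SnapshotData: Send;

    /// Build the snapshot to be sent.
    async fn build(&mut self) -> Result<(SnapshotId, Self::SnapshotData), SnapshotError>;

    /// Start sending the snapshot to the target node.
    async fn send(&mut self, id: &SnapshotId, data: Self::SnapshotData) -> Result<(), SnapshotError>;

    /// Finalize the snapshot once the receiver has everything.
    async fn finish(&mut self, id: &SnapshotId) -> Result<(), SnapshotError>;

    /// Cancel sending the snapshot; also used for snapshot cleanup.
    async fn cancel(&mut self, id: &SnapshotId) -> Result<(), SnapshotError>;
}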

@drmingdrmer
Member

@zach-schoenberger

Building a snapshot is simple. There won't be any change AFAIK. :)

To send a snapshot, first, the C::SD (SnapshotData) on the leader provides an API that returns an iterator of snapshot segments, e.g.: SnapshotData::segments() -> (SegmentId, SegmentData).

Then let each InstallSnapshot RPC transmit a segment.

On the receiving node, RaftStorage should provide an API to initiate customized snapshot receiving, e.g., RaftStorage::begin_receiving_snapshot() -> impl SnapshotReceive.

For every received segment, RaftCore checks the vote and then passes the segment to a user-defined receiving API: <T as SnapshotReceive>::receive(SegmentId, Segment).

When all segments are transmitted, the sending node sends another InstallSnapshot RPC to inform the receiving node to finalize and then install the snapshot.

The receiving node then calls <T as SnapshotReceive>::finalize() -> C::SD to rebuild a snapshot from the segments. If the application transmits only a tiny descriptor of a snapshot, this is the step where the actual snapshot data should be downloaded. Because this step does not modify any raft data, it does not block RaftCore.

The final step is that RaftCore checks the vote and then installs the snapshot data to replace the state machine.

I skipped some non-critical parts such as canceling a snapshot transmission.
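
Putting the receiving-side pieces above into a sketch (again, just an illustration of the idea; SnapshotReceive and these signatures are not part of openraft today, and the error type is a placeholder):

use async_trait::async_trait;

type SnapshotError = std::io::Error; // placeholder error type

// RaftStorage::begin_receiving_snapshot() would return an implementation of this trait.
#[async_trait]
pub trait SnapshotReceive {
    type SegmentId: Send;
    type Segment: Send;
    type SnapshotData: Send; // corresponds to C::SD

    /// Called by RaftCore for every received segment, after RaftCore has checked the vote.
    async fn receive(&mut self, id: Self::SegmentId, segment: Self::Segment) -> Result<(), SnapshotError>;

    /// Called when the sender signals that all segments have been transmitted.
    /// Rebuilds the snapshot from the segments, or downloads the real data if
    /// only a descriptor was transmitted. It does not touch any raft state,
    /// so it does not need to block RaftCore.
    async fn finalize(&mut self) -> Result<Self::SnapshotData, SnapshotError>;
}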

Any opinions? :)

@zach-schoenberger
Contributor Author

I think the above would solve most of the issues I've run into!

When all segments are transmitted, the sending node sends another InstallSnapshot RPC to inform the receiving node to finalize and then install the snapshot.
The receiving node then calls <T as SnapshotReceive>::finalize() -> C::SD to rebuild a snapshot from the segments. If the application transmits only a tiny descriptor of a snapshot, this is the step where the actual snapshot data should be downloaded. Because this step does not modify any raft data, it does not block RaftCore.

The final step is that RaftCore checks the vote and then installs the snapshot data to replace the state machine.

I just want to make sure I understand the difference between this and what is currently being done. By following this pattern, <T as SnapshotReceive>::finalize() -> C::SD would be called by the user directly to generate the snapshot data to install and then passed to RaftCore? If it is, I would say my only issue is that it seems a bit odd to be passing the individual segments into RaftCore and then still need access to the SnapshotReceive external to RaftCore to pass the finalize results. I guess one option there would be to have RaftCore expose a finalize call to invoke the SnapshotReceive call in its own task. But that also seems a bit odd looking.

@drmingdrmer
Member

I'd like an application to just handle RPCs by simply forwarding them to Raft, the control handle of RaftCore. Raft then passes the RPC request to RaftCore, and RaftCore should be the only component that accesses RaftStorage and SnapshotReceive.

I just want to make sure I understand the difference between this and what is currently being done. By following this pattern, <T as SnapshotReceive>::finalize() -> C::SD would be called by the user directly to generate the snapshot data to install and then passed to RaftCore?

In my design, it should be called by RaftCore when RaftCore receives an InstallSnapshot RPC indicating a finalize action.

You raise a good point: whether the snapshot-receiving process has finished can be returned by <T as SnapshotReceive>::receive(). The final RPC can be gotten rid of, I think.

If it is, I would say my only issue is that it seems a bit odd to be passing the individual segments into RaftCore and then still need access to the SnapshotReceive external to RaftCore to pass the finalize results.

Right, it's weird.

I guess one option there would be to have RaftCore expose a finalize call to invoke the SnapshotReceive call in its own task. But that also seems a bit odd looking.

Yes, the application calls Raft and Raft calls RaftCore, and then RaftCore calls RaftStorage.

If I didn't explain something clearly, just let me know:)

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 21, 2022

Got it. So <T as SnapshotReceive>::receive() -> C::SD would be the only method then? It would still need to be in its own tokio task to not block RaftCore, if I'm not mistaken.

@drmingdrmer
Member

Got it. So <T as SnapshotReceive>::receive() -> C::SD would be the only method then? It would still need to be in its own tokio task to not block RaftCore, if I'm not mistaken.

One of the options is to provide two APIs:
<T as SnapshotReceive>::receive() -> bool /* finished or not */ and <T as SnapshotReceive>::finalize() so that finalize() can run in another tokio-task.

:S
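
As a sketch, that variant of the SnapshotReceive idea from above would just change the signatures (illustrative only; not an existing openraft trait):

use async_trait::async_trait;

#[async_trait]
pub trait SnapshotReceive {
    type SegmentId: Send;
    type Segment: Send;
    type SnapshotData: Send;

    /// Returns true once all segments have been received.
    async fn receive(&mut self, id: Self::SegmentId, segment: Self::Segment) -> Result<bool, std::io::Error>;

    /// The potentially expensive rebuild/download step; can be spawned on its own tokio task.
    async fn finalize(&mut self) -> Result<Self::SnapshotData, std::io::Error>;
}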

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 21, 2022

One of the options is to provide two APIs: <T as SnapshotReceive>::receive() -> bool /* finished or not */ and <T as SnapshotReceive>::finalize() so that finalize() can run in another tokio-task.

Thinking about it a bit more, <T as SnapshotReceive>::finalize() -> C::SD is probably always necessary, as is the explicit finalize API call. Since the goal is to send the segments in parallel, you can't rely on the done field, since it can arrive out of order. Having <T as SnapshotReceive>::finalize() -> C::SD run in another task sounds good. It's an expensive thing to do, but snapshots themselves are much more expensive, so it shouldn't be noticeable.

@zach-schoenberger
Contributor Author

I think we resolved this.
