Change: remove AsyncSeek trait from snapshot #605

Closed

Conversation

zach-schoenberger
Contributor

@zach-schoenberger commented Nov 15, 2022

Snapshots don't really need to be AsyncSeek. Simplifying the trait bounds for a snapshot.
Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unit test is a friend :)


@zach-schoenberger force-pushed the no_seek branch 2 times, most recently from b649785 to 1b1f046 on November 17, 2022 at 20:00
Member

@drmingdrmer left a comment

The receiving end should not expect that snapshot segments will arrive in order. Thus seek will be required.

@zach-schoenberger
Contributor Author

I'm not sure I follow why. If that's the case then the receiving end might get a message with done == true before the snapshot has finished even in the current implementation. Would you mind explaining the scenario where that is possible?

@drmingdrmer
Member

I'm not sure I follow why. If that's the case then the receiving end might get a message with done == true before the snapshot has finished even in the current implementation. Would you mind explaining the scenario where that is possible?

You are right, the current implementation does not need AsyncSeek.

In the future, several snapshot segments may be transmitted simultaneously and received in arbitrary orders to speed up snapshot transfer.

Since this is a common optimization, for now it would be better not to assume segments are received in order.

make sense? :D

@zach-schoenberger
Contributor Author

If the project's plan is to send chunks of the same snapshot in parallel, then that makes sense. IMO that also seems like the kind of optimization that the user of the API should be making, or at least be given the option to make. Keeping AsyncSeek feels to me like it forces the underlying snapshot to be a single file or something that can fit in memory, which severely limits what users of the lib can do.

@schreter
Collaborator

Just my 2c:

I'm also having my share of thoughts about the snapshot handling, and I don't particularly like the stream-based API, especially the AsyncSeek part :-). Our snapshot is a completely different beast and can't be easily pressed into a stream. What we'll use is probably some kind of out-of-band transport mechanism, just sending a tiny piece of information about the snapshot, which is then handled in an application-specific way.

Speeding up the snapshot transfer in general makes sense - but then probably by using a better protocol. Even a "regular" TCP/IP connection can already transfer >=4GB/s with a single stream (see for example https://doc.tm.uka.de/2019-LCN-100g-tuning-authors-copy.pdf for some pointers). There are also newer protocols better optimized for high-speed transfer, so I wouldn't force using a simple stream, seekable or not.

@zach-schoenberger
Contributor Author

@schreter I'm pretty sure I'm in the same boat as you (RocksDB with large files). I was also looking at doing that approach or something similar, but it always seemed odd to have a massive timeout on sending the snapshot part (I'm still doing it anyway though). My original thought was to just let the user do whatever they want (#600), but it sounds like that exposes too much of the raft logic for the user to worry about/implement.

@drmingdrmer
Member

Looks like there are two options:

  • Option 1: C::SD (the snapshot data type) is the snapshot data itself.
  • Option 2: C::SD is just a manifest of snapshot segments.

For option 1, either the snapshot is treated as a sequence of bytes, as it is now, by making it seekable, or the application must define how snapshot data is organized, e.g., a list of RocksDB SSTables plus the level and key space of every SSTable.

For option 2, the RaftStorage implementation will take some time to download the segments. RaftCore has to block until it finishes installing the snapshot. If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

@drmingdrmer
Member

If the project's plan is to send chunks of the same snapshot in parallel, then that makes sense.

:D

IMO that also seems like the kind of optimization that the user of the API should be making, or at least be given the option to make. Keeping AsyncSeek feels to me like it forces the underlying snapshot to be a single file or something that can fit in memory, which severely limits what users of the lib can do.

IMHO, a trait such as AsyncSeek defines the internal structure of a snapshot. To remove this trait bound, RaftStorage has to provide several additional APIs for sending and receiving segments, e.g.:

read_snapshot_segment<SegmentId, Segment>(snapshot_id: SnapshotId, segment_id: SegmentId) -> Segment;

// Return true if all segments are received.
save_snapshot_segment<SegmentId, Segment>(snapshot_id: SnapshotId, segment_id: SegmentId, s: Segment) -> bool;

For @schreter's need, save_snapshot_segment() will download snapshot data in another tokio task.

Finally, when save_snapshot_segment() returns true, RaftCore checks the vote again and then decides whether to install the snapshot into the state machine.
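
To make that concrete, here is a rough sketch of how those two methods could look as async trait methods. Everything below is illustrative only: the trait name, the SegmentId/Segment associated types, and the SnapshotId/StorageError placeholders are assumptions, not an existing openraft API.

use async_trait::async_trait;

// Placeholders for illustration; openraft's real id/error types would be used instead.
type SnapshotId = String;
type StorageError = std::io::Error;

#[async_trait]
pub trait SegmentedSnapshotStorage {
    type SegmentId: Send;
    type Segment: Send;

    /// Leader side: read one segment of a snapshot to ship in an InstallSnapshot RPC.
    async fn read_snapshot_segment(
        &mut self,
        snapshot_id: &SnapshotId,
        segment_id: Self::SegmentId,
    ) -> Result<Self::Segment, StorageError>;

    /// Follower side: persist one received segment.
    /// Returns true once all segments of the snapshot have been received.
    async fn save_snapshot_segment(
        &mut self,
        snapshot_id: &SnapshotId,
        segment_id: Self::SegmentId,
        segment: Self::Segment,
    ) -> Result<bool, StorageError>;
}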

@zach-schoenberger
Contributor Author

Sorry I keep asking what feels like the same question, but for:

If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

What is the downside to relying on this? Was it that the snapshot streaming would require too much raft state management?

@drmingdrmer
Member

Sorry I keep asking what feels like the same question, but

It's OK :)

If we want RaftCore not to block on this operation, before installing a snapshot, RaftStorage has to check the vote to determine whether the snapshot is still legal to install.

What is the downside to relying on this? Was it that the snapshot streaming would require too much raft state management?

RaftStorage is implemented by an application developer. Openraft has no control over how it implements a method.
There is no way to force the implementation to check the vote before installing a snapshot.

Even without checking the vote, the application may still be able to pass all its tests. But there would actually be a bug there.

Thus the safe way, as it is done now, is to let RaftCore check the vote first and then call RaftStorage::install_snapshot() in a blocking mode, so that the vote won't change before install_snapshot() returns.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 20, 2022

I finally see where my confusion and the disconnect on this are! I thought that for option 2 above, the downloading of the segments and building of the local (remote) snapshot would be done before the install_snapshot call to the RaftNode, not in the RaftStorage implementation. Interesting idea about not having to persist the vote.

I like the idea of exposing more API abstractions around the snapshot. Maybe something that would combine RaftSnapshotBuilder and RaftStorage::begin_receiving_snapshot?
With something like this for the sending side (a rough sketch follows at the end of this comment):

  • build: build the snapshot
  • send: start sending the snapshot
  • finish: finalize the snapshot
  • cancel: cancel sending the snapshot (also used for snapshot cleanup)

The receiving side would implement the endpoint RPCs for the sender to send to, and would call RaftNode::install_snapshot on finalize.
With the goals being:

  • RaftCore does all the raft logic around snapshot state management, and can cancel a snapshot that is being sent
  • The receiver node doesn't have to block RaftCore and doesn't have to implement Raft logic, just whatever it needs to consolidate the snapshot in order to call RaftNode::install_snapshot in the finish call.
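
Roughly, the sending side could be a trait along these lines. This is purely a sketch of the proposal above; none of these names exist in openraft, and the SnapshotId/SnapshotError types are placeholders.

use async_trait::async_trait;

type SnapshotId = String; // placeholder
type SnapshotError = std::io::Error; // placeholder

#[async_trait]
pub trait SnapshotSender {
    type SnapshotData: Send;

    /// Build the snapshot to be sent.
    async fn build(&mut self) -> Result<(SnapshotId, Self::SnapshotData), SnapshotError>;

    /// Start sending the snapshot to the target node.
    async fn send(&mut self, id: &SnapshotId, data: Self::SnapshotData) -> Result<(), SnapshotError>;

    /// Finalize the snapshot once the receiver has everything.
    async fn finish(&mut self, id: &SnapshotId) -> Result<(), SnapshotError>;

    /// Cancel sending the snapshot; also used for snapshot cleanup.
    async fn cancel(&mut self, id: &SnapshotId) -> Result<(), SnapshotError>;
}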

@drmingdrmer
Member

@zach-schoenberger

Building a snapshot is simple. There won't be any change AFAIK. :)

To send a snapshot, first, the C::SD (SnapshotData) on the leader provides an API that returns an iterator of snapshot segments, e.g.: SnapshotData::segments() -> (SegmentId, SegmentData).

Then let each InstallSnapshot RPC transmit a segment.

On the receiving node, RaftStorage should provide an API to initiate customized snapshot receiving, e.g., RaftStorage::begin_receiving_snapshot() -> impl SnapshotReceive.

For every received segment, RaftCore checks the vote and then passes the segment to a user-defined receiving API: <T as SnapshotReceive>::receive(SegmentId, Segment).

When all segments are transmitted, the sending node sends another InstallSnapshot RPC to inform the receiving node to finalize and then install the snapshot.

The receiving node then calls <T as SnapshotReceive>::finalize() -> C::SD to rebuild a snapshot from the segments. If the application transmits only a tiny descriptor of a snapshot, this is the step where the actual snapshot data should be downloaded. Because this step does not modify any raft data, it does not block RaftCore.

The final step is that RaftCore checks the vote and then installs the snapshot data to replace the state machine.

I skipped some non-critical parts such as canceling a snapshot transmission.
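
Putting the receiving-side pieces above into a sketch (again, just an illustration of the idea; SnapshotReceive and these signatures are not part of openraft today, and the error type is a placeholder):

use async_trait::async_trait;

type SnapshotError = std::io::Error; // placeholder error type

// RaftStorage::begin_receiving_snapshot() would return an implementation of this trait.
#[async_trait]
pub trait SnapshotReceive {
    type SegmentId: Send;
    type Segment: Send;
    type SnapshotData: Send; // corresponds to C::SD

    /// Called by RaftCore for every received segment, after RaftCore has checked the vote.
    async fn receive(&mut self, id: Self::SegmentId, segment: Self::Segment) -> Result<(), SnapshotError>;

    /// Called when the sender signals that all segments have been transmitted.
    /// Rebuilds the snapshot from the segments, or downloads the real data if
    /// only a descriptor was transmitted. It does not touch any raft state,
    /// so it does not need to block RaftCore.
    async fn finalize(&mut self) -> Result<Self::SnapshotData, SnapshotError>;
}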

Any opinions? :)

@zach-schoenberger
Contributor Author

I think the above would solve most of the issues I've run into!

When all segments are transmitted, the sending node sends another InstallSnapshot RPC to inform the receiving node to finalize and then install the snapshot.
The receiving node then calls <T as SnapshotReceive>::finalize() -> C::SD to rebuild a snapshot from the segments. If the application transmits only a tiny descriptor of a snapshot, this is the step where the actual snapshot data should be downloaded. Because this step does not modify any raft data, it does not block RaftCore.

The final step is that RaftCore checks the vote and then installs the snapshot data to replace the state machine.

I just want to make sure I understand the difference between this and what is currently being done. By following this pattern, <T as SnapshotReceive>::finalize() -> C::SD would be called by the user directly to generate the snapshot data to install and then passed to RaftCore? If it is, I would say my only issue is that it seems a bit odd to be passing the individual segments into RaftCore and then still need access to the SnapshotReceive external to RaftCore to pass the finalize results. I guess one option there would be to have RaftCore expose a finalize call to invoke the SnapshotReceive call in its own task. But that also seems a bit odd looking.

@drmingdrmer
Member

I'd like an application to just handle RPCs by simply forwarding them to Raft, the control handle of RaftCore. Raft then passes the RPC request to RaftCore, and RaftCore should be the only component that accesses RaftStorage and SnapshotReceive.

I just want to make sure I understand the difference between this and what is currently being done. By following this pattern, <T as SnapshotReceive>::finalize() -> C::SD would be called by the user directly to generate the snapshot data to install and then passed to RaftCore?

In my design, it should be called by RaftCore when RaftCore receives an InstallSnapshot RPC indicating a finalize action.

You raise a good point: whether the snapshot-receiving process has finished can be returned by <T as SnapshotReceive>::receive(). The final RPC can be gotten rid of, I think.

If it is, I would say my only issue is that it seems a bit odd to be passing the individual segments into RaftCore and then still need access to the SnapshotReceive external to RaftCore to pass the finalize results.

Right, it's weird.

I guess one option there would be to have RaftCore expose a finalize call to invoke the SnapshotReceive call in its own task. But that also seems a bit odd looking.

Yes, the application calls Raft and Raft calls RaftCore, and then RaftCore calls RaftStorage.

If I didn't explain something clearly, just let me know:)

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 21, 2022

Got it. So <T as SnapshotReceive>::receive() -> C::SD would be the only method then? It would still need to be in its own tokio task to not block RaftCore, if I'm not mistaken.

@drmingdrmer
Member

Got it. So <T as SnapshotReceive>::receive() -> C::SD would be the only method then? It would still need to be in its own tokio task to not block RaftCore, if I'm not mistaken.

One of the options is to provide two APIs:
<T as SnapshotReceive>::receive() -> bool /* finished or not */ and <T as SnapshotReceive>::finalize() so that finalize() can run in another tokio-task.

:S
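
As a sketch, that variant of the SnapshotReceive idea from above would just change the signatures (illustrative only; not an existing openraft trait):

use async_trait::async_trait;

#[async_trait]
pub trait SnapshotReceive {
    type SegmentId: Send;
    type Segment: Send;
    type SnapshotData: Send;

    /// Returns true once all segments have been received.
    async fn receive(&mut self, id: Self::SegmentId, segment: Self::Segment) -> Result<bool, std::io::Error>;

    /// The potentially expensive rebuild/download step; can be spawned on its own tokio task.
    async fn finalize(&mut self) -> Result<Self::SnapshotData, std::io::Error>;
}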

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 21, 2022

One of the options is to provide two APIs: <T as SnapshotReceive>::receive() -> bool /* finished or not */ and <T as SnapshotReceive>::finalize() so that finalize() can run in another tokio-task.

Thinking about it a bit more, <T as SnapshotReceive>::finalize() -> C::SD is probably always necessary, as is the explicit finalize API call. Since the goal is to send the segments in parallel, you can't rely on the done field, since it can arrive out of order. Having <T as SnapshotReceive>::finalize() -> C::SD run in another task sounds good. It's an expensive thing to do, but snapshots themselves are much more expensive, so it shouldn't be noticeable.

@zach-schoenberger
Contributor Author

I think we resolved this.
