New API for storage based libfabric clients #7902

iziemba · 2022-07-27T00:03:00Z

iziemba
Jul 27, 2022
Collaborator

We have been looking to optimally support storage libfabric clients with the CXI provider. One common trend we are seeing with storage libfabric clients is they implement an RPC layer over libfabric, and the underlying libfabric operations the RPCs map to are similar.

For example, the following is a generalization of how the Mercury bulk RPC uses libfabric. Note: The following message sequence diagram assumes the provider requires FI_MR_ENDPOINT.

The client does the following:

Allocate a "use-once" MR for the server to read/write data from/to
Use messaging to send the RKEY to the server
Wait for the server to send a completion message back

The server does the following:

Use RMA and client RKEY to read/write payload
Use messaging to notify the client when the operation has completed

The main issue we are trying to optimize is to have the server avoid doing two network operations (RMA + message) to move data across the fabric and notify the client when this is completed. For a client bulk read operation, tagged messaging could be used to resolve this. The server can now do a fi_tsend to the client to move data and notify the client instead of having to do an RMA write + fi_send.

Using tagged messaging only helps the client bulk read operation. The bulk write still requires an RMA read + fi_send.

To address the bulk write issue, it seems like a new libfabric API is needed to collapse an RMA read + fi_send into a single behavior. Looking at the libfabric API, AMO operations have the ability to target a remote MR or a tagged receive buffer (FI_TAGGED AMO). Since FI_TAGGED AMO targets a receive buffer, this would seem to result in a two-sided AMO operation to a use-once buffer.

Building on the notion that an AMO operation can be a two-sided operation, allowing RMA to target a tagged receive buffer instead of a remote MR would allow RMA + fi_send operations to be collapsed into a single operation. In the above example, the RMA read + fi_send operations a server would have to issue for a client bulk write now become a single tagged RMA operation.
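
To make the comparison concrete, here is a minimal sketch that only counts the network operations the server issues per bulk transfer under each approach described above. The enum names and `server_ops()` are hypothetical labels for this illustration; none of them are libfabric identifiers, and the tagged RMA case reflects the proposal, not an existing API.

```c
#include <assert.h>

/* Hypothetical labels for this sketch; not libfabric identifiers. */
enum bulk_dir { BULK_READ, BULK_WRITE };   /* direction from the client's view */

enum scheme {
    SCHEME_RMA_PLUS_SEND,   /* today: RMA, then a completion message */
    SCHEME_TAGGED_SEND,     /* fi_tsend moves the data and notifies the client */
    SCHEME_TAGGED_RMA       /* proposed: RMA targets a tagged receive buffer */
};

/* Number of network operations the server issues for one bulk transfer. */
static int server_ops(enum bulk_dir dir, enum scheme s)
{
    switch (s) {
    case SCHEME_RMA_PLUS_SEND:
        return 2;   /* RMA write or read, followed by fi_send */
    case SCHEME_TAGGED_SEND:
        /* A send can only push data toward the client, so it helps bulk
         * read only; bulk write still needs RMA read + fi_send. */
        return dir == BULK_READ ? 1 : 2;
    case SCHEME_TAGGED_RMA:
        return 1;   /* the tagged write or read also notifies the client */
    }
    assert(0 && "unknown scheme");
    return -1;
}
```

Under these assumptions, `server_ops(BULK_WRITE, SCHEME_TAGGED_SEND)` is still 2, which is the gap the tagged RMA read is meant to close.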

Initial thoughts?

shefty · 2022-07-27T00:36:47Z

shefty
Jul 27, 2022
Maintainer

We should discuss this in an OFIWG call.

I had an email exchange on this topic yesterday, and the problem was described differently as being related to memory registration costs.

The current RMA APIs are insecure, particularly when unconnected endpoints are used. During the time that the RMA buffer is exposed, any number of peers could read/write the target buffer. Having a truly 'use-once' RMA buffer would help solve this. A more secure model would associate the buffer with a specific peer (for the RDM endpoint case). An FI_TAGGED_RMA feature could meet both of those requirements and would be easy to support by allowing the use of FI_TAGGED as a flag to the RMA calls. The tagged APIs work better for memory registration, in that the buffer can remain registered without the buffer itself being exposed to the network after the message is received.
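
As a rough illustration of the two properties mentioned above (use-once, and bound to a specific peer), the following sketch models the target-side check in plain C. `struct once_buf` and `once_buf_access()` are hypothetical and exist only for this example; a real provider or NIC would enforce this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a use-once target buffer; not a libfabric structure. */
struct once_buf {
    uint64_t tag;       /* tag the initiator must present */
    uint64_t peer;      /* the only peer allowed to use this buffer */
    void    *addr;      /* start of the buffer */
    size_t   len;       /* size of the buffer in bytes */
    bool     consumed;  /* set after the first successful access */
};

/* Returns true and consumes the buffer if the access is allowed.
 * Any later access, or an access from another peer, is rejected. */
static bool once_buf_access(struct once_buf *b, uint64_t peer, uint64_t tag,
                            size_t len)
{
    assert(b != NULL);
    if (b->consumed || peer != b->peer || tag != b->tag || len > b->len)
        return false;
    b->consumed = true;   /* the buffer is no longer exposed to the network */
    return true;
}
```

In contrast, a plain RMA region stays accessible to every peer that knows the key until the registration is closed.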

However, conceptually, a tagged write doesn't seem any different from the app using a tagged send call. Tagged atomics differ in that an operation is performed on the buffer, whereas writes just place the data into the buffer, same as a send. I need to understand the reason for wanting a tagged RMA write versus the app just using a tagged send.

A tagged read seems like it would be very confused at the network layer, particularly when it tries to handle retransmissions and errors. This is why today we don't have any notifications at the target as part of an RMA read operation or have read operations change state at the target.

In terms of memory registration costs, a possible alternative would be to introduce some flag to memory registration to indicate that the region is 'use-once', maybe with a new call to 'reset' the region. (I don't prefer this option, just mentioning it).

2 replies

iziemba Jul 28, 2022
Collaborator Author

We do have a memory registration cost issue with the CXI provider today. We are working on “optimizing” this by defining an FI_MR_PROV_KEY variant. This should help resolve the issue with Mercury. But it still raises the question of whether something better could be done here.

> An FI_TAGGED_RMA feature could meet both of those requirements and would be easy to support by allowing the use of FI_TAGGED as a flag to the RMA calls. The tagged APIs work better for memory registration, in that the buffer can remain registered without the buffer itself being exposed to the network after the message is received.

Security has been a concern of mine with the current RMA model. The only way to prevent independent clients from communicating with each other is to use separate authorization keys. The challenge here is the server would have to operate on multiple authorization keys. While this is doable (endpoint per authorization key), this approach may not scale. Nor is there an API to allow an endpoint to operate on multiple authorization keys (maybe something for the future?). Using tagged buffers may have some amount of value here.

> However, conceptually, a tagged write doesn't seem any different from the app using a tagged send call.

At face value, I agree. Thinking out loud a bit, I wonder if a tagged write could offer different ordering semantics than a tagged send.

> A tagged read seems like it would be very confused at the network layer, particularly when it tries to handle retransmissions and errors. This is why today we don't have any notifications at the target as part of an RMA read operation or have read operations change state at the target.

I can see how this can be a challenge. It seems like the sender would have to notify the target at the completion of the RMA read operation for event generation to happen. It seems like a provider and/or hardware could handle this.

> In terms of memory registration costs, a possible alternative would be to introduce some flag to memory registration to indicate that the region is 'use-once', maybe with a new call to 'reset' the region. (I don't prefer this option, just mentioning it).

That would help with the CXI provider fi_close() issue. But, at least for the CXI provider, fi_mr_enable() would still require MR resource configuration and would thus block. We are looking into some amount of caching for remote MRs.

Curious... Looking at the Portals 4.2 spec, it defines a PtlGet interface which generates events at both the initiator and target. Have there been previous discussions about defining a similar interface in libfabric? It seems like tagged RMA read could be such an interface.

shefty Jul 28, 2022
Maintainer

I haven't seen any use case for event generation at the target for read/get operations. From an implementation view, the most practical way to do this is for the initiator to send some delayed ack after it has received all of the read data. The delayed ack would itself need to be acked... This approach only tells the target that the data was received by the initiating NIC, which may not be sufficient. It also seems equivalent to the app fencing a send after the read. Basically, I'm not convinced it's useful.

Send and RMA ordering are tracked separately, so there could be something related to ordering for tagged RMA. Hmm... or maybe tagged RMA semantics could be different from tagged messages. For example, disallowing duplicate tags and the ignore bits must be 0, or allowing the provider to select the tag (aligning with the FI_MR_PROV_KEY value). These could allow for use-once RMA buffers without needing to iterate over a list of receive buffers. This is worth thinking about more.
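
To illustrate why those restrictions matter, here is a small sketch contrasting the two lookups. `tag_matches()` models the usual tagged-message rule, where bits set in `ignore` are not compared, so a receiver generally has to walk its posted receives. The table below it is a hypothetical exact-match scheme in which the provider picks the tag (here simply a slot index), so an incoming operation finds its buffer with a single index. All names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tagged-message style match: bits set in `ignore` are not compared. */
static bool tag_matches(uint64_t recv_tag, uint64_t ignore, uint64_t msg_tag)
{
    return ((recv_tag ^ msg_tag) & ~ignore) == 0;
}

/* Hypothetical exact-match table: no duplicate tags, no ignore bits. */
#define SLOTS 64

struct slot {
    bool  posted;   /* true while a use-once buffer occupies this slot */
    void *buf;
};

struct tag_table {
    struct slot slots[SLOTS];
};

/* Post a use-once buffer; the "provider" returns the tag to use, or -1 if
 * the table is full. A free list would avoid this scan on the post path. */
static int64_t table_post(struct tag_table *t, void *buf)
{
    assert(t != NULL);
    for (size_t i = 0; i < SLOTS; i++) {
        if (!t->slots[i].posted) {
            t->slots[i].posted = true;
            t->slots[i].buf = buf;
            return (int64_t)i;
        }
    }
    return -1;
}

/* Constant-time lookup on the incoming-operation path; consumes the slot. */
static void *table_consume(struct tag_table *t, uint64_t tag)
{
    assert(t != NULL);
    if (tag >= SLOTS || !t->slots[tag].posted)
        return NULL;
    t->slots[tag].posted = false;
    return t->slots[tag].buf;
}
```

The point of the sketch is only the shape of the lookup: with wildcards the match depends on every posted receive, while with provider-selected exact tags it depends on one slot.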

soumagne · 2022-07-27T22:18:25Z

soumagne
Jul 27, 2022

I think it is indeed mostly related to registration costs. Just to add some more background to that discussion. Messages are not sent for the sole purpose of transmitting the rkey to the server, there's also additional metadata that gets sent along with the rkey, i.e. other RPC arguments that can fit in an eager message that are not part of the user-owned buffer. For instance, as a simple client/server example, let's say we want to execute the following RPC to implement an I/O forwarder:

ssize_t write(int fd, const void *buf, size_t count);

fd, count and the MR key that describes buf (buf is registered by calling fi_mr_reg()) are first all packed together into a pre-allocated eager size buffer (owned by mercury) and then sent to the server using an unexpected msg send (by unexpected send I mean here an fi_tsend() call using an arbitrary tag value). At the same time, because that RPC is expecting a response, we also post an expected recv for the ssize_t return value (using fi_trecv() and a specific tag value that identifies the RPC).

On the other end, the server has already pre-posted an arbitrary number of unexpected recvs using eager size buffers to handle incoming RPCs. Once the server receives the message, it then unpacks those arguments and uses the MR key to do an fi_read() of the user buffer. Once the fi_read() completes, it can then execute the RPC locally and send the response (the ssize_t return value) back to the client using an expected send (fi_tsend()). Once the client receives it using the fi_trecv() it already pre-posted, it can unpack the return value of the RPC and return to the user, who can now assume that it is safe to re-use buf. Note that there's no real need in that particular case to signal the client earlier about the completion of the fi_read(), as we expect it to be notified as part of the RPC response.
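
For readers less familiar with this pattern, the packing step can be sketched roughly as follows. The `write_rpc_args` layout and the helper names are hypothetical and are not Mercury's actual wire format; the sketch also assumes both sides share the same byte order. Only the small arguments and the MR key travel in the eager buffer, while `buf` itself is left in place for the server to fetch with fi_read().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical arguments of the write() RPC as sent in the eager message. */
struct write_rpc_args {
    int32_t  fd;      /* file descriptor on the server side */
    uint64_t count;   /* number of bytes in the user buffer */
    uint64_t rkey;    /* MR key describing buf (e.g. from fi_mr_key()) */
};

#define WRITE_RPC_WIRE_SIZE (sizeof(int32_t) + 2 * sizeof(uint64_t))

/* Pack the arguments into the eager buffer.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t write_rpc_pack(const struct write_rpc_args *a, void *eager,
                             size_t eager_len)
{
    unsigned char *p = eager;

    assert(a != NULL && eager != NULL);
    if (eager_len < WRITE_RPC_WIRE_SIZE)
        return 0;
    memcpy(p, &a->fd, sizeof a->fd);        p += sizeof a->fd;
    memcpy(p, &a->count, sizeof a->count);  p += sizeof a->count;
    memcpy(p, &a->rkey, sizeof a->rkey);
    return WRITE_RPC_WIRE_SIZE;
}

/* Unpack on the server. Returns 0 on success, -1 on a short message. */
static int write_rpc_unpack(struct write_rpc_args *a, const void *eager,
                            size_t len)
{
    const unsigned char *p = eager;

    assert(a != NULL && eager != NULL);
    if (len < WRITE_RPC_WIRE_SIZE)
        return -1;
    memcpy(&a->fd, p, sizeof a->fd);        p += sizeof a->fd;
    memcpy(&a->count, p, sizeof a->count);  p += sizeof a->count;
    memcpy(&a->rkey, p, sizeof a->rkey);
    return 0;
}
```

After unpacking, the server would use `rkey` and `count` as the remote key and length of its fi_read(); that call is omitted here so the sketch does not depend on libfabric.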

In my opinion (please let me know if I missed something), we can't simply collapse operations, i.e. the RPC request and the RDMA read/write. They are done from buffers that are owned by separate entities and serve different purposes: additional metadata must be transmitted along with the user buffer, which remains const, and we can't memcpy that buffer either. Surely RDMA read/write can be implemented on top of two-sided operations, but we'd then add another couple of send/recv and lose the one-sided semantics that we currently have, so we'd not be collapsing operations but adding another one on the client. Also, data may be moved from device to server and server to device using FI_HMEM. To my understanding, this is only achieved using fi_write() / fi_read().

It is true though that buf is only a use-once type of buffer / MR, and when successively making RPC calls of this type, we currently just rely on the MR cache to negate the registration costs. We could use some type of flag to indicate that.
We of course have other cases where buffers are pre-registered and re-used; that's the ideal situation, but not all cases fall into that category.

Hope this helps in clarifying some of the current RPC aspects.

1 reply

iziemba Jul 28, 2022
Collaborator Author

This was useful.

> On the other end, the server has already pre-posted an arbitrary number of unexpected recvs using eager size buffers to handle incoming RPCs. Once the server receives the message, it then unpacks those arguments and uses the MR key to do an fi_read() of the user buffer. Once the fi_read() completes, it can then execute the RPC locally and send the response (the ssize_t return value) back to the client using an expected send (fi_tsend()).

This may be the biggest hurdle in Mercury for adopting a tagged RMA approach. Based on the above example, it may not work for Mercury.

A non-Mercury use case we have explored for tagged RMA had an RPC layer over the transport layer. The transport layer used a tagged RMA style of communication to move data between client and server. The RPC layer handled the execution and acking. Using a tagged RMA style within the transport layer allowed the client to receive notification sooner and progress its side of the transfer without having to wait for the target.

shefty · 2022-08-09T16:55:00Z

shefty
Aug 9, 2022
Maintainer

Notes from ofiwg on August 9:
Storage apps (particularly DAOS) use RPC communication semantics.
CXI needs FI_MR_ENDPOINT. Would combining the 3 needed calls (reg+bind+enable) help? Answer is not really -- cost there is in the noise relative to other factors. Use of a MR cache helps, but also exposes buffers to any peer capable of submitting RMA.
Current atomic tagged API does not change receiver semantics. That is, there's a single tagged queue. Do we have separate conceptual queues based on the tag? I.e. when calling fi_trecvmsg(), does app need to specify send vs rma target?
Would need to define how unexpected messages should work with tagged atomic/RMA.
How do we handle unexpected messages? Tie into resource mgmt setting?
Could we have the provider return what tag to use?
Tagged RMA does provide extra security -- buffer can only be used with a single operation.
If tagged RMA has separate ordering, then tagged write may be different than tagged send.
Could an offset be useful with tagged write? Versus just using different tag values?
Can RMA still use wildcard? ignore bits? The farther we move from existing tag semantics, the more this is something new.
What is the behavior of fi_trecv() when paired with an RMA read? Is a completion generated? Does this basically mean we have RMA read target completion (cq entry)? Can we just use the completion levels to indicate what the completion means?
Tagged write may be the same as tagged send -- need to see if there's any semantic difference. I.e. completion? ordering? unexpected message handling?
For the proposed use case, unexpected messages should not occur. I.e. tagged RMA read should always find a matching receive.
Do we have separate ordering for tagged RMA, or do we align it with RMA (seems to make most sense)?

shefty · 2022-08-09T17:08:55Z

shefty
Aug 9, 2022
Maintainer

My personal feedback is that I like the idea of having a single-use RMA target, and this seems to fit in nicely with the tagged API. Tagged buffers are locally accessed, so in theory do not need to be registered for remote RMA access. The bigger question is does this provide better semantics than tagged messages?

iziemba · 2022-08-09T17:30:42Z

iziemba
Aug 9, 2022
Collaborator Author

Libfabric Tagged RMA.pdf
Slides from 8/9/22 OFIWG meeting.

shefty · 2022-08-22T23:30:35Z

shefty
Aug 22, 2022
Maintainer

The target of a tagged RMA write behaves similarly to a tagged receive. The tagged receive buffer is consumed by a single write. The buffer is registered relative to meeting local receive buffer requirements (registered for FI_RECV access if needed by the provider). A completion entry at the target seems necessary as a default. Unexpected message handling at the target could either match that of tagged messages or that of an RMA request. If it's the first option, then tagged writes are very close to tagged sends.

Tagged RMA reads look to me like some sort of cross-breed. Conceptually, I keep wanting to think of this as a remote initiated send. For example, the target buffer is registered relative to meeting local send buffer requirements (FI_SEND access, if needed). That is, the buffer isn't just receiving data, so FI_RECV registration isn't enough. And the data is only sent in response to a tagged read, so full FI_REMOTE_READ isn't appropriate either. A completion generated at the target looks necessary as a default. But what is the completion semantic? I think it needs to match the standard transmit semantics (e.g. FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE), with the application able to specify which one at posting time. The catch here is that the initiator of the tagged read needs to be able to adapt based on what the target requires. This is a potentially significant deviation from other data transfers.

3 replies

iziemba Aug 29, 2022
Collaborator Author

I was thinking about this some more this week, and I agree that a target completion is needed. I think the completion semantic to make this interface useful is "the NIC is done doing a tagged RMA read to the recv buffer".

shefty Aug 29, 2022
Maintainer

The completion semantic needs to be in terms of data visibility and error handling. For example, what errors can occur after a completion is generated, can the app detect those errors, and what's needed to recover from them? Similarly, once a completion occurs, what does that mean in terms of the data being visible (in this case at the initiator)?

iziemba Sep 29, 2022
Collaborator Author

> For example, what errors can occur after a completion is generated, can the app detect those errors, and what's needed to recover from it?

From the target completion point of view, I think we could have three different completion levels.

Local NIC (target of tagged RMA) is done with tagged RX buffer and can be reused. The payload may be in local NIC, fabric, and/or remote NIC. Status of the operation at the initiator is unknown.
Local NIC is done with tagged RX buffer and data has been committed to remote NIC. Visibility of the data at the remote NIC is unknown.
Local NIC is done with tagged RX buffer and data is globally visible at remote NIC.

> Similarly, once a completion occurs, what does that mean in terms of the data being visible (in this case at the initiator)?

Seems like we should align this to what RMA read offers today?
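
As a sketch only, the three target completion levels listed above could be written down as an ordered enum, where each level implies everything the previous one guarantees. The names below are hypothetical and are not existing libfabric flags; how (or whether) they would map onto FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE is left open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical target-side completion levels for a tagged RMA read. */
enum tgt_comp_level {
    TGT_COMP_BUFFER_DONE = 1,    /* local NIC is done with the tagged RX buffer */
    TGT_COMP_REMOTE_COMMITTED,   /* ...and data is committed to the remote NIC */
    TGT_COMP_REMOTE_VISIBLE      /* ...and data is globally visible at the remote NIC */
};

/* What the target may assume when it sees a completion at a given level. */
struct tgt_comp_guarantee {
    bool buffer_reusable;          /* tagged RX buffer can be reused */
    bool committed_at_remote_nic;  /* payload has reached the remote NIC */
    bool visible_at_remote;        /* payload is globally visible remotely */
};

static struct tgt_comp_guarantee tgt_comp_describe(enum tgt_comp_level lvl)
{
    assert(lvl >= TGT_COMP_BUFFER_DONE && lvl <= TGT_COMP_REMOTE_VISIBLE);

    struct tgt_comp_guarantee g = {
        .buffer_reusable = true,   /* holds at every level */
        .committed_at_remote_nic = lvl >= TGT_COMP_REMOTE_COMMITTED,
        .visible_at_remote = lvl >= TGT_COMP_REMOTE_VISIBLE,
    };
    return g;
}
```

At the lowest level the status of the operation at the initiator is unknown, which is why only `buffer_reusable` is set there.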
