Replies: 6 comments 6 replies
-
We should discuss this in an OFIWG call. I had an email exchange on this topic yesterday, and the problem was described differently, as being related to memory registration costs.

The current RMA APIs are insecure, particularly when unconnected endpoints are used. While the RMA buffer is exposed, any number of peers could read or write the target buffer. Having a truly 'use-once' RMA buffer would help solve this. A more secure model would associate the buffer with a specific peer (for the RDM endpoint case). An FI_TAGGED_RMA feature could meet both of those requirements and would be easy to support by allowing the use of FI_TAGGED as a flag to the RMA calls. The tagged APIs work better for memory registration, in that the buffer can remain registered without the buffer itself being exposed to the network after the message is received.

However, conceptually, a tagged write doesn't seem any different from the app using a tagged send call. Tagged atomics differ in that an operation is performed on the buffer, whereas writes just place the data into the buffer, same as a send. I need to understand the reason for wanting a tagged RMA write versus the app just using a tagged send.

A tagged read seems like it would be very confusing at the network layer, particularly when it tries to handle retransmissions and errors. This is why today we don't have any notifications at the target as part of an RMA read operation, and why read operations don't change state at the target.

In terms of memory registration costs, a possible alternative would be to introduce some flag to memory registration to indicate that the region is 'use-once', maybe with a new call to 'reset' the region. (I don't prefer this option, just mentioning it.)
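To make the 'use-once' and peer-bound ideas concrete, here is a minimal sketch of the target-side check they imply. It is a toy model in plain C, not libfabric code: `struct tagged_target`, `tagged_write`, and the integer peer identifier are all hypothetical names invented for illustration, and no such API exists in libfabric today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a tagged RMA target; not a libfabric type. */
struct tagged_target {
    uint64_t tag;      /* tag the initiator must present */
    uint64_t peer;     /* the only peer allowed to access the buffer */
    void    *buf;      /* target buffer */
    size_t   len;      /* size of the target buffer in bytes */
    bool     consumed; /* set once a matching write has landed */
};

/*
 * Apply a write from src_peer to the target.
 * Returns 0 on success, -1 if the write is rejected because the buffer
 * was already used, the peer or tag does not match, or the data is too large.
 */
static int tagged_write(struct tagged_target *t, uint64_t src_peer,
                        uint64_t tag, const void *data, size_t len)
{
    assert(t != NULL && data != NULL);

    if (t->consumed || src_peer != t->peer || tag != t->tag || len > t->len)
        return -1;

    memcpy(t->buf, data, len);
    t->consumed = true; /* use-once: the buffer is no longer exposed */
    return 0;
}
```

In this model a second write with the same tag is rejected, as is any write from a different peer, which is the behavior a regular RMA key cannot enforce on an unconnected endpoint.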
-
I think it is indeed mostly related to registration costs. Just to add some more background to that discussion: messages are not sent for the sole purpose of transmitting the rkey to the server. Additional metadata gets sent along with the rkey, i.e. other RPC arguments that fit in an eager message and are not part of the user-owned buffer. For instance, as a simple client/server example, let's say we want to execute the following RPC to implement an I/O forwarder: ssize_t write(int fd, const void *buf, size_t count);
On the other end, the server has already pre-posted an arbitrary number of unexpected recvs using eager-size buffers to handle incoming RPCs. Once the server receives the message, it then unpacks those arguments and uses the MR key to do an In my opinion, but please let me know if I missed something, we can't simply collapse operations, i.e. the RPC request and the RDMA read/write. Those are done from buffers that are owned by separate entities and serve different purposes, as there's also additional metadata that must be transmitted along with the user-buffer that remains It is true though that Hope this helps in clarifying some of the current RPC aspects.
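As a rough illustration of the request described in this comment, the sketch below packs the small RPC arguments of `write()` together with a description of the user buffer into an eager-size message. The layout is an assumption made up for this example: `struct write_rpc_hdr`, its fields, and both helper functions are hypothetical and do not reflect the wire format of any real RPC library. It also assumes both ends share the same ABI, so no byte-order conversion is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical eager-message header for the write() RPC. */
struct write_rpc_hdr {
    int32_t  fd;    /* RPC argument: file descriptor on the server */
    uint64_t count; /* RPC argument: number of bytes to write */
    uint64_t rkey;  /* MR key describing the client's user buffer */
    uint64_t addr;  /* address or offset of the user buffer in that MR */
};

/* Copy the header into an eager buffer. Returns bytes used, 0 if too small. */
static size_t pack_write_rpc(void *eager, size_t eager_len,
                             const struct write_rpc_hdr *hdr)
{
    assert(eager != NULL && hdr != NULL);
    if (eager_len < sizeof(*hdr))
        return 0;
    memcpy(eager, hdr, sizeof(*hdr));
    return sizeof(*hdr);
}

/* Server side: recover the header from a received eager message. */
static int unpack_write_rpc(const void *eager, size_t msg_len,
                            struct write_rpc_hdr *out)
{
    assert(eager != NULL && out != NULL);
    if (msg_len < sizeof(*out))
        return -1;
    memcpy(out, eager, sizeof(*out));
    return 0;
}
```

The point of the sketch is that `fd` and `count` travel in a buffer owned by the RPC layer, while `buf` itself stays in user memory and is only described by `rkey` and `addr`.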
-
Notes from ofiwg on August 9:
-
My personal feedback is that I like the idea of having a single-use RMA target, and this seems to fit in nicely with the tagged API. Tagged buffers are locally accessed, so in theory they do not need to be registered for remote RMA access. The bigger question is whether this provides better semantics than tagged messages.
-
Libfabric Tagged RMA.pdf
-
The target of a tagged RMA write behaves similarly to a tagged receive. The tagged receive buffer is consumed by a single write. The buffer is registered relative to meeting local receive buffer requirements (registered for FI_RECV access if needed by the provider). A completion entry at the target seems necessary as a default. Unexpected message handling at the target could either match that of tagged messages or that of an RMA request. If it's the first option, then tagged writes are very close to tagged sends.

Tagged RMA reads look to me like some sort of cross-breed. Conceptually, I keep wanting to think of this as a remote-initiated send. For example, the target buffer is registered relative to meeting local send buffer requirements (FI_SEND access, if needed). That is, the buffer isn't just receiving data, so FI_RECV registration isn't enough. And the data is only sent in response to a tagged read, so full FI_REMOTE_READ isn't appropriate either. A completion generated at the target looks necessary as a default. But what is the completion semantic? I think it needs to match the standard transmit semantics (e.g. FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE), with the application able to specify which one at posting time. The catch here is that the initiator of the tagged read needs to be able to adapt based on what the target requires. This is a potentially significant deviation from other data transfers.
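The registration requirements discussed in this comment can be summarized as a small lookup, sketched below. This is only an illustration of the proposal as worded here: the enum values are local stand-ins named after the libfabric access flags FI_SEND, FI_RECV, FI_REMOTE_READ, and FI_REMOTE_WRITE (their numeric values are not the real ones), and `enum target_kind` and `required_access` are hypothetical.

```c
#include <assert.h>

/* Local stand-ins for libfabric MR access flags; values are illustrative. */
enum access {
    ACC_SEND         = 1 << 0, /* stands in for FI_SEND */
    ACC_RECV         = 1 << 1, /* stands in for FI_RECV */
    ACC_REMOTE_READ  = 1 << 2, /* stands in for FI_REMOTE_READ */
    ACC_REMOTE_WRITE = 1 << 3  /* stands in for FI_REMOTE_WRITE */
};

/* What kind of operation the buffer is the target of. */
enum target_kind {
    TGT_RMA_WRITE,    /* existing RMA write target */
    TGT_RMA_READ,     /* existing RMA read target */
    TGT_TAGGED_WRITE, /* proposed: behaves like a tagged receive */
    TGT_TAGGED_READ   /* proposed: behaves like a remote-initiated send */
};

/*
 * Access the target buffer would need, if the provider requires
 * registration for local operations at all. Returns 0 for unknown kinds.
 */
static unsigned required_access(enum target_kind kind)
{
    switch (kind) {
    case TGT_RMA_WRITE:    return ACC_REMOTE_WRITE;
    case TGT_RMA_READ:     return ACC_REMOTE_READ;
    case TGT_TAGGED_WRITE: return ACC_RECV;
    case TGT_TAGGED_READ:  return ACC_SEND;
    }
    return 0;
}
```

Under this reading, neither tagged target needs remote access rights, which is where the registration savings and the reduced exposure would come from.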
-
We have been looking to optimally support storage libfabric clients with the CXI provider. One common trend we are seeing with storage libfabric clients is they implement an RPC layer over libfabric, and the underlying libfabric operations the RPCs map to are similar.
For example, the following is a generalization of how the Mercury bulk RPC uses libfabric. Note: The following message sequence diagram assumes the provider requires FI_MR_ENDPOINT.
The client does the following:
The server does the following:
The main issue we are trying to optimize is to have the server avoid doing two network operations (RMA + message) to move data across the fabric and notify the client when this is completed. For a client bulk read operation, tagged messaging could be used to resolve this. The server can now do a fi_tsend to the client to move data and notify the client instead of having to do an RMA write + fi_send.
Using tagged messaging only helps the client bulk read operation. The bulk write still requires an RMA read + fi_send.
To address the bulk write issue, it seems like a new libfabric API is needed to collapse an RMA read + fi_send into a single behavior. Looking at the libfabric API, AMO operations have the ability to target a remote MR or a tagged receive buffer (FI_TAGGED AMO). Since FI_TAGGED AMO targets a receive buffer, this would seem to result in a two-sided AMO operation to a use-once buffer.
Building on the notion that an AMO operation can be a two-sided operation, allowing RMA to target a tagged receive buffer instead of a remote MR would allow RMA + fi_send operations to be collapsed into a single operation. In the above example, the RMA read + fi_send operations a server would have to issue for a client bulk write now become a single tagged RMA operation.
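The operation counts argued above can be tallied in a small sketch. It is a toy model only: the enum and function names are hypothetical, and the count of one operation for a bulk read under tagged RMA is an assumption (the server could use either fi_tsend or a tagged RMA write).

```c
#include <assert.h>

/* Direction of the bulk transfer, from the client's point of view. */
enum bulk_dir { BULK_READ, BULK_WRITE };

/* How the server moves the data and notifies the client. */
enum scheme {
    SCHEME_RMA_PLUS_SEND, /* today: RMA read/write followed by fi_send */
    SCHEME_TAGGED_MSG,    /* fi_tsend where possible */
    SCHEME_TAGGED_RMA     /* proposed: RMA targeting a tagged receive buffer */
};

/* Number of network operations the server issues per bulk transfer. */
static int server_network_ops(enum bulk_dir dir, enum scheme s)
{
    switch (s) {
    case SCHEME_RMA_PLUS_SEND:
        return 2; /* RMA operation + fi_send notification */
    case SCHEME_TAGGED_MSG:
        /* fi_tsend covers a bulk read; a bulk write still needs RMA read + fi_send */
        return dir == BULK_READ ? 1 : 2;
    case SCHEME_TAGGED_RMA:
        return 1; /* data movement and notification in one operation (assumed for reads) */
    }
    return -1;
}
```

The only case that tagged messaging alone leaves at two operations is the client bulk write, which is the gap tagged RMA is meant to close.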
Initial thoughts?