Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fabric: Add fi_hmem_attr to fi_info #10400

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions include/rdma/fabric.h
Original file line number Diff line number Diff line change
Expand Up @@ -360,6 +360,22 @@ enum {
FI_TC_NETWORK_CTRL,
};

enum fi_hmem_iface {
FI_HMEM_SYSTEM = 0,
FI_HMEM_CUDA,
FI_HMEM_ROCR,
FI_HMEM_ZE,
FI_HMEM_NEURON,
FI_HMEM_SYNAPSEAI,
};

enum fi_hmem_attr_opt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You use the work 'opt' here because there are setopt calls to change some of these values. But these hit me more as configuration settings, than attributes. See below.

FI_HMEM_ATTR_UNSPEC = 0,
FI_HMEM_ATTR_REQUIRED,
FI_HMEM_ATTR_PREFERRED,
FI_HMEM_ATTR_DISABLED,
};

static inline uint32_t fi_tc_dscp_set(uint8_t dscp)
{
return ((uint32_t) dscp) | FI_TC_DSCP;
Expand Down Expand Up @@ -465,6 +481,14 @@ struct fi_fabric_attr {
uint32_t api_version;
};

struct fi_hmem_attr {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is acting as a bool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields will be both input and output. Since a boolean can represent either false or an unspecified value, FI_HMEM_ATTR_UNSPEC is added to differentiate them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either the provider can call the GPU API or not. There's not a third state here.

These settings are dictating to the provider how it must implement data transfers to/from GPU buffers. For some providers, it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.) There' a significant difference between these variables and other attribute values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.)

I think that is part of our goal for making it in fi_info because it can help the provider filtering. For provider that doesn't support PCIe peer-to-peer, like shmem, we can use such configuration to filter it early in fi_info. You remember we have had challenges to toggle shmem on/off inside EFA provider without using environment variables. You believed shm usage is a data transfer level decision so making it as a ep level setopt() makes more sense. The option can be either a general FI_OPT_SHARED_MEMORY_PERMITTED or FI_OPT_HMEM_P2P

We currently use FI_OPT_SHARED_MEMORY_PERMITTED, but such toggle in ep level is still too late for us because we have created shm info/domain/av/cq/mr earlier. Cleaning all of them at an ep call is troublesome and also error-prone.

Making such toggle as early as in fi_info level can resolve such challenge.

There' a significant difference between these variables and other attribute values.

I agree on this point. If possible, we can consider alternatives to move them to appropriate attribute groups (like domain/tx/rx/ep_attr) separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree having this information up front is useful. I disagree that these are extended attributes. They're something else. Environment variables are the closest thing to what this intends to capture.

Imagine 2 upper libraries (e.g. MPI and NCCL) calling libfabric. These libraries could drive a provider in different directions. NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work. There are global settings at play here, not per domain or endpoint settings. These settings may not even be per provider. NCCL might be using ucx, but MPI verbs, yet the restrictions from NCCL need to carry over both providers.

enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are more along the lines of a configuration than an attribute. The app is wanting to configure the provider behavior here, versus discovering provider capabilities.

I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.

My first thought is this should somehow be linked to memory registration, since that's ultimately where the provider is making decisions on how to perform a data transfer. And the above configuration settings are basically letting the provider know what sort of optimizations it can perform when a data buffer is located in hmem.

Maybe there should be a more involved set of APIs to query and configure memory registration related functionality. I can see the MR cache being part of this.

Copy link
Contributor

@shijin-aws shijin-aws Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.

We were debating internally on that. If we are confident there wouldn't be > 1 hmem type, we can get rid of the linked list structure. We start from such implementation for flexibility of multiple hmem iface, and we are open to feedback on that.

Copy link
Contributor Author

@jiaxiyan jiaxiyan Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to add it to fi_info for the following reasons:

  1. the user can see if these configurations are being used by running fi_info.
  2. The api_permitted field is not limited to memory registration. It is also used to prevent the application and libfabric from touching the same resources and negatively impacting the performance.
  3. Adding use_p2p to fi_info can help filter out SHM provider early and allow applications to specify preference for P2P mode early.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These settings are configurations (restrictions) specified by the application, not the provider. They differ from the fi_info attributes in that regard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The application may not care about the implementation details and leave these configurations unspecified, then it is up to the provider to choose whether to use them and return in fi_info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values make no sense for a provider to return: REQUIRED, PREFERRED, etc. These are application restrictions on the provider implementation. If the app doesn't specify a restriction, it doesn't care what the provider does. Except that an admin might... Some restrictions don't even make sense for some providers, except to disable the provider completely.

There are a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.

struct fi_info isn't the place for this. Consider the proposal has a linked list of these restrictions. This is input into fi_getinfo(). What is a provider supposed to do with this list? Pick the one it likes? Apply all of them? How does it resolve conflicts? What does a provider do if it uses shared memory for local communication, where a setting doesn't apply?

Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setop() when done programmatically. We don't want to link all these configuration values off of fi_info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setop() when done programmatically. We don't want to link all these configuration values off of fi_info.

We always prefer a programmatical way to do such configuration instead of environment variables. I explained why setopt is not an ideal place either for such configuration in the thread below.

There are a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.

There are always adhoc or provider specific configuration that you don't want to involve in an API. What we are achieving is to address the common pain points in the FI_HMEM interface that all providers may share by introducing incremental changes to the interface.

Among the 3 attributes we are introducing, use_p2p and api_permitted are something AWS or HPE (@iziemba correct me if wrong) showed interest. use_dev_reg_copy is something on the data transfer details that we can totally cut.


struct fi_info {
struct fi_info *next;
uint64_t caps;
Expand All @@ -481,6 +505,7 @@ struct fi_info {
struct fi_domain_attr *domain_attr;
struct fi_fabric_attr *fabric_attr;
struct fid_nic *nic;
struct fi_hmem_attr *hmem_attr;
};

struct fi_device_attr {
Expand Down Expand Up @@ -771,6 +796,7 @@ enum fi_type {
FI_TYPE_MR_ATTR,
FI_TYPE_CNTR_ATTR,
FI_TYPE_CQ_ERR_ENTRY,
FI_TYPE_HMEM_ATTR,
};

char *fi_tostr(const void *data, enum fi_type datatype);
Expand Down
9 changes: 0 additions & 9 deletions include/rdma/fi_domain.h
Original file line number Diff line number Diff line change
Expand Up @@ -128,15 +128,6 @@ struct fid_mr {
uint64_t key;
};

enum fi_hmem_iface {
FI_HMEM_SYSTEM = 0,
FI_HMEM_CUDA,
FI_HMEM_ROCR,
FI_HMEM_ZE,
FI_HMEM_NEURON,
FI_HMEM_SYNAPSEAI,
};

static inline int fi_hmem_ze_device(int driver_index, int device_index)
{
return driver_index << 16 | device_index;
Expand Down
9 changes: 9 additions & 0 deletions src/abi_1_0.c
Original file line number Diff line number Diff line change
Expand Up @@ -281,6 +281,14 @@ struct fi_domain_attr_1_7 {
size_t max_ep_auth_key;
};

struct fi_hmem_attr_1_7 {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};

#define fi_tx_attr_1_7 fi_tx_attr_1_3
#define fi_rx_attr_1_7 fi_rx_attr_1_3
#define fi_ep_attr_1_7 fi_ep_attr_1_3
Expand All @@ -303,6 +311,7 @@ struct fi_info_1_7 {
struct fi_domain_attr_1_7 *domain_attr;
struct fi_fabric_attr_1_7 *fabric_attr;
struct fid_nic_1_7 *nic;
struct fi_hmem_attr_1_7 *hmem_attr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to this file need to be dropped. These are definitions for the prior ABI structures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

};

#define ofi_dup_attr(dst, src) \
Expand Down