fabric: Add fi_hmem_attr to fi_info #10400

Closed
wants to merge 5 commits into from
3 changes: 3 additions & 0 deletions include/ofi_util.h
@@ -1117,6 +1117,9 @@ int ofi_check_rx_attr(const struct fi_provider *prov,
int ofi_check_tx_attr(const struct fi_provider *prov,
const struct fi_tx_attr *prov_attr,
const struct fi_tx_attr *user_attr, uint64_t info_mode);
int ofi_check_hmem_attr(const struct fi_provider *prov,
const struct fi_hmem_attr *prov_attr,
const struct fi_info *user_info);
int ofi_check_attr_subset(const struct fi_provider *prov,
uint64_t base_caps, uint64_t requested_caps);
int ofi_prov_check_info(const struct util_prov *util_prov,
26 changes: 26 additions & 0 deletions include/rdma/fabric.h
@@ -360,6 +360,22 @@ enum {
FI_TC_NETWORK_CTRL,
};

enum fi_hmem_iface {
FI_HMEM_SYSTEM = 0,
FI_HMEM_CUDA,
FI_HMEM_ROCR,
FI_HMEM_ZE,
FI_HMEM_NEURON,
FI_HMEM_SYNAPSEAI,
};

enum fi_hmem_attr_opt {
Member

You use the word 'opt' here because there are setopt calls to change some of these values. But these strike me more as configuration settings than attributes. See below.

FI_HMEM_ATTR_UNSPEC = 0,
FI_HMEM_ATTR_REQUIRED,
FI_HMEM_ATTR_PREFERRED,
FI_HMEM_ATTR_DISABLED,
};

static inline uint32_t fi_tc_dscp_set(uint8_t dscp)
{
return ((uint32_t) dscp) | FI_TC_DSCP;
@@ -465,6 +481,14 @@ struct fi_fabric_attr {
uint32_t api_version;
};

struct fi_hmem_attr {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
Member

This is acting as a bool.

Contributor Author

These fields are both input and output. Because a boolean cannot distinguish false from an unspecified value, FI_HMEM_ATTR_UNSPEC is added to differentiate them.
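To make the input/output point concrete, here is a minimal sketch of why reserving 0 for "unspecified" matters when hints are zero-initialised. The enum is a local copy of the one proposed in this PR so the snippet builds without libfabric headers; `struct demo_hints`, `demo_hints_zeroed()` and `opt_is_specified()` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Local copy of the enum proposed in this PR. */
enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

/* Hypothetical hints holder: a bool next to the enum-valued field. */
struct demo_hints {
	bool bool_api_permitted;		/* false: "no" or "never set"? */
	enum fi_hmem_attr_opt api_permitted;	/* 0 always means "never set" */
};

/* Hypothetical helper: did the caller express any preference? */
static bool opt_is_specified(enum fi_hmem_attr_opt opt)
{
	return opt != FI_HMEM_ATTR_UNSPEC;
}

/* Returns hints as an application would have them after zeroing. */
static struct demo_hints demo_hints_zeroed(void)
{
	struct demo_hints h;

	memset(&h, 0, sizeof(h));
	return h;
}
```

With the bool, a zeroed structure is indistinguishable from an explicit "not permitted"; with the enum, the provider can tell that the caller left the field alone.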

Member

Either the provider can call the GPU API or not. There's not a third state here.

These settings are dictating to the provider how it must implement data transfers to/from GPU buffers. For some providers, it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.) There's a significant difference between these variables and other attribute values.

Contributor

> it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.)

I think that is part of our goal in putting it in fi_info: it helps with provider filtering. For a provider that doesn't support PCIe peer-to-peer, like shmem, we can use this configuration to filter it out early, at the fi_info stage. You may remember we had trouble toggling shmem on and off inside the EFA provider without using environment variables. You believed shm usage is a data-transfer-level decision, so making it an ep-level setopt() made more sense. The option could be either a general FI_OPT_SHARED_MEMORY_PERMITTED or FI_OPT_HMEM_P2P.

We currently use FI_OPT_SHARED_MEMORY_PERMITTED, but a toggle at the ep level is still too late for us because by then we have already created the shm info/domain/av/cq/mr. Tearing all of them down in an ep call is troublesome and error-prone.

Making the toggle available as early as the fi_info level resolves this.

> There's a significant difference between these variables and other attribute values.

I agree on this point. If possible, we can consider alternatives to move them to appropriate attribute groups (like domain/tx/rx/ep_attr) separately.

Member

I agree having this information up front is useful. I disagree that these are extended attributes. They're something else. Environment variables are the closest thing to what this intends to capture.

Imagine 2 upper libraries (e.g. MPI and NCCL) calling libfabric. These libraries could drive a provider in different directions. NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work. There are global settings at play here, not per domain or endpoint settings. These settings may not even be per provider. NCCL might be using ucx, but MPI verbs, yet the restrictions from NCCL need to carry over both providers.

Contributor

> These libraries could drive a provider in different directions. NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work.

This is exactly the problem we are solving. Given resource management and other factors, having libfabric make CUDA calls in both the control and data transfer interfaces for an NCCL application may introduce unexpected risk and overhead. We don't have that concern for MPI applications.

> NCCL might be using ucx, but MPI verbs, yet the restrictions from NCCL need to carry over both providers.

Can you explain the software stack here? I know NCCL can use UCX/OFI for network offload via the plugins. Can it also use MPI (via an NCCL-MPI plugin?) for the same purpose?

I learned NCCL is already warning users that using NCCL and CUDA aware MPI together is not safe: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi

Member

I'm referring to a single process using BOTH NCCL and MPI. Or a single process accessing libfabric through more than 1 middleware (NCCL and DAOS, MPI and DAOS, etc.). The point is that these settings aren't per domain or per provider but are global to the process. That is, they are settings which apply to the environment as a whole.

Contributor

@shijin-aws shijin-aws Sep 20, 2024

NCCL already discourages using CUDA-aware MPI calls in the same process because it is not safe. So I don't see why we cannot let MPI allow libfabric to use CUDA while NCCL doesn't.

enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};
Member

I think these are more along the lines of a configuration than an attribute. The app is wanting to configure the provider behavior here, versus discovering provider capabilities.

I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.

My first thought is this should somehow be linked to memory registration, since that's ultimately where the provider is making decisions on how to perform a data transfer. And the above configuration settings are basically letting the provider know what sort of optimizations it can perform when a data buffer is located in hmem.

Maybe there should be a more involved set of APIs to query and configure memory registration related functionality. I can see the MR cache being part of this.

Contributor

@shijin-aws shijin-aws Sep 19, 2024

> I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.

We debated that internally. If we are confident there won't be more than one hmem type, we can drop the linked list. We started with this implementation for the flexibility of supporting multiple hmem ifaces, and we are open to feedback.

Contributor Author

@jiaxiyan jiaxiyan Sep 19, 2024

We prefer to add it to fi_info for the following reasons:

  1. The user can see whether these configurations are in use by running fi_info.
  2. The api_permitted field is not limited to memory registration. It is also used to keep the application and libfabric from touching the same resources and hurting performance.
  3. Adding use_p2p to fi_info helps filter out the SHM provider early and lets applications state their P2P preference early.
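The early-filtering argument in point 3 can be sketched as follows. This is only an illustration under stated assumptions: the enum is a local copy of the PR's proposal, `struct demo_prov` and the `demo_*` functions are hypothetical, and the values attributed to the "shm" and "efa" entries in the usage note are made up for the example rather than taken from either provider.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the enum proposed in this PR. */
enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

/* Hypothetical stand-in for a provider's advertised P2P setting. */
struct demo_prov {
	const char *name;
	enum fi_hmem_attr_opt use_p2p;
};

/* Mirrors the rule the PR applies when the user asks for P2P:
 * any provider that does not report DISABLED is acceptable. */
static bool demo_prov_allows_p2p(const struct demo_prov *prov)
{
	return prov->use_p2p != FI_HMEM_ATTR_DISABLED;
}

/* Copy pointers to acceptable providers into out[]; return the count. */
static size_t demo_filter_p2p(const struct demo_prov *provs, size_t n,
			      const struct demo_prov **out)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (demo_prov_allows_p2p(&provs[i]))
			out[kept++] = &provs[i];
	}
	return kept;
}
```

Given a table with a "shm" entry marked FI_HMEM_ATTR_DISABLED and an "efa" entry marked FI_HMEM_ATTR_UNSPEC, only the "efa" entry survives the filter, before any domain, AV, CQ or MR has been created for the rejected provider.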

Member

These settings are configurations (restrictions) specified by the application, not the provider. They differ from the fi_info attributes in that regard.

Contributor Author

The application may not care about the implementation details and may leave these configurations unspecified. It is then up to the provider to choose whether to use them and to report its choice in fi_info.

Member

The values make no sense for a provider to return: REQUIRED, PREFERRED, etc. These are application restrictions on the provider implementation. If the app doesn't specify a restriction, it doesn't care what the provider does. Except that an admin might... Some restrictions don't even make sense for some providers, except to disable the provider completely.

There is a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.

struct fi_info isn't the place for this. Consider that the proposal has a linked list of these restrictions. This is input into fi_getinfo(). What is a provider supposed to do with this list? Pick the one it likes? Apply all of them? How does it resolve conflicts? What does a provider do if it uses shared memory for local communication, where a setting doesn't apply?
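The conflict question can be made concrete. As far as the posted diff shows, each hint entry is validated on its own against the provider's list, so two entries for the same iface that contradict each other (one REQUIRED, one DISABLED) would both pass against a provider reporting FI_HMEM_ATTR_UNSPEC. A minimal sketch of a detector for that case follows; the types are local copies of the PR's proposal (with the iface enum truncated), and `opts_conflict()` and `hmem_hints_conflict()` are hypothetical helpers that do not exist in the PR.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local, truncated copies of the types proposed in this PR. */
enum fi_hmem_iface { FI_HMEM_SYSTEM = 0, FI_HMEM_CUDA };

enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

struct fi_hmem_attr {
	enum fi_hmem_iface iface;
	enum fi_hmem_attr_opt api_permitted;
	enum fi_hmem_attr_opt use_p2p;
	enum fi_hmem_attr_opt use_dev_reg_copy;
	struct fi_hmem_attr *next;
};

/* Hypothetical rule: a demand and a prohibition contradict each other. */
static bool opts_conflict(enum fi_hmem_attr_opt a, enum fi_hmem_attr_opt b)
{
	return (a == FI_HMEM_ATTR_REQUIRED && b == FI_HMEM_ATTR_DISABLED) ||
	       (a == FI_HMEM_ATTR_DISABLED && b == FI_HMEM_ATTR_REQUIRED);
}

/* Hypothetical check: true if two entries for one iface contradict. */
static bool hmem_hints_conflict(const struct fi_hmem_attr *head)
{
	const struct fi_hmem_attr *a, *b;

	for (a = head; a; a = a->next) {
		for (b = a->next; b; b = b->next) {
			if (a->iface != b->iface)
				continue;
			if (opts_conflict(a->api_permitted, b->api_permitted) ||
			    opts_conflict(a->use_p2p, b->use_p2p) ||
			    opts_conflict(a->use_dev_reg_copy,
					  b->use_dev_reg_copy))
				return true;
		}
	}
	return false;
}
```

Detecting the conflict is the easy part; the sketch deliberately does not answer which entry should win, which is the open design question raised above.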

Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setopt() when done programmatically. We don't want to link all these configuration values off of fi_info.

Contributor

> Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setopt() when done programmatically. We don't want to link all these configuration values off of fi_info.

We always prefer a programmatic way to do such configuration over environment variables. I explained in the thread below why setopt is not an ideal place for such configuration either.

> There is a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.

There is always ad hoc or provider-specific configuration that you don't want in an API. What we are trying to do is address the common pain points in the FI_HMEM interface that all providers may share, by introducing incremental changes to the interface.

Of the 3 attributes we are introducing, use_p2p and api_permitted are ones that AWS or HPE (@iziemba, correct me if I'm wrong) have shown interest in. use_dev_reg_copy concerns data transfer details, and we can cut it entirely.


struct fi_info {
struct fi_info *next;
uint64_t caps;
@@ -481,6 +505,7 @@ struct fi_info {
struct fi_domain_attr *domain_attr;
struct fi_fabric_attr *fabric_attr;
struct fid_nic *nic;
struct fi_hmem_attr *hmem_attr;
};

struct fi_device_attr {
@@ -771,6 +796,7 @@ enum fi_type {
FI_TYPE_MR_ATTR,
FI_TYPE_CNTR_ATTR,
FI_TYPE_CQ_ERR_ENTRY,
FI_TYPE_HMEM_ATTR,
};

char *fi_tostr(const void *data, enum fi_type datatype);
9 changes: 0 additions & 9 deletions include/rdma/fi_domain.h
@@ -128,15 +128,6 @@ struct fid_mr {
uint64_t key;
};

enum fi_hmem_iface {
FI_HMEM_SYSTEM = 0,
FI_HMEM_CUDA,
FI_HMEM_ROCR,
FI_HMEM_ZE,
FI_HMEM_NEURON,
FI_HMEM_SYNAPSEAI,
};

static inline int fi_hmem_ze_device(int driver_index, int device_index)
{
return driver_index << 16 | device_index;
5 changes: 5 additions & 0 deletions man/fabric.7.md
@@ -455,6 +455,11 @@ Added new fields to the following attributes:
*fi_domain_attr*
: Added max_group_id

*fi_info*
: The fi_info structure was expanded to reference a new fabric object,
fi_hmem_attr. When available, the fi_hmem_attr references a new set of
attributes related to heterogeneous memory.

# SEE ALSO

[`fi_info`(1)](fi_info.1.html),
3 changes: 3 additions & 0 deletions man/fi_fabric.3.md
@@ -177,6 +177,9 @@ datatype or field value.
*FI_TYPE_LOG_SUBSYS*
: enum fi_log_subsys

*FI_TYPE_HMEM_ATTR*
: struct fi_hmem_attr

fi_tostr() will return a pointer to an internal libfabric buffer that
should not be modified, and will be overwritten the next time
fi_tostr() is invoked. fi_tostr() is not thread safe.
68 changes: 68 additions & 0 deletions man/fi_getinfo.3.md
@@ -143,6 +143,7 @@ struct fi_info {
struct fi_domain_attr *domain_attr;
struct fi_fabric_attr *fabric_attr;
struct fid_nic *nic;
struct fi_hmem_attr *hmem_attr;
};
```

@@ -249,6 +250,73 @@ struct fi_info {
closely associated with a hardware NIC. See
[`fi_nic`(3)](fi_nic.3.html) for details.

*hmem_attr - heterogeneous memory attributes*
: Optionally supplied HMEM attributes. HMEM attributes may be
specified and returned as part of fi_getinfo. When provided as
hints, requested values of struct fi_hmem_attr should be set. On
output, the actual HMEM attributes that can be provided will be
returned.

## HMEM ATTRIBUTES

```c
enum fi_hmem_attr_opt {
FI_HMEM_ATTR_UNSPEC,
FI_HMEM_ATTR_REQUIRED,
FI_HMEM_ATTR_PREFERRED,
FI_HMEM_ATTR_DISABLED
};

struct fi_hmem_attr {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};
```
- *fi_hmem_attr_opt - int*
: Defines how the provider should handle HMEM attributes for an interface.
By default, the provider will choose whether to use the attributes
(FI_HMEM_ATTR_UNSPEC).
Valid values defined in fabric.h are:
* FI_HMEM_ATTR_UNSPEC: The attribute may be used by the provider
and is subject to the provider implementation.
* FI_HMEM_ATTR_REQUIRED: The attribute must be used for this interface;
operations that cannot be performed will be reported as failing.
* FI_HMEM_ATTR_PREFERRED: The attribute should be used by the
provider if available, but the provider may choose another implementation
if it is unavailable.
* FI_HMEM_ATTR_DISABLED: The attribute should not be used.

- *iface*

Indicates the software interfaces used by the application, details in
[`fi_mr`(3)](fi_mr.3.html)

- *api_permitted*

Controls whether libfabric is allowed to make device-specific API calls.
By default, libfabric is permitted to call device-specific APIs (e.g. the CUDA API).
To prohibit libfabric from making such calls, set this field to
FI_HMEM_ATTR_DISABLED.
The endpoint setopt option FI_OPT_CUDA_API_PERMITTED takes precedence
over this attribute when api_permitted is not disabled.

- *use_p2p*

Controls whether peer to peer FI_HMEM transfers should be used.
The FI_OPT_FI_HMEM_P2P setopt option discussed in
[`fi_endpoint`(3)](fi_endpoint.3.html) takes precedence over this attribute.

- *use_dev_reg_copy*

Controls whether optimized memcpy for device memory is used, e.g. GDR copy.

- *next*

Pointer to the next fi_hmem_attr if using multiple non-system iface.
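One possible reading of the api_permitted precedence rule described in this hunk, written out as a decision function. This is a sketch of an interpretation, not the PR's implementation: the enum is a local copy of the proposal, and `struct demo_ep_opt` and `demo_effective_api_permitted()` are hypothetical names standing in for whatever state an endpoint keeps after a setopt call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the enum proposed in this PR. */
enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

/* Hypothetical record of an endpoint-level setopt for API permission. */
struct demo_ep_opt {
	bool is_set;	/* was the option ever set on this endpoint? */
	bool value;	/* value passed to setopt, valid when is_set */
};

/* Assumed interpretation of the documented precedence. */
static bool demo_effective_api_permitted(enum fi_hmem_attr_opt attr,
					 const struct demo_ep_opt *opt)
{
	if (attr == FI_HMEM_ATTR_DISABLED)
		return false;		/* attribute wins when disabled */
	if (opt && opt->is_set)
		return opt->value;	/* otherwise setopt takes precedence */
	return true;			/* documented default: permitted */
}
```

Under this reading, an endpoint can further restrict device API use after fi_getinfo, but cannot re-enable it once the attribute has disabled it.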

# CAPABILITIES

Interface capabilities are obtained by OR-ing the following flags
5 changes: 5 additions & 0 deletions man/fi_info.1.md
@@ -216,6 +216,11 @@ fi_info:
speed: 0
state: FI_LINK_UP
network_type: InfiniBand
fi_hmem_attr:
iface: FI_HMEM_SYSTEM
api_permitted: FI_HMEM_ATTR_UNSPEC
use_p2p: FI_HMEM_ATTR_UNSPEC
use_dev_reg_copy: FI_HMEM_ATTR_UNSPEC
```

To see libfabric related environment variables, use the `-e` option.
103 changes: 103 additions & 0 deletions prov/util/src/util_attr.c
@@ -1002,6 +1002,83 @@ int ofi_check_tx_attr(const struct fi_provider *prov,
return 0;
}

static bool ofi_compare_hmem_attr_opt(enum fi_hmem_attr_opt prov_opt,
				      enum fi_hmem_attr_opt user_opt)
{
	switch (user_opt) {
	case FI_HMEM_ATTR_UNSPEC:
		return true;
	case FI_HMEM_ATTR_REQUIRED:
	case FI_HMEM_ATTR_PREFERRED:
		return prov_opt != FI_HMEM_ATTR_DISABLED;
	case FI_HMEM_ATTR_DISABLED:
		return prov_opt != FI_HMEM_ATTR_REQUIRED;
	default:
		return false;
	}
}

static int
ofi_validate_hmem_attr_compat(const struct fi_provider *prov,
			      const struct fi_hmem_attr *prov_attr_head,
			      const struct fi_hmem_attr *user_attr)
{
	const struct fi_hmem_attr *prov_attr = prov_attr_head;

	while (prov_attr) {
		if (prov_attr->iface == user_attr->iface) {
			if (!ofi_compare_hmem_attr_opt(
				    prov_attr->api_permitted,
				    user_attr->api_permitted)) {
				FI_INFO(prov, FI_LOG_CORE,
					"api_permitted option not supported\n");
				return -FI_ENODATA;
			}

			if (!ofi_compare_hmem_attr_opt(
				    prov_attr->use_p2p,
				    user_attr->use_p2p)) {
				FI_INFO(prov, FI_LOG_CORE,
					"use_p2p option not supported\n");
				return -FI_ENODATA;
			}

			if (!ofi_compare_hmem_attr_opt(
				    prov_attr->use_dev_reg_copy,
				    user_attr->use_dev_reg_copy)) {
				FI_INFO(prov, FI_LOG_CORE,
					"use_dev_reg_copy option not supported\n");
				return -FI_ENODATA;
			}

			return 0;
		}
		prov_attr = prov_attr->next;
	}

	return -FI_ENODATA;
}

int ofi_check_hmem_attr(const struct fi_provider *prov,
			const struct fi_hmem_attr *prov_attr,
			const struct fi_info *user_info)
{
	struct fi_hmem_attr *user_attr = user_info->hmem_attr;

	if (!(user_info->caps & FI_HMEM)) {
		FI_INFO(prov, FI_LOG_CORE, "FI_HMEM not set\n");
		return -FI_ENODATA;
	}

	while (user_attr) {
		if (ofi_validate_hmem_attr_compat(prov, prov_attr, user_attr) < 0)
			return -FI_ENODATA;
		user_attr = user_attr->next;
	}

	return 0;
}
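To see what the comparison above accepts, the following sketch tabulates every provider/user combination. `compare_hmem_attr_opt()` is a local copy of the PR's `ofi_compare_hmem_attr_opt()` logic and the enum is redeclared so the snippet builds standalone; `fill_compat_table()` is a hypothetical helper added for illustration.

```c
#include <stdbool.h>

/* Local copy of the enum proposed in this PR. */
enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

/* Local copy of the comparison logic from the diff above. */
static bool compare_hmem_attr_opt(enum fi_hmem_attr_opt prov_opt,
				  enum fi_hmem_attr_opt user_opt)
{
	switch (user_opt) {
	case FI_HMEM_ATTR_UNSPEC:
		return true;
	case FI_HMEM_ATTR_REQUIRED:
	case FI_HMEM_ATTR_PREFERRED:
		return prov_opt != FI_HMEM_ATTR_DISABLED;
	case FI_HMEM_ATTR_DISABLED:
		return prov_opt != FI_HMEM_ATTR_REQUIRED;
	default:
		return false;
	}
}

/* Hypothetical helper: table[prov][user] is true when the pair matches. */
static void fill_compat_table(bool table[4][4])
{
	for (int p = 0; p < 4; p++) {
		for (int u = 0; u < 4; u++)
			table[p][u] = compare_hmem_attr_opt(
				(enum fi_hmem_attr_opt) p,
				(enum fi_hmem_attr_opt) u);
	}
}
```

Two properties fall out of the table. REQUIRED and PREFERRED are treated identically, so a user's PREFERRED is rejected when the provider reports DISABLED, which appears stricter than the man page wording that the provider "may choose" another implementation. And a user's REQUIRED is accepted when the provider reports UNSPEC, so a match does not by itself guarantee the feature is used.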

/* Use if there are multiple fi_info in the provider:
* check provider's info */
int ofi_prov_check_info(const struct util_prov *util_prov,
@@ -1152,6 +1229,13 @@ int ofi_check_info(const struct util_prov *util_prov,
if (ret)
return ret;
}

if (user_info->hmem_attr) {
ret = ofi_check_hmem_attr(prov, prov_info->hmem_attr, user_info);
if (ret)
return ret;
}

return 0;
}

@@ -1271,6 +1355,24 @@ static void fi_alter_tx_attr(struct fi_tx_attr *attr,
attr->rma_iov_limit = hints->rma_iov_limit;
}

static void fi_alter_hmem_attr(struct fi_hmem_attr *attr,
			       const struct fi_hmem_attr *hints)
{
	if (!hints)
		return;

	if (hints->iface)
		attr->iface = hints->iface;
	if (hints->api_permitted)
		attr->api_permitted = hints->api_permitted;
	if (hints->use_p2p)
		attr->use_p2p = hints->use_p2p;
	if (hints->use_dev_reg_copy)
		attr->use_dev_reg_copy = hints->use_dev_reg_copy;
	if (hints->next)
		attr->next = hints->next;
}
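Note that `fi_alter_hmem_attr` stores `hints->next` directly into the returned attribute, so the output list would share nodes with the caller's hints. Whether that causes trouble depends on how the lists are freed, which this diff does not show. One alternative is to duplicate the chain; the sketch below assumes heap-allocated nodes and uses hypothetical helpers `dup_hmem_attr_list()` and `free_hmem_attr_list()` that are not part of the PR. The types are local, truncated copies of the PR's proposal.

```c
#include <stddef.h>
#include <stdlib.h>

/* Local, truncated copies of the types proposed in this PR. */
enum fi_hmem_iface { FI_HMEM_SYSTEM = 0, FI_HMEM_CUDA };

enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC = 0,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};

struct fi_hmem_attr {
	enum fi_hmem_iface iface;
	enum fi_hmem_attr_opt api_permitted;
	enum fi_hmem_attr_opt use_p2p;
	enum fi_hmem_attr_opt use_dev_reg_copy;
	struct fi_hmem_attr *next;
};

/* Hypothetical: free a list whose nodes were all allocated with malloc. */
static void free_hmem_attr_list(struct fi_hmem_attr *head)
{
	while (head) {
		struct fi_hmem_attr *next = head->next;

		free(head);
		head = next;
	}
}

/* Hypothetical: deep-copy a list; returns NULL on empty input or on
 * allocation failure (partial copies are released before returning). */
static struct fi_hmem_attr *
dup_hmem_attr_list(const struct fi_hmem_attr *src)
{
	struct fi_hmem_attr *head = NULL;
	struct fi_hmem_attr **tail = &head;

	for (; src; src = src->next) {
		struct fi_hmem_attr *node = malloc(sizeof(*node));

		if (!node) {
			free_hmem_attr_list(head);
			return NULL;
		}
		*node = *src;		/* copy the scalar fields */
		node->next = NULL;	/* never alias the source chain */
		*tail = node;
		tail = &node->next;
	}
	return head;
}
```

With a copy, later changes to the application's hints no longer affect the returned fi_info, and each list can be released by its owner without coordinating with the other.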

static uint64_t ofi_get_info_caps(const struct fi_info *prov_info,
const struct fi_info *user_info,
uint32_t api_version)
@@ -1336,5 +1438,6 @@ void ofi_alter_info(struct fi_info *info, const struct fi_info *hints,
info->caps);
fi_alter_tx_attr(info->tx_attr, hints ? hints->tx_attr : NULL,
info->caps);
fi_alter_hmem_attr(info->hmem_attr, hints ? hints->hmem_attr : NULL);
}
}
9 changes: 9 additions & 0 deletions src/abi_1_0.c
@@ -281,6 +281,14 @@ struct fi_domain_attr_1_7 {
size_t max_ep_auth_key;
};

struct fi_hmem_attr_1_7 {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};

#define fi_tx_attr_1_7 fi_tx_attr_1_3
#define fi_rx_attr_1_7 fi_rx_attr_1_3
#define fi_ep_attr_1_7 fi_ep_attr_1_3
@@ -303,6 +311,7 @@ struct fi_info_1_7 {
struct fi_domain_attr_1_7 *domain_attr;
struct fi_fabric_attr_1_7 *fabric_attr;
struct fid_nic_1_7 *nic;
struct fi_hmem_attr_1_7 *hmem_attr;
Member

Changes to this file need to be dropped. These are definitions for the prior ABI structures.

Contributor Author

Ack

};

#define ofi_dup_attr(dst, src) \