fabric: Add fi_hmem_attr to fi_info #10400
```diff
@@ -360,6 +360,22 @@ enum {
 	FI_TC_NETWORK_CTRL,
 };

+enum fi_hmem_iface {
+	FI_HMEM_SYSTEM = 0,
+	FI_HMEM_CUDA,
+	FI_HMEM_ROCR,
+	FI_HMEM_ZE,
+	FI_HMEM_NEURON,
+	FI_HMEM_SYNAPSEAI,
+};
+
+enum fi_hmem_attr_opt {
+	FI_HMEM_ATTR_UNSPEC = 0,
+	FI_HMEM_ATTR_REQUIRED,
+	FI_HMEM_ATTR_PREFERRED,
+	FI_HMEM_ATTR_DISABLED,
+};
+
 static inline uint32_t fi_tc_dscp_set(uint8_t dscp)
 {
 	return ((uint32_t) dscp) | FI_TC_DSCP;
```
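A point of semantics that comes up in the review below: `FI_HMEM_ATTR_UNSPEC` gives each option a third, "no preference" state that a plain bool cannot express. A minimal provider-side sketch (hypothetical helper, not part of the patch) of how such a tri-state could collapse into a concrete decision:

```c
#include <stdbool.h>

/* Hypothetical helper (illustration only): collapse a tri-state option
 * into a concrete on/off decision, falling back to the provider default
 * when the application left the field unspecified. */
static bool hmem_opt_enabled(enum fi_hmem_attr_opt opt, bool prov_default)
{
	switch (opt) {
	case FI_HMEM_ATTR_REQUIRED:
	case FI_HMEM_ATTR_PREFERRED:
		return true;
	case FI_HMEM_ATTR_DISABLED:
		return false;
	case FI_HMEM_ATTR_UNSPEC:
	default:
		return prov_default;	/* application expressed no preference */
	}
}
```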
```diff
@@ -465,6 +481,14 @@ struct fi_fabric_attr {
 	uint32_t api_version;
 };

+struct fi_hmem_attr {
+	enum fi_hmem_iface iface;
+	enum fi_hmem_attr_opt api_permitted;
+	enum fi_hmem_attr_opt use_p2p;
+	enum fi_hmem_attr_opt use_dev_reg_copy;
+	struct fi_hmem_attr *next;
+};
+
 struct fi_info {
 	struct fi_info *next;
 	uint64_t caps;
```
Comment (on `api_permitted`): This is acting as a bool.

Reply: These fields will be both input and output. Since a boolean can represent either false or an unspecified value, FI_HMEM_ATTR_UNSPEC is added to differentiate them.

Reply: Either the provider can call the GPU API or not; there's not a third state here. These settings are dictating to the provider how it must implement data transfers to/from GPU buffers. For some providers, it means the provider cannot support GPU buffers at all. (There is no PCI peer-to-peer support for TCP or shmem.) There's a significant difference between these variables and other attribute values.

Reply: I think that is part of our goal for putting it in fi_info: it can help with provider filtering. For a provider that doesn't support PCIe peer-to-peer, like shmem, we can use such a configuration to filter it out early at the fi_info level. You remember we have had challenges toggling shmem on/off inside the EFA provider without using environment variables. You believed shm usage is a data-transfer-level decision, so making it an ep-level setopt() makes more sense. The option can be either a general […]. We currently use […]. Making such a toggle as early as the fi_info level can resolve that challenge.

Reply: I agree on this point. If possible, we can consider alternatives that move them to the appropriate attribute groups (like domain/tx/rx/ep_attr) separately.

Reply: I agree having this information up front is useful. I disagree that these are extended attributes; they're something else. Environment variables are the closest thing to what this intends to capture. Imagine two upper libraries (e.g. MPI and NCCL) calling libfabric. These libraries could drive a provider in different directions: NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work. There are global settings at play here, not per-domain or per-endpoint settings. These settings may not even be per provider: NCCL might be using ucx but MPI verbs, yet the restrictions from NCCL need to carry over to both providers.

Reply: This is exactly the problem we are solving. Considering resource management and other factors, making libfabric use CUDA calls in both the control and data-transfer interfaces may cause unexpected risks and overhead for a NCCL application, but we don't have that concern for an MPI application.

Reply: Can you explain the software stack here to me? I know NCCL can use UCX/OFI for network offload via the plugins. Can it also use MPI (via a NCCL-MPI plugin?) for the same purpose? I learned NCCL already warns users that using NCCL and CUDA-aware MPI together is not safe: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi

Reply: I'm referring to a single process using BOTH NCCL and MPI, or a single process accessing libfabric through more than one middleware (NCCL and DAOS, MPI and DAOS, etc.). The point is that these settings aren't per domain or per provider but are global to the process. That is, they are settings which apply to the environment as a whole.

Reply: NCCL doesn't want users to make CUDA-aware MPI calls in the same process because that is already unsafe. So I don't see a reason we cannot let MPI allow libfabric to use CUDA while NCCL does not allow it.
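As a concrete reading of the thread above, here is a sketch of the application-side flow this proposal would enable. Everything besides the patched field names is illustrative; in particular, it assumes fi_allocinfo() would allocate the new hmem_attr substructure the way it does the other attribute structs:

```c
#include <rdma/fabric.h>

/* Illustrative NCCL-style consumer: forbid provider-internal CUDA API
 * calls but require device P2P, letting fi_getinfo() filter out
 * providers (e.g. shmem) that cannot satisfy the restriction. */
static struct fi_info *get_restricted_info(void)
{
	struct fi_info *hints, *info = NULL;

	hints = fi_allocinfo();
	if (!hints)
		return NULL;

	hints->caps = FI_MSG | FI_HMEM;
	hints->hmem_attr->iface = FI_HMEM_CUDA;	/* assumes hmem_attr is allocated */
	hints->hmem_attr->api_permitted = FI_HMEM_ATTR_DISABLED;
	hints->hmem_attr->use_p2p = FI_HMEM_ATTR_REQUIRED;

	if (fi_getinfo(FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
		       NULL, NULL, 0, hints, &info))
		info = NULL;
	fi_freeinfo(hints);
	return info;
}
```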
Comment (on `struct fi_hmem_attr`): I think these are more along the lines of a configuration than an attribute. The app is wanting to configure the provider behavior here, versus discovering provider capabilities. I want to avoid an embedded linked list; it adds complexity, and it's unlikely that a system will realistically have more than one type of hmem installed anyway. My first thought is this should somehow be linked to memory registration, since that's ultimately where the provider is making decisions on how to perform a data transfer. The above configuration settings are basically letting the provider know what sort of optimizations it can perform when a data buffer is located in hmem. Maybe there should be a more involved set of APIs to query and configure memory-registration-related functionality; I can see the MR cache being part of this.

Reply: We were debating that internally. If we are confident there wouldn't be more than one hmem type, we can get rid of the linked-list structure. We started from this implementation for the flexibility of supporting multiple hmem ifaces, and we are open to feedback on that.

Reply: We prefer to add it to fi_info for the following reasons: […]

Reply: These settings are configurations (restrictions) specified by the application, not the provider. They differ from the fi_info attributes in that regard.

Reply: The application may not care about the implementation details and can leave these configurations unspecified; then it is up to the provider to choose whether to use them and return them in fi_info.

Reply: The values make no sense for a provider to return: REQUIRED, PREFERRED, etc. These are application restrictions on the provider implementation. If the app doesn't specify a restriction, it doesn't care what the provider does. Except that an admin might... Some restrictions don't even make sense for some providers, except to disable the provider completely. There is a much larger set of restrictions that an application may need to set: HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive-side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown. struct fi_info isn't the place for this. Consider that the proposal has a linked list of these restrictions as input into fi_getinfo(). What is a provider supposed to do with this list? Pick the one it likes? Apply all of them? How does it resolve conflicts? What does a provider do if it uses shared memory for local communication, where a setting doesn't apply? Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That has typically been done using environment variables, or setopt() when done programmatically. We don't want to link all these configuration values off of fi_info.

Reply: We always prefer a programmatic way to do such configuration instead of environment variables. I explained why […]

Reply: There are always ad hoc or provider-specific configurations that you don't want to bring into an API. What we are achieving is to address the common pain points in the FI_HMEM interface that all providers may share, by introducing incremental changes to the interface. Among the 3 attributes we are introducing, […]
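For contrast with the fi_info approach, the programmatic per-endpoint mechanism mentioned in the thread already exists today: P2P behavior can be set through fi_setopt() using the FI_OPT_FI_HMEM_P2P endpoint option documented in fi_endpoint(3). A sketch, where `ep` is assumed to be an already-opened endpoint:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Sketch of the existing endpoint-level knob: restrict one endpoint to
 * device P2P transfers, instead of passing a process-wide fi_info hint. */
static int require_p2p(struct fid_ep *ep)
{
	int opt = FI_HMEM_P2P_REQUIRED;

	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_FI_HMEM_P2P,
			 &opt, sizeof(opt));
}
```

The trade-off the thread is circling is visible here: setopt() applies after an endpoint exists, so it cannot influence provider selection, which is what the proposed fi_info hints are for.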
```diff
@@ -481,6 +505,7 @@ struct fi_info {
 	struct fi_domain_attr *domain_attr;
 	struct fi_fabric_attr *fabric_attr;
 	struct fid_nic *nic;
+	struct fi_hmem_attr *hmem_attr;
 };

 struct fi_device_attr {
```
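Since the fields are described above as both input and output, a consumer could inspect the returned list to see what each provider settled on. A hypothetical sketch:

```c
#include <stdio.h>
#include <rdma/fabric.h>

/* Illustration only: report each provider's negotiated CUDA handling
 * from the fi_info list returned by fi_getinfo(). */
static void print_cuda_handling(const struct fi_info *info)
{
	for (const struct fi_info *cur = info; cur; cur = cur->next) {
		if (cur->hmem_attr && cur->hmem_attr->iface == FI_HMEM_CUDA)
			printf("%s: use_p2p=%d api_permitted=%d\n",
			       cur->fabric_attr->prov_name,
			       cur->hmem_attr->use_p2p,
			       cur->hmem_attr->api_permitted);
	}
}
```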
```diff
@@ -771,6 +796,7 @@ enum fi_type {
 	FI_TYPE_MR_ATTR,
 	FI_TYPE_CNTR_ATTR,
 	FI_TYPE_CQ_ERR_ENTRY,
+	FI_TYPE_HMEM_ATTR,
 };

 char *fi_tostr(const void *data, enum fi_type datatype);
```
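Assuming the patch also wires FI_TYPE_HMEM_ATTR into fi_tostr()'s dispatch, the new struct would be printable the same way the other attribute structs are dumped for debugging:

```c
/* Sketch: dump the hmem attributes of a returned fi_info entry. */
if (info->hmem_attr)
	printf("%s\n", fi_tostr(info->hmem_attr, FI_TYPE_HMEM_ATTR));
```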
The second file in the diff updates the prior-ABI compatibility definitions:
```diff
@@ -281,6 +281,14 @@ struct fi_domain_attr_1_7 {
 	size_t max_ep_auth_key;
 };

+struct fi_hmem_attr_1_7 {
+	enum fi_hmem_iface iface;
+	enum fi_hmem_attr_opt api_permitted;
+	enum fi_hmem_attr_opt use_p2p;
+	enum fi_hmem_attr_opt use_dev_reg_copy;
+	struct fi_hmem_attr *next;
+};
+
 #define fi_tx_attr_1_7 fi_tx_attr_1_3
 #define fi_rx_attr_1_7 fi_rx_attr_1_3
 #define fi_ep_attr_1_7 fi_ep_attr_1_3
```
```diff
@@ -303,6 +311,7 @@ struct fi_info_1_7 {
 	struct fi_domain_attr_1_7 *domain_attr;
 	struct fi_fabric_attr_1_7 *fabric_attr;
 	struct fid_nic_1_7 *nic;
+	struct fi_hmem_attr_1_7 *hmem_attr;
 };

 #define ofi_dup_attr(dst, src) \
```

Comment: Changes to this file need to be dropped. These are definitions for the prior ABI structures.

Reply: Ack.
Comment (on `enum fi_hmem_attr_opt`): You use the word 'opt' here because there are setopt calls to change some of these values. But these strike me more as configuration settings than attributes; see the discussion above.