add support for H100 NVL 94GB
Signed-off-by: Tariq Ibrahim <[email protected]>
tariq1890 committed May 10, 2024
1 parent b0e8e75 commit bcced06
Showing 1 changed file with 53 additions and 1 deletion.
54 changes: 53 additions & 1 deletion deployments/systemd/config-default.yaml
@@ -151,7 +151,50 @@ mig-configs:
       mig-devices:
         "7g.96gb": 1

-  # H100-80GB, H800-80GB, A100-40GB, A100-80GB, A800-40GB, A800-80GB, A30-24GB, PG506-96GB
+  # H100-94GB
+  all-1g.11gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "1g.11gb": 7
+
+  all-1g.11gb.me:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "1g.11gb+me": 1
+
+  all-1g.22gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "1g.22gb": 4
+
+  all-2g.22gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "2g.22gb": 3
+
+  all-3g.44gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "3g.44gb": 2
+
+  all-4g.44gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "4g.44gb": 1
+
+  all-7g.88gb:
+    - devices: all
+      mig-enabled: true
+      mig-devices:
+        "7g.88gb": 1
+
+  # H100-94GB, H100-80GB, H800-80GB, A100-40GB, A100-80GB, A800-40GB, A800-80GB, A30-24GB, PG506-96GB
   all-balanced:
     # A100-40GB, A800-40GB
     - device-filter: ["0x20B010DE", "0x20B110DE", "0x20F110DE", "0x20F610DE"]
@@ -187,3 +230,12 @@ mig-configs:
         "1g.12gb": 2
         "2g.24gb": 1
         "3g.48gb": 1
+
+    # H100-94GB
+    - device-filter: "0x232110DE"
+      devices: all
+      mig-enabled: true
+      mig-devices:
+        "1g.11gb": 2
+        "2g.22gb": 1
+        "3g.44gb": 1

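The all-balanced entry added for the H100-94GB mixes three profile sizes, so it may help to see the slice arithmetic behind it. The sketch below is illustrative only: `compute_slices` is a hypothetical helper, not part of mig-parted, and it assumes the leading `Ng` in a profile name is the number of compute slices the profile uses out of the 7 available on the GPU. It checks the slice budget only, not whether the GPU actually exposes profiles with those names.

```python
import re

def compute_slices(mig_devices: dict[str, int]) -> int:
    """Sum the compute slices requested by a mig-devices mapping.

    Assumption: the leading "<N>g." in a profile name is its compute-slice count.
    """
    total = 0
    for profile, count in mig_devices.items():
        match = re.match(r"(\d+)g\.", profile)
        if match is None:
            raise ValueError(f"unrecognized profile name: {profile}")
        total += int(match.group(1)) * count
    return total

# The all-balanced entry from this commit for the H100-94GB.
balanced = {"1g.11gb": 2, "2g.22gb": 1, "3g.44gb": 1}
print(compute_slices(balanced))  # 2*1 + 1*2 + 1*3 = 7
```

A total of 7 is necessary for a layout that fills the GPU, but it is not sufficient: memory slices and profile-name support are separate constraints.
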
5 comments on commit bcced06

@vishnukarthikl vishnukarthikl commented on bcced06 Jul 17, 2024

@tariq1890 It doesn't seem like the H100 NVL (device ID 0x2321) supports 1g.11gb and 2g.22gb; it supports 1g.12gb and 2g.24gb instead. I tested this with mig-parted 0.7.0 and NVIDIA driver version 535.183.01.

lspci | grep -i nvidia
4e:00.0 3D controller: NVIDIA Corporation GA103 (rev a1)
62:00.0 3D controller: NVIDIA Corporation GA103 (rev a1)

cat /sys/bus/pci/devices/0000\:4e\:00.0/device 
0x2321
Executing: nvidia-mig-parted apply -f mig-config.yaml -c uniform-1g.11gb && nvidia-smi -L
MIG configuration applied successfully
GPU 0: NVIDIA H100 NVL (UUID: GPU-8f5092f4-7926-3618-f111-95d8a17b949e)
GPU 1: NVIDIA H100 NVL (UUID: GPU-4eb7241e-00ac-c0e8-79c3-e509d0ae1b96)

As you can see, the profiles do not get applied.

Whereas applying 1g.12gb works:

Executing: nvidia-mig-parted apply -f mig-config.yaml -c uniform-1g.12gb && nvidia-smi -L
MIG configuration applied successfully
GPU 0: NVIDIA H100 NVL (UUID: GPU-8f5092f4-7926-3618-f111-95d8a17b949e)
  MIG 1g.12gb     Device  0: (UUID: MIG-a748cb0f-5e0e-507e-9468-9c9b84326bd1)
  MIG 1g.12gb     Device  1: (UUID: MIG-bffdc10b-dc47-5aa0-a380-b76cd96eac9a)
  MIG 1g.12gb     Device  2: (UUID: MIG-bfc3ff62-3cc9-50bc-9ea6-db6bbf55ca03)
  MIG 1g.12gb     Device  3: (UUID: MIG-5a02f0b2-2382-5c80-ac40-081d2994fcb4)
  MIG 1g.12gb     Device  4: (UUID: MIG-ffe0e056-d045-54c3-b84e-c23b9c301d1a)
  MIG 1g.12gb     Device  5: (UUID: MIG-aff9de05-d08c-5597-855f-a76794cef672)
  MIG 1g.12gb     Device  6: (UUID: MIG-502fa892-c6a0-5074-8509-223c96d16125)
GPU 1: NVIDIA H100 NVL (UUID: GPU-4eb7241e-00ac-c0e8-79c3-e509d0ae1b96)

Also, the mig-manager example uses the correct profiles:

- device-filter: ["0x232110DE", "0x233A10DE"]
  devices: all
  mig-enabled: true
  mig-devices:
    "1g.12gb": 1
    "2g.24gb": 1
    "3g.47gb": 1
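
Both runs above print "MIG configuration applied successfully", so the only visible difference is what `nvidia-smi -L` lists under each GPU. A minimal sketch of checking that listing programmatically follows. `mig_devices_per_gpu` is a hypothetical helper, and the regular expressions assume the `nvidia-smi -L` layout shown in this comment, which may differ across driver versions.

```python
import re

def mig_devices_per_gpu(listing: str) -> dict[int, list[str]]:
    """Map each GPU index to the MIG profile names listed beneath it."""
    devices: dict[int, list[str]] = {}
    current = None
    for line in listing.splitlines():
        gpu = re.match(r"GPU (\d+):", line)
        if gpu:
            current = int(gpu.group(1))
            devices[current] = []
            continue
        # MIG device lines are indented under their parent GPU line.
        mig = re.match(r"\s+MIG (\S+)\s+Device", line)
        if mig and current is not None:
            devices[current].append(mig.group(1))
    return devices

# Abbreviated sample taken from the output in this comment.
sample = "\n".join([
    "GPU 0: NVIDIA H100 NVL (UUID: GPU-8f5092f4-7926-3618-f111-95d8a17b949e)",
    "  MIG 1g.12gb     Device  0: (UUID: MIG-a748cb0f-5e0e-507e-9468-9c9b84326bd1)",
    "  MIG 1g.12gb     Device  1: (UUID: MIG-bffdc10b-dc47-5aa0-a380-b76cd96eac9a)",
    "GPU 1: NVIDIA H100 NVL (UUID: GPU-4eb7241e-00ac-c0e8-79c3-e509d0ae1b96)",
])
print(mig_devices_per_gpu(sample))  # {0: ['1g.12gb', '1g.12gb'], 1: []}
```

An empty list for a GPU that was expected to be partitioned is the signal that the requested profile was not created.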

@tariq1890
Contributor Author

Hey @vishnukarthikl, thanks for bringing this to our attention. For context, I added that MIG config based on the configuration referenced in this doc.

You'll see a table under "The table below shows the supported profiles on the H100 94GB product (PCIe and SXM5)", and the MIG configs listed there are 1g.11gb, 1g.22gb, 2g.22gb, and so on. It looks like the configs that ended up working for you are the ones listed for the H100 96GB variants in the MIG reference documentation. We will check with the MIG team and get back to you.

@tariq1890
Contributor Author

Hey @vishnukarthikl, can you share the output of nvidia-smi mig -lgip here?

@vishnukarthikl

> Hey @vishnukarthikl, can you share the output of nvidia-smi mig -lgip here?

nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
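
The table above lists the profile names this GPU actually accepts, which is what the config entries need to match. As a rough sketch, the names can be pulled out of the `nvidia-smi mig -lgip` output and compared with the profiles a config requests. `supported_profiles` and `unsupported` are hypothetical helpers, and the pattern assumes each profile row contains `MIG <name>` as in the output shown here.

```python
import re

def supported_profiles(lgip_output: str) -> set[str]:
    """Collect profile names from rows of the form '|   0  MIG <name> ...'."""
    return set(re.findall(r"MIG (\S+)", lgip_output))

def unsupported(requested: list[str], lgip_output: str) -> list[str]:
    """Return the requested profiles that the GPU does not report."""
    return sorted(set(requested) - supported_profiles(lgip_output))

# Profile rows copied from the table in this comment.
lgip = "\n".join([
    "|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |",
    "|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |",
    "|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |",
    "|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |",
    "|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |",
    "|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |",
    "|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |",
])

# Profiles requested by the H100-94GB all-balanced entry in this commit.
print(unsupported(["1g.11gb", "2g.22gb", "3g.44gb"], lgip))
# ['1g.11gb', '2g.22gb', '3g.44gb']
```

Run against the names the GPU does report (`1g.12gb`, `2g.24gb`, `3g.47gb`), the same check returns an empty list.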

@tariq1890
Contributor Author

Thanks, @vishnukarthikl! I have created a new PR to fix this: #101
