
[linux-nvidia-internal-6.11][Backport] GPU passthrough cuda support #52

Open
wants to merge 71 commits into base: 24.04_linux-nvidia-internal-6.11-next
Conversation

KobaKoNvidia
Collaborator

[Description]
Backport patches from [0] [1] [2] to enable GPU passthrough for CUDA.

[0] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=55:57
[1] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=60:61
[2] https://git-master.nvidia.com/r/plugins/gitiles/linux-stable/+log/refs/heads/dev/nic/iommufd_vsmmu-12122024

[Test plan]

  1. Boot up the host.
  2. Boot up VMs on the host with 1 GPU, 2 GPUs, 3 GPUs, and 4 GPUs.
  3. Run the following basic checks [3]:
# Get the GPU devices
$ lspci | grep 3D
# Show GPU info
$ nvidia-smi
# The following tests must pass
$ /root/r570/tests/runtime/gflops/gflops
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t texture_simple
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
  4. Check the host's dmesg.

[Misc]

  1. Passed arm64 and amd64 builds in Noble. [5]

[3], [5]: logs for VMs, host's dmesg, and build logs:
https://drive.google.com/drive/folders/1bJYyfSoIR_BmtW20BWp178WXX8tOXhHo?usp=sharing

nicolinc and others added 30 commits January 22, 2025 16:48
Prepare for an embedded structure design for driver-level iommufd_viommu
objects:
    // include/linux/iommufd.h
    struct iommufd_viommu {
        struct iommufd_object obj;
        ....
    };

    // Some IOMMU driver
    struct iommu_driver_viommu {
        struct iommufd_viommu core;
        ....
    };

It has to expose struct iommufd_object and enum iommufd_object_type from
the core-level private header to the public iommufd header.

Link: https://patch.msgid.link/r/54a43b0768089d690104530754f499ca05ce0074.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d1b3dad linux)
Signed-off-by: Koba Ko <[email protected]>
The following patch will add a new vIOMMU allocator that will require this
_iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too).

Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE
selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that remaining
to be selectable for drivers to build the existing iova_bitmap.c file.

Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7d4f46c linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent
a slice of physical IOMMU device passed to or shared with a user space VM.
This slice, now a vIOMMU object, is a group of virtualization resources of
a physical IOMMU, such as:
 - Security namespace for guest owned ID, e.g. guest-controlled cache tags
 - Non-device-affiliated event reporting, e.g. invalidation queue errors
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platform IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Direct assigned invalidation queues
 - Direct assigned interrupts

Add a new viommu_alloc op in iommu_ops, for drivers to allocate their own
vIOMMU structures. And this allocation also needs a free(), so add struct
iommufd_viommu_ops.

To simplify a vIOMMU allocation, provide an iommufd_viommu_alloc() helper.
It's suggested that a driver embed a core-level viommu structure in its
driver-level viommu struct and call the iommufd_viommu_alloc() helper,
while also implementing its viommu ops:
    struct my_driver_viommu {
        struct iommufd_viommu core;
        /* driver-owned properties/features */
        ....
    };

    static const struct iommufd_viommu_ops my_driver_viommu_ops = {
        .free = my_driver_viommu_free,
        /* future ops for virtualization features */
        ....
    };

    static struct iommufd_viommu *my_driver_viommu_alloc(...)
    {
        struct my_driver_viommu *my_viommu =
                iommufd_viommu_alloc(ictx, my_driver_viommu, core,
                                     my_driver_viommu_ops);
        /* Init my_viommu and related HW feature */
        ....
        return &my_viommu->core;
    }

    static struct iommu_ops my_driver_iommu_ops = {
        ....
        .viommu_alloc = my_driver_viommu_alloc,
    };

Link: https://patch.msgid.link/r/64685e2b79dea0f1dc56f6ede04809b72d578935.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6b22d56 linux)
Signed-off-by: Koba Ko <[email protected]>
To support driver-allocated vIOMMU objects, it's required for an IOMMU
driver to call the provided iommufd_viommu_alloc helper to embed the core
struct.
However, there is no guarantee that every driver will call it and allocate
objects properly.

Make the iommufd_object_finalize/abort functions more robust by verifying
that the xarray slot indexed by the input obj->id holds an XA_ZERO_ENTRY,
which is the reserved value stored by xa_alloc via iommufd_object_alloc.

Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d56d1e8 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new ioctl for user space to do a vIOMMU allocation. It must be based
on a nesting parent HWPT, so take its refcount.

An IOMMU driver wanting to support vIOMMUs must define its
IOMMU_VIOMMU_TYPE_ in the uAPI header and implement a viommu_alloc op in
its iommu_ops.
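
A minimal userspace sketch of the new ioctl (assuming the uAPI layout added
by this series; fd setup and error handling elided):

    struct iommu_viommu_alloc cmd = {
        .size = sizeof(cmd),
        .type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
        .dev_id = dev_id,      /* device backed by the physical IOMMU */
        .hwpt_id = s2_hwpt_id, /* nesting parent HWPT (refcount taken) */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd))
        err(1, "IOMMU_VIOMMU_ALLOC");
    viommu_id = cmd.out_viommu_id;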

Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4db97c2)
Signed-off-by: Koba Ko <[email protected]>
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent
hwpt/domain to allocate a nested domain.

Link: https://patch.msgid.link/r/2dcdb5e405dc0deb68230564530d989d285d959c.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 69d2689 linux)
Signed-off-by: Koba Ko <[email protected]>
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like
that nesting parent HWPT to allocate a nested HWPT.

Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.

Also, add an iommufd_viommu_alloc_hwpt_nested helper to allocate a nested
HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's
refcount already, increase the refcount of the vIOMMU only.
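
As a sketch under the same uAPI (identifiers like ste_data are
illustrative; for SMMUv3 it would be e.g. a struct iommu_hwpt_arm_smmuv3),
userspace can now pass the vIOMMU ID in pt_id when allocating the nested
HWPT:

    struct iommu_hwpt_alloc cmd = {
        .size = sizeof(cmd),
        .dev_id = dev_id,
        .pt_id = viommu_id,    /* vIOMMU standing in for the parent HWPT */
        .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
        .data_len = sizeof(ste_data),
        .data_uptr = (uintptr_t)&ste_data,
    };

    if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
        err(1, "IOMMU_HWPT_ALLOC");
    nested_hwpt_id = cmd.out_hwpt_id;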

Link: https://patch.msgid.link/r/a0f24f32bfada8b448d17587adcaedeeb50a67ed.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 13a7501 linux)
Signed-off-by: Koba Ko <[email protected]>
Use these inline helpers to shorten those container_of lines.

Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.

Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit fd6b853 linux)
Signed-off-by: Koba Ko <[email protected]>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.

Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 18f8199 linux)
Signed-off-by: Koba Ko <[email protected]>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.

Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.

Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 8607056 linux)
Signed-off-by: Koba Ko <[email protected]>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.

Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit db70827 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new iommufd_viommu FIXTURE and set it up with a vIOMMU object.

Any new vIOMMU feature will be added as a TEST_F under that.

Link: https://patch.msgid.link/r/abe267c9d004b29cb1712ceba2f378209d4b7e01.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7156cd9 linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc to reflect that and add a new graph.

Link: https://patch.msgid.link/r/7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 87210b1 linux)
Signed-off-by: Koba Ko <[email protected]>
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct
device) against a vIOMMU (struct iommufd_viommu) object in a VM.

This vDEVICE object (and its structure) holds all the info and attributes
in the VM regarding the device related to the vIOMMU.

As an initial patch, add a per-vIOMMU virtual ID. This can be:
 - Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table
 - Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table
 - Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table
Potentially, this vDEVICE structure would hold some vData for Confidential
Compute Architecture (CCA). Use this virtual ID to index a "vdevs" xarray
that belongs to a vIOMMU object.

Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection
of a device object and an iommufd_viommu object, take two refcounts in the
ioctl handler.
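
A minimal usage sketch (uAPI layout per this patch; on SMMUv3 the virt_id
would be the virtual StreamID):

    struct iommu_vdevice_alloc cmd = {
        .size = sizeof(cmd),
        .viommu_id = viommu_id, /* refcount taken on the vIOMMU */
        .dev_id = dev_id,       /* refcount taken on the device */
        .virt_id = virt_sid,    /* e.g. virtual StreamID on SMMUv3 */
    };

    if (ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &cmd))
        err(1, "IOMMU_VDEVICE_ALLOC");
    vdev_id = cmd.out_vdevice_id;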

Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0ce5c24 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a vdevice_alloc op to the viommu mock_viommu_ops for the coverage of
IOMMU_VIOMMU_TYPE_SELFTEST allocations. Then, add a vdevice_alloc TEST_F
to cover the IOMMU_VDEVICE_ALLOC ioctl.

Link: https://patch.msgid.link/r/4b9607e5b86726c8baa7b89bd48123fb44104a23.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5778c75 linux)
Signed-off-by: Koba Ko <[email protected]>
This per-vIOMMU cache_invalidate op is like the cache_invalidate_user op
in struct iommu_domain_ops, but wider, supporting device cache (e.g. PCI
ATC invalidations).

Link: https://patch.msgid.link/r/90138505850fa6b165135e78a87b4cc7022869a4.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67db79d linux)
Signed-off-by: Koba Ko <[email protected]>
With a vIOMMU object, user space can flush any IOMMU-related cache that can
be directed via a vIOMMU object. It is similar to the IOMMU_HWPT_INVALIDATE
uAPI, but can cover a wider range than IOTLB, e.g. device/descriptor cache.

Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id,
and reuse the IOMMU_HWPT_INVALIDATE uAPI for vIOMMU invalidations. Drivers
can define different structures for vIOMMU invalidations vs. HWPT ones.

Since both the HWPT-based and vIOMMU-based invalidation pathways check
their own cache invalidation op, remove the WARN_ON_ONCE in the allocator.

Update the uAPI, kdoc, and selftest case accordingly.
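
Illustratively (reusing the existing iommu_hwpt_invalidate layout; cmds and
n are placeholders for a driver-specific request array), an invalidation
directed at a vIOMMU could look like:

    struct iommu_hwpt_invalidate cmd = {
        .size = sizeof(cmd),
        .hwpt_id = viommu_id,  /* hwpt_id may now carry a vIOMMU ID */
        .data_type = IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
        .data_uptr = (uintptr_t)cmds,
        .entry_len = sizeof(*cmds),
        .entry_num = n,
    };

    if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
        err(1, "IOMMU_HWPT_INVALIDATE");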

Link: https://patch.msgid.link/r/b411e2245e303b8a964f39f49453a5dff280968f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 54ce69e linux)
Signed-off-by: Koba Ko <[email protected]>
The iommu_copy_struct_from_user_array helper can be used to copy a single
entry from a user array which might not be efficient if the array is big.

Add a new iommu_copy_struct_from_full_user_array to copy the entire user
array at once. Update the existing iommu_copy_struct_from_user_array kdoc
accordingly.

Link: https://patch.msgid.link/r/5cd773d9c26920c5807d232b21d415ea79172e49.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4f2e59c linux)
Signed-off-by: Koba Ko <[email protected]>
This avoids a bigger trouble of exposing struct iommufd_device and struct
iommufd_vdevice in the public header.

Link: https://patch.msgid.link/r/84fa7c624db4d4508067ccfdf42059533950180a.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit c747e67 linux)
Signed-off-by: Koba Ko <[email protected]>
Similar to the coverage of cache_invalidate_user for iotlb invalidation,
add a device cache and a viommu_cache_invalidate function to test it out.

Link: https://patch.msgid.link/r/a29c7c23d7cd143fb26ab68b3618e0957f485fdb.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d6563aa linux)
Signed-off-by: Koba Ko <[email protected]>
Similar to IOMMU_TEST_OP_MD_CHECK_IOTLB verifying a mock_domain's iotlb,
IOMMU_TEST_OP_DEV_CHECK_CACHE will be used to verify a mock_dev's cache.

Link: https://patch.msgid.link/r/cd4082079d75427bd67ed90c3c825e15b5720a5f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 576ad6e linux)
Signed-off-by: Koba Ko <[email protected]>
Add a viommu_cache test function to cover vIOMMU invalidations using the
updated IOMMU_HWPT_INVALIDATE ioctl, which now allows passing in a vIOMMU
via its hwpt_id field.

Link: https://patch.msgid.link/r/f317f902041f3d05deaee4ca3fdd8ef4b8297361.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 49ad127 linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc and the vIOMMU graph to reflect that.

Link: https://patch.msgid.link/r/e1ff278b7163909b2641ae04ff364bb41d2a2a2e.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit b047c06 linux)
Signed-off-by: Koba Ko <[email protected]>
Don't open code the calculations of the indexes for each level, provide
two functions to do that math and call them in all the places. Update all
the places computing indexes.

Calculate the L1 table size directly based on the max required index from
the cap. Remove STRTAB_L1_SZ_SHIFT in favour of STRTAB_NUM_L2_STES.

Use STRTAB_NUM_L2_STES to replace remaining open coded 1 << STRTAB_SPLIT.
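
Roughly, the two helpers reduce to the following (a sketch;
STRTAB_NUM_L2_STES being 1 << STRTAB_SPLIT):

    static inline u32 arm_smmu_strtab_l1_idx(u32 sid)
    {
        return sid / STRTAB_NUM_L2_STES;
    }

    static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
    {
        return sid % STRTAB_NUM_L2_STES;
    }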

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit ce41041 linux)
Signed-off-by: Koba Ko <[email protected]>
Add types struct arm_smmu_strtab_l1 and l2 to represent the HW layout of
the descriptors, and use them in most places; following patches will get
the remaining places. The sizes of the l1 and l2 HW allocations are
sizeof(struct arm_smmu_strtab_l1/2).

This provides some more clarity than having raw __le64 *'s and sizes
computed via macros.

Remove STRTAB_L1_DESC_DWORDS.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit abb4f9d linux)
Signed-off-by: Koba Ko <[email protected]>
The members here are being used for both the linear and the 2 level case,
with the meaning of each item slightly different in the two cases.

Split it into a clean union where both cases have their own struct with
their own logical names and correct types.

Adjust all the users to detect linear/2lvl and use the right sub structure
and types consistently.

Remove STRTAB_STE_DWORDS by changing the last places to use
sizeof(struct arm_smmu_ste).

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 85196f5 linux)
Signed-off-by: Koba Ko <[email protected]>
These values can be computed from the other values already stored in the
config. Move the calculation to arm_smmu_write_strtab() and do it directly
before writing the registers.

This moves all the logic to calculate the two registers into one function
from three and saves an unimportant 16 bytes from the arm_smmu_device.

Suggested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 8c153ef linux)
Signed-off-by: Koba Ko <[email protected]>
The master->cd_table is entirely contained within the struct
arm_smmu_master which is guaranteed to be freed by the core code under
arm_smmu_release_device().

There is no reason to use devm here, arm_smmu_free_cd_tables() is reliably
called to free the CD related memory. Remove it and save some memory.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 47b2de3 linux)
Signed-off-by: Koba Ko <[email protected]>
The top of the 2 level CD table is (at most) 1024 entries big, and two
high order allocations are required. One of __le64 which is programmed
into the HW (8k) and one of struct arm_smmu_l1_ctx_desc which holds the
CPU pointer (16k).

There are two copies of the l2ptr_dma, one is stored in the struct
arm_smmu_l1_ctx_desc, and another is encoded in the __le64 for the HW to
use. Instead of storing two copies just decode the value from the __le64.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit c0a25a9 linux)
Signed-off-by: Koba Ko <[email protected]>
As well as indexing helpers arm_smmu_cdtab_l1/2_idx().

Remove CTXDESC_L1_DESC_DWORDS and CTXDESC_CD_DWORDS replacing them all
with type specific calculations.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7c567eb linux)
Signed-off-by: Koba Ko <[email protected]>
nicolinc and others added 30 commits January 22, 2025 16:48
When VCMDQs are assigned to a VINTF owned by a guest (HYP_OWN bit unset),
only TLB and ATC invalidation commands are supported by the VCMDQ HW. So,
implement the new cmdq->supports_cmd op to scan the input cmd in order to
make sure that it is supported by the selected queue.

Note that the guest VM shouldn't have the HYP_OWN bit set regardless of
whether the guest kernel driver writes it or not, i.e. the hypervisor
running in the host OS should wire this bit to zero when trapping a write
access to this VINTF_CONFIG register from a guest kernel.
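
In sketch form (signature simplified, opcode list abbreviated), the scan is
an opcode whitelist:

    /* Simplified: a guest-owned VCMDQ takes only invalidation commands */
    static bool vcmdq_supports_cmd(u64 *cmd)
    {
        switch (FIELD_GET(CMDQ_0_OP, cmd[0])) {
        case CMDQ_OP_TLBI_NH_VA:
        case CMDQ_OP_TLBI_NH_ASID:
        case CMDQ_OP_ATC_INV:
            return true;
        default:
            return false;
        }
    }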

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/8160292337059b91271045800e5c62f7295e2c24.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a9d4028 linux)
Signed-off-by: Koba Ko <[email protected]>
This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the stage 1 usual default),
however this choice has no significant visible impact on the VFIO
user. Further, qemu never implemented this and no other userspace user is
known.

The original description in commit f5c9ece ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part useless dead code.

Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.

Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.

Just in case there is some userspace using this, continue to treat
requesting it as a NOP, but do not advertise support any more.

Acked-by: Alex Williamson <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 35890f8 linux)
Signed-off-by: Koba Ko <[email protected]>
ACPICA commit c4f5c083d24df9ddd71d5782c0988408cf0fc1ab

The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

This CANWBS defines the coherency of memory accesses to be not marked IOWB
cacheable/shareable. Its value further implies the coherency impact from a
pair of mismatched memory attributes (e.g. in a nested translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Link: acpica/acpica@c4f5c083
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Acked-by: Hanjun Guo <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1b8655b linux)
Signed-off-by: Koba Ko <[email protected]>
The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

This CANWBS defines the coherency of memory accesses to be not marked IOWB
cacheable/shareable. Its value further implies the coherency impact from a
pair of mismatched memory attributes (e.g. in a nested translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Note that the loss of coherency on a CANWBS-unsupported HW typically could
occur to an SMMU that doesn't implement the S2FWB feature where additional
cache flush operations would be required to prevent that from happening.

Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon
the presence of this new flag.

CANWBS and S2FWB are similar features, in that they both guarantee the VM
can not violate coherency, however S2FWB can be bypassed by PCI No Snoop
TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set
IOMMU_CAP_ENFORCE_CACHE_COHERENCY.

Architecturally ARM has expected that VFIO would disable No Snoop through
PCI Config space, if this is done then the two would have the same
protections.

Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Acked-by: Hanjun Guo <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 807404d linux)
Signed-off-by: Koba Ko <[email protected]>
For SMMUv3 the parent must be an S2 domain, which can be composed
into an IOMMU_DOMAIN_NESTED.

In future the S2 parent will also need a VMID linked to the VIOMMU and
even to KVM.

Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 874b87c linux)
Signed-off-by: Koba Ko <[email protected]>
The arm-smmuv3-iommufd.c file will need to call these functions too.
Remove statics and put them in the header file. Remove the kunit
visibility protections from arm_smmu_make_abort_ste() and
arm_smmu_make_s2_domain_ste().

Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f6681ab linux)
Signed-off-by: Koba Ko <[email protected]>
For SMMUv3 an IOMMU_DOMAIN_NESTED is composed of an S2 iommu_domain acting
as the parent and a user provided STE fragment that defines the CD table
and related data with addresses translated by the S2 iommu_domain.

The kernel only permits userspace to control certain allowed bits of the
STE that are safe for user/guest control.

IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
translation, but there is no way of knowing which S1 entries refer to a
range of S2.

For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
flush all ASIDs from the VMID after flushing the S2 on any change to the
S2.

The IOMMU_DOMAIN_NESTED can only be created from inside a VIOMMU as the
invalidation path relies on the VIOMMU to translate virtual stream ID used
in the invalidation commands for the CD table and ATS.
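
The user-provided STE fragment travels through the uAPI as a small struct
(per this series), of which only the kernel-vetted bits are honoured:

    struct iommu_hwpt_arm_smmuv3 {
        __aligned_u64 ste[2];
    };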

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 1e8be08 linux)
Signed-off-by: Koba Ko <[email protected]>
Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
works. When S2FWB is supported and enabled the IOPTE will force cachable
access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
access when !IOMMU_CACHE.

When using a single stage of translation, a simple S2 domain, it doesn't
change things for PCI devices as it is just a different encoding for the
existing mapping of the IOMMU protection flags to cachability attributes.
For non-PCI it also changes the combining rules when incoming transactions
have inconsistent attributes.

However, when used with a nested S1, FWB has the effect of preventing the
guest from choosing a MemAttr in its S1 that would cause ordinary DMA to
bypass the cache. Consistent with KVM we wish to deny the guest the
ability to become incoherent with cached memory the hypervisor believes is
cachable so we don't have to flush it.

Allow NESTED domains to be created if the SMMU has S2FWB support and use
S2FWB for NESTING_PARENTS. This is an additional option to CANWBS.

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67e4fe3 linux)
Signed-off-by: Koba Ko <[email protected]>
The EATS flag needs to flow through the vSTE and into the pSTE, and ensure
physical ATS is enabled on the PCI device.

The physical ATS state must match the VM's idea of EATS as we rely on the
VM to issue the ATS invalidation commands. Thus ATS must remain off at the
device until EATS on a nesting domain turns it on. Attaching a nesting
domain is the point where the invalidation responsibility transfers to
userspace.

Update the ATS logic to track EATS for nesting domains and flush the
ATC whenever the S2 nesting parent changes.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit f27298a linux)
Signed-off-by: Koba Ko <[email protected]>
…domain

In a 1-stage translation setup, a device is attached to a paging domain.
In a 2-stage translation setup, a device is attached to a nested domain,
which does not have the mappings for the MSI page but only an s2_parent
paging domain pointer that holds the mappings.

Add arm_smmu_get_msi_mapping_domain in arm_smmu_nested_ops to return the
correct paging domain.

Signed-off-by: Nicolin Chen <[email protected]>
(cherry picked from commit c019f15752e65bdc6c28c480d60eb92e58ae9188 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)
Signed-off-by: Koba Ko <[email protected]>
Currently, iommu-dma is the only place outside of IOMMUFD and drivers
which might need to be aware of the stage 2 domain encapsulated within
a nested domain. This would still be the RMR solution where we're using
host-managed MSIs with an identity mapping at stage 1, where it is
the underlying stage 2 domain which owns an MSI cookie and holds the
corresponding dynamic mappings. Hook up the new op to resolve what we
need from a nested domain.

Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit d096dab 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit a556373 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d4223d6db2896ec510bfc57cf018010d07ff3659 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
This is used for GPU memory mapping. The solution is a WAR while waiting
for the upstream solution that would use dmabuf to map the entire range
in a single sequence.

Related topics:
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 88f15bf 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Fix typos/spellos in kernel-doc comments for readability.

Fixes: aad37e7 ("iommufd: IOCTLs for the io_pagetable")
Fixes: b7a0855 ("iommu: Add new flag to explictly request PASID capable domain")
Fixes: d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object")
Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7937a1b linux)
Signed-off-by: Koba Ko <[email protected]>
Commit 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
started using _iommufd_object_alloc() without importing the IOMMUFD
module namespace, resulting in a modpost warning:

  WARNING: modpost: module arm_smmu_v3 uses symbol _iommufd_object_alloc from namespace IOMMUFD, but does not import it.

Commit d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE
using a VIOMMU object") added another warning by using
iommufd_viommu_find_dev():

  WARNING: modpost: module arm_smmu_v3 uses symbol iommufd_viommu_find_dev from namespace IOMMUFD, but does not import it.

Import the IOMMUFD module namespace to resolve the warnings.

Fixes: 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
Link: https://patch.msgid.link/r/20241114-arm-smmu-v3-import-iommufd-module-ns-v1-1-c551e7b972e9@kernel.org
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6d026e6 linux)
Signed-off-by: Koba Ko <[email protected]>
Replace comma between expressions with semicolons.

Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.

Found by inspection.
No functional change intended.
Compile tested only.

Fixes: e3b1be2 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Chen Ni <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7de7d35 linux)
Signed-off-by: Koba Ko <[email protected]>
The function arm_smmu_init_strtab_2lvl uses the expression

((1 << smmu->sid_bits) - 1)

to calculate the largest StreamID value. However, this fails for the
maximum allowed value of SMMU_IDR1.SIDSIZE which is 32. The C standard
states:

"If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined."

With smmu->sid_bits being 32, the prerequisites for undefined behavior
are met.  We observed that the value of (1 << 32) is 1 and not 0 as we
initially expected.

Similar bit shift operations in arm_smmu_init_strtab_linear seem to not
be affected, because it appears to be unlikely for an SMMU to have
SMMU_IDR1.SIDSIZE set to 32 but then not support 2-level Stream tables.

This issue was found by Ryan Huang <[email protected]> on our team.
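
Conceptually, the fix widens the shift to 64 bits (a diff-style sketch,
assuming the last_sid_idx computation from the Fixes: commit):

    - u32 last_sid_idx = arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
    + u32 last_sid_idx = arm_smmu_strtab_l1_idx((1ULL << smmu->sid_bits) - 1);

With sid_bits == 32, 1ULL << 32 is a well-defined 0x100000000, so the
largest StreamID computes to 0xffffffff as intended.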

Fixes: ce41041 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Signed-off-by: Daniel Mentz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f63237f linux)
Signed-off-by: Koba Ko <[email protected]>
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.

As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.
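
A sketch of the idea (field names as recalled from the driver;
illustrative only):

    /*
     * The CPU number only load-balances traffic across the LVCMDQs, so
     * a stale value after preemption or migration is harmless.
     */
    u16 qidx = raw_smp_processor_id() % cmdqv->num_lvcmdqs_per_vintf;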

Cc: <[email protected]>
Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1f80621 linux)
Signed-off-by: Koba Ko <[email protected]>
When configuring a kernel with PAGE_SIZE=4KB, depending on its setting of
CONFIG_CMA_ALIGNMENT, VCMDQ_LOG2SIZE_MAX=19 could fail the alignment test
and trigger a WARN_ON:
    WARNING: at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3646
    Call trace:
     arm_smmu_init_one_queue+0x15c/0x210
     tegra241_cmdqv_init_structures+0x114/0x338
     arm_smmu_device_probe+0xb48/0x1d90

Fix it by capping max_n_shift to CMDQ_MAX_SZ_SHIFT as SMMUv3 CMDQ does.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a379971 linux)
Signed-off-by: Koba Ko <[email protected]>
Fix a sparse warning.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 89edbe8 linux)
Signed-off-by: Koba Ko <[email protected]>
…herent

It's observed that, when the first 4GB of system memory was reserved, all
VCMDQ allocations failed (even with the smallest qsz in the last attempt):
    arm-smmu-v3: found companion CMDQV device: NVDA200C:00
    arm-smmu-v3: option mask 0x10
    arm-smmu-v3: failed to allocate queue (0x8000 bytes) for vcmdq0
    acpi NVDA200C:00: tegra241_cmdqv: Falling back to standard SMMU CMDQ
    arm-smmu-v3: ias 48-bit, oas 48-bit (features 0x001e1fbf)
    arm-smmu-v3: allocated 524288 entries for cmdq
    arm-smmu-v3: allocated 524288 entries for evtq
    arm-smmu-v3: allocated 524288 entries for priq

This is because the 4GB reserved memory shifted the entire DMA zone from a
lower 32-bit range (on a system without the 4GB carveout) to higher range,
while the dev->coherent_dma_mask was set to DMA_BIT_MASK(32) by default.

The dma_set_mask_and_coherent() call is done in arm_smmu_device_hw_probe()
of the SMMU driver. So any DMA allocation from tegra241_cmdqv_probe() must
wait until the coherent_dma_mask is correctly set.

Move the vintf/vcmdq structure initialization routine into a different op,
"init_structures". Call it at the end of arm_smmu_init_structures(), where
standard SMMU queues get allocated.

Most of the impl_ops aren't ready until the vintf/vcmdq structures are init-ed.
So replace the full impl_ops with an init_ops in __tegra241_cmdqv_probe().

And switch to tegra241_cmdqv_impl_ops later in arm_smmu_init_structures().
Note that tegra241_cmdqv_impl_ops does not link to the new init_structures
op after this switch, since there is no point in having it once it's done.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: Matt Ochs <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/530993c3aafa1b0fc3d879b8119e13c629d12e2b.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 483e0bd linux)
Signed-off-by: Koba Ko <[email protected]>
This is likely a typo. Drop it.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/13fd3accb5b7ed6ec11cc6b7435f79f84af9f45f.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 2408b81 linux)
Signed-off-by: Koba Ko <[email protected]>
The ioremap() function doesn't return error pointers, it returns NULL
on error, so update the error handling. Also just return directly
instead of calling iounmap() on the NULL pointer. Calling
iounmap(NULL) doesn't cause a problem on ARM, but on other architectures
it can trigger a warning, so it's a bad habit.
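
The corrected pattern is roughly (identifiers illustrative):

    base = ioremap(res->start, resource_size(res));
    if (!base) {               /* ioremap() returns NULL, not ERR_PTR() */
        dev_err(dev, "failed to ioremap CMDQV MMIO region\n");
        return -ENOMEM;        /* return directly; never iounmap(NULL) */
    }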

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 086a3c4 linux)
Signed-off-by: Koba Ko <[email protected]>
…r_header

Kernel test robot reported a few truncation warnings at the snprintf:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:
	In function ‘tegra241_vintf_free_lvcmdq’:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:56:
	warning: ‘%u’ directive output may be truncated writing between 1 and
	5 bytes into a region of size between 3 and 11 [-Wformat-truncation=]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                                        ^~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:32: note: directive argument
	in the range [0, 65535]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:9: note: ‘snprintf’ output
	between 25 and 37 bytes into a destination of size 32
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  240 |                  vcmdq->vintf->idx, vcmdq->idx, vcmdq->lidx);

Fix by bumping up the size of the header to hold more characters.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit db184a1 linux)
Signed-off-by: Koba Ko <[email protected]>
NVIDIA is planning to productize a new Grace Hopper superchip
SKU with device ID 0x2348.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

Signed-off-by: Ankit Agrawal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit 12cd88a linux)
Signed-off-by: Koba Ko <[email protected]>
…d for uncached resmem

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) superchip, providing the CPU and
GPU cache coherent access to each other's memory over an internal
proprietary chip-to-chip cache coherent interconnect.

There is a HW defect on GH systems to support the Multi-Instance
GPU (MIG) feature [1] that necessitated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising regions 2 and 3) to
work around the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…to the VM

There is a HW defect on Grace Hopper (GH) to support the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising regions 2 and 3)
to work around the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region2 and 3) to the VM instead of the fake one. This
takes care of both the differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…status

In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these registers and check for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the time is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence keeping it generalized to both generations.

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
virtualization

This adds the following config options to annotations:

            CONFIG_ARM_SMMU_V3_IOMMUFD=y
            CONFIG_IOMMUFD_DRIVER_CORE=y
            CONFIG_IOMMUFD_VFIO_CONTAINER=y
            CONFIG_NVGRACE_GPU_VFIO_PCI=m
            CONFIG_VFIO_CONTAINER=n
            CONFIG_VFIO_IOMMU_TYPE1=-
            CONFIG_TEGRA241_CMDQV=n

For CMA size requirements, the 64K kernel configuration needs 640MB
in the worst-case scenario, while the 4K kernel configuration requires 40MB.
Due to the current CMA alignment requirement of 512MB on the 64k kernel and
128MB on the 4k kernel, use the following defaults:
            For 64k kernel, CONFIG_CMA_SIZE_MBYTES=1024
            For 4k kernel, CONFIG_CMA_SIZE_MBYTES=128
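
(That is, the worst-case 640MB rounds up to the next 512MB-aligned CMA
size, 1024MB, on the 64k kernel, while 40MB rounds up to the 128MB
alignment on the 4k kernel.)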

These config options have been defined in debian.master
            CONFIG_IOMMUFD=m
            CONFIG_IOMMU_IOPF=y

Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit 35a55f3 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>