
[linux-nvidia-internal-6.11][Backport] GPU passthrough cuda support #52

Open
wants to merge 71 commits into base: 24.04_linux-nvidia-internal-6.11-next
Conversation

KobaKoNvidia
Collaborator

[Description]
Backport patches from [0] [1] [2] to enable GPU passthrough for CUDA.

[0] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=55:57
[1] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=60:61
[2] https://git-master.nvidia.com/r/plugins/gitiles/linux-stable/+log/refs/heads/dev/nic/iommufd_vsmmu-12122024

[Test plan]

  1. Boot up the host.
  2. Boot up VMs on the host with 1 GPU, 2 GPUs, 3 GPUs, and 4 GPUs.
  3. Run the following basic checks [3]:
# Get the GPU devices
$ lspci | grep 3D
# Show GPU info
$ nvidia-smi
# The following tests must pass
$ /root/r570/tests/runtime/gflops/gflops
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t texture_simple
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
  4. Check the host's dmesg.

[Misc]

  1. Passed arm64 and amd64 builds in Noble. [5]

[3], [5]: logs for VMs, host's dmesg, and build logs:
https://drive.google.com/drive/folders/1bJYyfSoIR_BmtW20BWp178WXX8tOXhHo?usp=sharing

nicolinc and others added 30 commits January 22, 2025 16:48
Prepare for an embedded structure design for driver-level iommufd_viommu
objects:
    // include/linux/iommufd.h
    struct iommufd_viommu {
        struct iommufd_object obj;
        ....
    };

    // Some IOMMU driver
    struct iommu_driver_viommu {
        struct iommufd_viommu core;
        ....
    };

It has to expose struct iommufd_object and enum iommufd_object_type from
the core-level private header to the public iommufd header.

Link: https://patch.msgid.link/r/54a43b0768089d690104530754f499ca05ce0074.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d1b3dad linux)
Signed-off-by: Koba Ko <[email protected]>
The following patch will add a new vIOMMU allocator that will require this
_iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too).

Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE
selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that remaining
to be selectable for drivers to build the existing iova_bitmap.c file.

Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7d4f46c linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent
a slice of physical IOMMU device passed to or shared with a user space VM.
This slice, now a vIOMMU object, is a group of virtualization resources of
a physical IOMMU, such as:
 - Security namespace for guest owned ID, e.g. guest-controlled cache tags
 - Non-device-affiliated event reporting, e.g. invalidation queue errors
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platform IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Direct assigned invalidation queues
 - Direct assigned interrupts

Add a new viommu_alloc op in iommu_ops, for drivers to allocate their own
vIOMMU structures. And this allocation also needs a free(), so add struct
iommufd_viommu_ops.

To simplify a vIOMMU allocation, provide an iommufd_viommu_alloc() helper.
It's suggested that a driver embed a core-level viommu structure in its
driver-level viommu struct and call the iommufd_viommu_alloc() helper,
while also implementing its viommu ops:
    struct my_driver_viommu {
        struct iommufd_viommu core;
        /* driver-owned properties/features */
        ....
    };

    static const struct iommufd_viommu_ops my_driver_viommu_ops = {
        .free = my_driver_viommu_free,
        /* future ops for virtualization features */
        ....
    };

    static struct iommufd_viommu *my_driver_viommu_alloc(...)
    {
        struct my_driver_viommu *my_viommu =
                iommufd_viommu_alloc(ictx, my_driver_viommu, core,
                                     my_driver_viommu_ops);
        /* Init my_viommu and related HW feature */
        ....
        return &my_viommu->core;
    }

    static struct iommu_ops my_driver_iommu_ops = {
        ....
        .viommu_alloc = my_driver_viommu_alloc,
    };

Link: https://patch.msgid.link/r/64685e2b79dea0f1dc56f6ede04809b72d578935.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6b22d56 linux)
Signed-off-by: Koba Ko <[email protected]>
To support driver-allocated vIOMMU objects, it's required for an IOMMU
driver to call the provided iommufd_viommu_alloc helper to embed the core
struct.
However, there is no guarantee that every driver will call it and allocate
objects properly.

Make the iommufd_object_finalize/abort functions more robust by verifying
that the xarray slot indexed by the input obj->id holds an XA_ZERO_ENTRY,
which is the reserved value stored by xa_alloc via iommufd_object_alloc.

Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d56d1e8 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new ioctl for user space to do a vIOMMU allocation. It must be based
on a nesting parent HWPT, so take its refcount.

An IOMMU driver wanting to support vIOMMUs must define its
IOMMU_VIOMMU_TYPE_ in the uAPI header and implement a viommu_alloc op in
its iommu_ops.
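
A minimal userspace sketch of the new ioctl (assuming the uAPI layout added
by this series; fd setup and error handling elided):

    struct iommu_viommu_alloc cmd = {
        .size = sizeof(cmd),
        .type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
        .dev_id = dev_id,      /* device backed by the physical IOMMU */
        .hwpt_id = s2_hwpt_id, /* nesting parent HWPT (refcount taken) */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd))
        err(1, "IOMMU_VIOMMU_ALLOC");
    viommu_id = cmd.out_viommu_id;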

Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4db97c2)
Signed-off-by: Koba Ko <[email protected]>
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent
hwpt/domain to allocate a nested domain.

Link: https://patch.msgid.link/r/2dcdb5e405dc0deb68230564530d989d285d959c.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 69d2689 linux)
Signed-off-by: Koba Ko <[email protected]>
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like
that nesting parent HWPT to allocate a nested HWPT.

Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.

Also, add an iommufd_viommu_alloc_hwpt_nested helper to allocate a nested
HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's
refcount already, increase the refcount of the vIOMMU only.
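
As a sketch under the same uAPI (identifiers like ste_data are
illustrative; for SMMUv3 it would be e.g. a struct iommu_hwpt_arm_smmuv3),
userspace can now pass the vIOMMU ID in pt_id when allocating the nested
HWPT:

    struct iommu_hwpt_alloc cmd = {
        .size = sizeof(cmd),
        .dev_id = dev_id,
        .pt_id = viommu_id,    /* vIOMMU standing in for the parent HWPT */
        .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
        .data_len = sizeof(ste_data),
        .data_uptr = (uintptr_t)&ste_data,
    };

    if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
        err(1, "IOMMU_HWPT_ALLOC");
    nested_hwpt_id = cmd.out_hwpt_id;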

Link: https://patch.msgid.link/r/a0f24f32bfada8b448d17587adcaedeeb50a67ed.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 13a7501 linux)
Signed-off-by: Koba Ko <[email protected]>
Use these inline helpers to shorten those container_of lines.

Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.

Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit fd6b853 linux)
Signed-off-by: Koba Ko <[email protected]>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.

Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 18f8199 linux)
Signed-off-by: Koba Ko <[email protected]>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.

Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.

Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 8607056 linux)
Signed-off-by: Koba Ko <[email protected]>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.

Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit db70827 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new iommufd_viommu FIXTURE and set it up with a vIOMMU object.

Any new vIOMMU feature will be added as a TEST_F under that.

Link: https://patch.msgid.link/r/abe267c9d004b29cb1712ceba2f378209d4b7e01.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7156cd9 linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc to reflect that and add a new graph.

Link: https://patch.msgid.link/r/7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 87210b1 linux)
Signed-off-by: Koba Ko <[email protected]>
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct
device) against a vIOMMU (struct iommufd_viommu) object in a VM.

This vDEVICE object (and its structure) holds all the info and attributes
in the VM regarding the device related to the vIOMMU.

As an initial patch, add a per-vIOMMU virtual ID. This can be:
 - Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table
 - Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table
 - Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table
Potentially, this vDEVICE structure would hold some vData for Confidential
Compute Architecture (CCA). Use this virtual ID to index a "vdevs" xarray
that belongs to a vIOMMU object.

Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection
of a device object and an iommufd_viommu object, take two refcounts in the
ioctl handler.
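
A minimal usage sketch (uAPI layout per this patch; on SMMUv3 the virt_id
would be the virtual StreamID):

    struct iommu_vdevice_alloc cmd = {
        .size = sizeof(cmd),
        .viommu_id = viommu_id, /* refcount taken on the vIOMMU */
        .dev_id = dev_id,       /* refcount taken on the device */
        .virt_id = virt_sid,    /* e.g. virtual StreamID on SMMUv3 */
    };

    if (ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &cmd))
        err(1, "IOMMU_VDEVICE_ALLOC");
    vdev_id = cmd.out_vdevice_id;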

Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0ce5c24 linux)
Signed-off-by: Koba Ko <[email protected]>
Add a vdevice_alloc op to the viommu mock_viommu_ops for the coverage of
IOMMU_VIOMMU_TYPE_SELFTEST allocations. Then, add a vdevice_alloc TEST_F
to cover the IOMMU_VDEVICE_ALLOC ioctl.

Link: https://patch.msgid.link/r/4b9607e5b86726c8baa7b89bd48123fb44104a23.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5778c75 linux)
Signed-off-by: Koba Ko <[email protected]>
This per-vIOMMU cache_invalidate op is like the cache_invalidate_user op
in struct iommu_domain_ops, but wider, supporting device cache (e.g. PCI
ATC invalidations).

Link: https://patch.msgid.link/r/90138505850fa6b165135e78a87b4cc7022869a4.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67db79d linux)
Signed-off-by: Koba Ko <[email protected]>
With a vIOMMU object, user space can flush any IOMMU-related cache that can
be directed via a vIOMMU object. It is similar to the IOMMU_HWPT_INVALIDATE
uAPI, but can cover a wider range than IOTLB, e.g. device/descriptor cache.

Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id,
and reuse the IOMMU_HWPT_INVALIDATE uAPI for vIOMMU invalidations. Drivers
can define different structures for vIOMMU invalidations vs. HWPT ones.

Since both the HWPT-based and vIOMMU-based invalidation pathways check
their own cache invalidation op, remove the WARN_ON_ONCE in the allocator.

Update the uAPI, kdoc, and selftest case accordingly.
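
Illustratively (reusing the existing iommu_hwpt_invalidate layout; cmds and
n are placeholders for a driver-specific request array), an invalidation
directed at a vIOMMU could look like:

    struct iommu_hwpt_invalidate cmd = {
        .size = sizeof(cmd),
        .hwpt_id = viommu_id,  /* hwpt_id may now carry a vIOMMU ID */
        .data_type = IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
        .data_uptr = (uintptr_t)cmds,
        .entry_len = sizeof(*cmds),
        .entry_num = n,
    };

    if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
        err(1, "IOMMU_HWPT_INVALIDATE");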

Link: https://patch.msgid.link/r/b411e2245e303b8a964f39f49453a5dff280968f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 54ce69e linux)
Signed-off-by: Koba Ko <[email protected]>
The iommu_copy_struct_from_user_array helper can be used to copy a single
entry from a user array which might not be efficient if the array is big.

Add a new iommu_copy_struct_from_full_user_array to copy the entire user
array at once. Update the existing iommu_copy_struct_from_user_array kdoc
accordingly.

Link: https://patch.msgid.link/r/5cd773d9c26920c5807d232b21d415ea79172e49.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4f2e59c linux)
Signed-off-by: Koba Ko <[email protected]>
This avoids a bigger trouble of exposing struct iommufd_device and struct
iommufd_vdevice in the public header.

Link: https://patch.msgid.link/r/84fa7c624db4d4508067ccfdf42059533950180a.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit c747e67 linux)
Signed-off-by: Koba Ko <[email protected]>
Similar to the coverage of cache_invalidate_user for iotlb invalidation,
add a device cache and a viommu_cache_invalidate function to test it out.

Link: https://patch.msgid.link/r/a29c7c23d7cd143fb26ab68b3618e0957f485fdb.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d6563aa linux)
Signed-off-by: Koba Ko <[email protected]>
Similar to IOMMU_TEST_OP_MD_CHECK_IOTLB verifying a mock_domain's iotlb,
IOMMU_TEST_OP_DEV_CHECK_CACHE will be used to verify a mock_dev's cache.

Link: https://patch.msgid.link/r/cd4082079d75427bd67ed90c3c825e15b5720a5f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 576ad6e linux)
Signed-off-by: Koba Ko <[email protected]>
Add a viommu_cache test function to cover vIOMMU invalidations using the
updated IOMMU_HWPT_INVALIDATE ioctl, which now allows passing in a vIOMMU
via its hwpt_id field.

Link: https://patch.msgid.link/r/f317f902041f3d05deaee4ca3fdd8ef4b8297361.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 49ad127 linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc and the vIOMMU graph to reflect that.

Link: https://patch.msgid.link/r/e1ff278b7163909b2641ae04ff364bb41d2a2a2e.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit b047c06 linux)
Signed-off-by: Koba Ko <[email protected]>
Don't open code the calculations of the indexes for each level, provide
two functions to do that math and call them in all the places. Update all
the places computing indexes.

Calculate the L1 table size directly based on the max required index from
the cap. Remove STRTAB_L1_SZ_SHIFT in favour of STRTAB_NUM_L2_STES.

Use STRTAB_NUM_L2_STES to replace remaining open coded 1 << STRTAB_SPLIT.
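
Roughly, the two helpers reduce to the following (a sketch;
STRTAB_NUM_L2_STES being 1 << STRTAB_SPLIT):

    static inline u32 arm_smmu_strtab_l1_idx(u32 sid)
    {
        return sid / STRTAB_NUM_L2_STES;
    }

    static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
    {
        return sid % STRTAB_NUM_L2_STES;
    }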

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit ce41041 linux)
Signed-off-by: Koba Ko <[email protected]>
Add types struct arm_smmu_strtab_l1 and l2 to represent the HW layout of
the descriptors, and use them in most places; following patches will get
the remaining places. The sizes of the l1 and l2 HW allocations are
sizeof(struct arm_smmu_strtab_l1/2).

This provides some more clarity than having raw __le64 *'s and sizes
computed via macros.

Remove STRTAB_L1_DESC_DWORDS.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit abb4f9d linux)
Signed-off-by: Koba Ko <[email protected]>
The members here are being used for both the linear and the 2 level case,
with the meaning of each item slightly different in the two cases.

Split it into a clean union where both cases have their own struct with
their own logical names and correct types.

Adjust all the users to detect linear/2lvl and use the right sub structure
and types consistently.

Remove STRTAB_STE_DWORDS by changing the last places to use
sizeof(struct arm_smmu_ste).

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 85196f5 linux)
Signed-off-by: Koba Ko <[email protected]>
These values can be computed from the other values already stored in the
config. Move the calculation to arm_smmu_write_strtab() and do it directly
before writing the registers.

This moves all the logic to calculate the two registers into one function
from three and saves an unimportant 16 bytes from the arm_smmu_device.

Suggested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 8c153ef linux)
Signed-off-by: Koba Ko <[email protected]>
The master->cd_table is entirely contained within the struct
arm_smmu_master which is guaranteed to be freed by the core code under
arm_smmu_release_device().

There is no reason to use devm here, arm_smmu_free_cd_tables() is reliably
called to free the CD related memory. Remove it and save some memory.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 47b2de3 linux)
Signed-off-by: Koba Ko <[email protected]>
The top of the 2 level CD table is (at most) 1024 entries big, and two
high order allocations are required. One of __le64 which is programmed
into the HW (8k) and one of struct arm_smmu_l1_ctx_desc which holds the
CPU pointer (16k).

There are two copies of the l2ptr_dma, one is stored in the struct
arm_smmu_l1_ctx_desc, and another is encoded in the __le64 for the HW to
use. Instead of storing two copies just decode the value from the __le64.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit c0a25a9 linux)
Signed-off-by: Koba Ko <[email protected]>
As well as indexing helpers arm_smmu_cdtab_l1/2_idx().

Remove CTXDESC_L1_DESC_DWORDS and CTXDESC_CD_DWORDS replacing them all
with type specific calculations.

Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7c567eb linux)
Signed-off-by: Koba Ko <[email protected]>
nicolinc and others added 30 commits January 22, 2025 16:48
When VCMDQs are assigned to a VINTF owned by a guest (HYP_OWN bit unset),
only TLB and ATC invalidation commands are supported by the VCMDQ HW. So,
implement the new cmdq->supports_cmd op to scan the input cmd in order to
make sure that it is supported by the selected queue.

Note that the guest VM shouldn't have the HYP_OWN bit set regardless of
whether the guest kernel driver writes it or not, i.e. the hypervisor
running in the host OS should wire this bit to zero when trapping a write
access to this VINTF_CONFIG register from a guest kernel.
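
In sketch form (signature simplified, opcode list abbreviated), the scan is
an opcode whitelist:

    /* Simplified: a guest-owned VCMDQ takes only invalidation commands */
    static bool vcmdq_supports_cmd(u64 *cmd)
    {
        switch (FIELD_GET(CMDQ_0_OP, cmd[0])) {
        case CMDQ_OP_TLBI_NH_VA:
        case CMDQ_OP_TLBI_NH_ASID:
        case CMDQ_OP_ATC_INV:
            return true;
        default:
            return false;
        }
    }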

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/8160292337059b91271045800e5c62f7295e2c24.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a9d4028 linux)
Signed-off-by: Koba Ko <[email protected]>
This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the stage 1 usual default),
however this choice has no significant visible impact on the VFIO
user. Further, qemu never implemented this and no other userspace user is
known.

The original description in commit f5c9ece ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part useless dead code.

Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.

Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.

Just in case there is some userspace using this, continue to treat
requesting it as a NOP, but do not advertise support any more.

Acked-by: Alex Williamson <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 35890f8 linux)
Signed-off-by: Koba Ko <[email protected]>
ACPICA commit c4f5c083d24df9ddd71d5782c0988408cf0fc1ab

The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

This CANWBS defines the coherency of memory accesses to be not marked IOWB
cacheable/shareable. Its value further implies the coherency impact from a
pair of mismatched memory attributes (e.g. in a nested translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Link: acpica/acpica@c4f5c083
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Acked-by: Hanjun Guo <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1b8655b linux)
Signed-off-by: Koba Ko <[email protected]>
The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

This CANWBS defines the coherency of memory accesses to be not marked IOWB
cacheable/shareable. Its value further implies the coherency impact from a
pair of mismatched memory attributes (e.g. in a nested translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Note that the loss of coherency on a CANWBS-unsupported HW typically could
occur to an SMMU that doesn't implement the S2FWB feature where additional
cache flush operations would be required to prevent that from happening.

Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon
the presence of this new flag.

CANWBS and S2FWB are similar features, in that they both guarantee the VM
can not violate coherency, however S2FWB can be bypassed by PCI No Snoop
TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set
IOMMU_CAP_ENFORCE_CACHE_COHERENCY.

Architecturally ARM has expected that VFIO would disable No Snoop through
PCI Config space, if this is done then the two would have the same
protections.

Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Acked-by: Hanjun Guo <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 807404d linux)
Signed-off-by: Koba Ko <[email protected]>
For SMMUv3 the parent must be an S2 domain, which can be composed
into an IOMMU_DOMAIN_NESTED.

In future the S2 parent will also need a VMID linked to the VIOMMU and
even to KVM.

Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 874b87c linux)
Signed-off-by: Koba Ko <[email protected]>
The arm-smmuv3-iommufd.c file will need to call these functions too.
Remove statics and put them in the header file. Remove the kunit
visibility protections from arm_smmu_make_abort_ste() and
arm_smmu_make_s2_domain_ste().

Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Mostafa Saleh <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f6681ab linux)
Signed-off-by: Koba Ko <[email protected]>
For SMMUv3 an IOMMU_DOMAIN_NESTED is composed of an S2 iommu_domain acting
as the parent and a user provided STE fragment that defines the CD table
and related data with addresses translated by the S2 iommu_domain.

The kernel only permits userspace to control certain allowed bits of the
STE that are safe for user/guest control.

IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
translation, but there is no way of knowing which S1 entries refer to a
range of S2.

For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
flush all ASIDs from the VMID after flushing the S2 on any change to the
S2.

The IOMMU_DOMAIN_NESTED can only be created from inside a VIOMMU as the
invalidation path relies on the VIOMMU to translate virtual stream ID used
in the invalidation commands for the CD table and ATS.
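
The user-provided STE fragment travels through the uAPI as a small struct
(per this series), of which only the kernel-vetted bits are honoured:

    struct iommu_hwpt_arm_smmuv3 {
        __aligned_u64 ste[2];
    };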

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 1e8be08 linux)
Signed-off-by: Koba Ko <[email protected]>
Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
works. When S2FWB is supported and enabled the IOPTE will force cachable
access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
access when !IOMMU_CACHE.

When using a single stage of translation, a simple S2 domain, it doesn't
change things for PCI devices as it is just a different encoding for the
existing mapping of the IOMMU protection flags to cachability attributes.
For non-PCI it also changes the combining rules when incoming transactions
have inconsistent attributes.

However, when used with a nested S1, FWB has the effect of preventing the
guest from choosing a MemAttr in its S1 that would cause ordinary DMA to
bypass the cache. Consistent with KVM we wish to deny the guest the
ability to become incoherent with cached memory the hypervisor believes is
cachable so we don't have to flush it.

Allow NESTED domains to be created if the SMMU has S2FWB support and use
S2FWB for NESTING_PARENTS. This is an additional option to CANWBS.

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Reviewed-by: Donald Dutile <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67e4fe3 linux)
Signed-off-by: Koba Ko <[email protected]>
The EATS flag needs to flow through the vSTE and into the pSTE, and ensure
physical ATS is enabled on the PCI device.

The physical ATS state must match the VM's idea of EATS as we rely on the
VM to issue the ATS invalidation commands. Thus ATS must remain off at the
device until EATS on a nesting domain turns it on. Attaching a nesting
domain is the point where the invalidation responsibility transfers to
userspace.

Update the ATS logic to track EATS for nesting domains and flush the
ATC whenever the S2 nesting parent changes.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit f27298a linux)
Signed-off-by: Koba Ko <[email protected]>
…domain

In a 1-stage translation setup, a device is attached to a paging domain.
In a 2-stage translation setup, a device is attached to a nested domain,
which does not have the mappings for the MSI page but only an s2_parent
paging domain pointer that holds the mappings.

Add arm_smmu_get_msi_mapping_domain in arm_smmu_nested_ops to return the
correct paging domain.

Signed-off-by: Nicolin Chen <[email protected]>
(cherry picked from commit c019f15752e65bdc6c28c480d60eb92e58ae9188 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)
Signed-off-by: Koba Ko <[email protected]>
Currently, iommu-dma is the only place outside of IOMMUFD and drivers
which might need to be aware of the stage 2 domain encapsulated within
a nested domain. This would still be the RMR solution where we're using
host-managed MSIs with an identity mapping at stage 1, where it is
the underlying stage 2 domain which owns an MSI cookie and holds the
corresponding dynamic mappings. Hook up the new op to resolve what we
need from a nested domain.

Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit d096dab 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit a556373 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d4223d6db2896ec510bfc57cf018010d07ff3659 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
This is used for GPU memory mapping. The solution is a WAR while waiting
for the upstream solution that would use dmabuf to map the entire range
in a single sequence.

Related topics:
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 88f15bf 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Fix typos/spellos in kernel-doc comments for readability.

Fixes: aad37e7 ("iommufd: IOCTLs for the io_pagetable")
Fixes: b7a0855 ("iommu: Add new flag to explictly request PASID capable domain")
Fixes: d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object")
Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7937a1b linux)
Signed-off-by: Koba Ko <[email protected]>
Commit 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
started using _iommufd_object_alloc() without importing the IOMMUFD
module namespace, resulting in a modpost warning:

  WARNING: modpost: module arm_smmu_v3 uses symbol _iommufd_object_alloc from namespace IOMMUFD, but does not import it.

Commit d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE
using a VIOMMU object") added another warning by using
iommufd_viommu_find_dev():

  WARNING: modpost: module arm_smmu_v3 uses symbol iommufd_viommu_find_dev from namespace IOMMUFD, but does not import it.

Import the IOMMUFD module namespace to resolve the warnings.

Fixes: 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
Link: https://patch.msgid.link/r/20241114-arm-smmu-v3-import-iommufd-module-ns-v1-1-c551e7b972e9@kernel.org
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6d026e6 linux)
Signed-off-by: Koba Ko <[email protected]>
Replace comma between expressions with semicolons.

Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.

Found by inspection.
No functional change intended.
Compile tested only.

Fixes: e3b1be2 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Chen Ni <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7de7d35 linux)
Signed-off-by: Koba Ko <[email protected]>
The function arm_smmu_init_strtab_2lvl uses the expression

((1 << smmu->sid_bits) - 1)

to calculate the largest StreamID value. However, this fails for the
maximum allowed value of SMMU_IDR1.SIDSIZE which is 32. The C standard
states:

"If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined."

With smmu->sid_bits being 32, the prerequisites for undefined behavior
are met.  We observed that the value of (1 << 32) is 1 and not 0 as we
initially expected.

Similar bit shift operations in arm_smmu_init_strtab_linear seem to not
be affected, because it appears to be unlikely for an SMMU to have
SMMU_IDR1.SIDSIZE set to 32 but then not support 2-level Stream tables.

This issue was found by Ryan Huang <[email protected]> on our team.
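
Conceptually, the fix widens the shift to 64 bits (a diff-style sketch,
assuming the last_sid_idx computation from the Fixes: commit):

    - u32 last_sid_idx = arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
    + u32 last_sid_idx = arm_smmu_strtab_l1_idx((1ULL << smmu->sid_bits) - 1);

With sid_bits == 32, 1ULL << 32 is a well-defined 0x100000000, so the
largest StreamID computes to 0xffffffff as intended.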

Fixes: ce41041 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Signed-off-by: Daniel Mentz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f63237f linux)
Signed-off-by: Koba Ko <[email protected]>
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.

As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.
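
A sketch of the idea (field names as recalled from the driver;
illustrative only):

    /*
     * The CPU number only load-balances traffic across the LVCMDQs, so
     * a stale value after preemption or migration is harmless.
     */
    u16 qidx = raw_smp_processor_id() % cmdqv->num_lvcmdqs_per_vintf;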

Cc: <[email protected]>
Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1f80621 linux)
Signed-off-by: Koba Ko <[email protected]>
When configuring a kernel with PAGE_SIZE=4KB, depending on its setting of
CONFIG_CMA_ALIGNMENT, VCMDQ_LOG2SIZE_MAX=19 could fail the alignment test
and trigger a WARN_ON:
    WARNING: at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3646
    Call trace:
     arm_smmu_init_one_queue+0x15c/0x210
     tegra241_cmdqv_init_structures+0x114/0x338
     arm_smmu_device_probe+0xb48/0x1d90

Fix it by capping max_n_shift to CMDQ_MAX_SZ_SHIFT as SMMUv3 CMDQ does.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a379971 linux)
Signed-off-by: Koba Ko <[email protected]>
Fix a sparse warning.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 89edbe8 linux)
Signed-off-by: Koba Ko <[email protected]>
…herent

It's observed that, when the first 4GB of system memory was reserved, all
VCMDQ allocations failed (even with the smallest qsz in the last attempt):
    arm-smmu-v3: found companion CMDQV device: NVDA200C:00
    arm-smmu-v3: option mask 0x10
    arm-smmu-v3: failed to allocate queue (0x8000 bytes) for vcmdq0
    acpi NVDA200C:00: tegra241_cmdqv: Falling back to standard SMMU CMDQ
    arm-smmu-v3: ias 48-bit, oas 48-bit (features 0x001e1fbf)
    arm-smmu-v3: allocated 524288 entries for cmdq
    arm-smmu-v3: allocated 524288 entries for evtq
    arm-smmu-v3: allocated 524288 entries for priq

This is because the 4GB reserved memory shifted the entire DMA zone from a
lower 32-bit range (on a system without the 4GB carveout) to higher range,
while the dev->coherent_dma_mask was set to DMA_BIT_MASK(32) by default.

The dma_set_mask_and_coherent() call is done in arm_smmu_device_hw_probe()
of the SMMU driver. So any DMA allocation from tegra241_cmdqv_probe() must
wait until the coherent_dma_mask is correctly set.

Move the vintf/vcmdq structure initialization routine into a different op,
"init_structures". Call it at the end of arm_smmu_init_structures(), where
standard SMMU queues get allocated.

Most of the impl_ops aren't ready until the vintf/vcmdq structures are init-ed.
So replace the full impl_ops with an init_ops in __tegra241_cmdqv_probe().

And switch to tegra241_cmdqv_impl_ops later in arm_smmu_init_structures().
Note that tegra241_cmdqv_impl_ops does not link to the new init_structures
op after this switch, since there is no point in having it once it's done.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: Matt Ochs <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/530993c3aafa1b0fc3d879b8119e13c629d12e2b.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 483e0bd linux)
Signed-off-by: Koba Ko <[email protected]>
This is likely a typo. Drop it.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/13fd3accb5b7ed6ec11cc6b7435f79f84af9f45f.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 2408b81 linux)
Signed-off-by: Koba Ko <[email protected]>
The ioremap() function doesn't return error pointers, it returns NULL
on error, so update the error handling. Also just return directly
instead of calling iounmap() on the NULL pointer. Calling
iounmap(NULL) doesn't cause a problem on ARM, but on other architectures
it can trigger a warning, so it's a bad habit.
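
The corrected pattern is roughly (identifiers illustrative):

    base = ioremap(res->start, resource_size(res));
    if (!base) {               /* ioremap() returns NULL, not ERR_PTR() */
        dev_err(dev, "failed to ioremap CMDQV MMIO region\n");
        return -ENOMEM;        /* return directly; never iounmap(NULL) */
    }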

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 086a3c4 linux)
Signed-off-by: Koba Ko <[email protected]>
…r_header

Kernel test robot reported a few truncation warnings at the snprintf:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:
	In function ‘tegra241_vintf_free_lvcmdq’:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:56:
	warning: ‘%u’ directive output may be truncated writing between 1 and
	5 bytes into a region of size between 3 and 11 [-Wformat-truncation=]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                                        ^~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:32: note: directive argument
	in the range [0, 65535]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:9: note: ‘snprintf’ output
	between 25 and 37 bytes into a destination of size 32
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  240 |                  vcmdq->vintf->idx, vcmdq->idx, vcmdq->lidx);

Fix by bumping up the size of the header to hold more characters.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit db184a1 linux)
Signed-off-by: Koba Ko <[email protected]>
NVIDIA is planning to productize a new Grace Hopper superchip
SKU with device ID 0x2348.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

Signed-off-by: Ankit Agrawal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit 12cd88a linux)
Signed-off-by: Koba Ko <[email protected]>
…d for uncached resmem

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) superchip, providing the CPU and
GPU cache coherent access to each other's memory over an internal
proprietary chip-to-chip cache coherent interconnect.

There is a HW defect on GH systems to support the Multi-Instance
GPU (MIG) feature [1] that necessitated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising regions 2 and 3) to
work around the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…to the VM

There is a HW defect on Grace Hopper (GH) to support the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising regions 2 and 3)
to work around the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region2 and 3) to the VM instead of the fake one. This
takes care of both the differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…status

In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these registers and check for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the time is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence keeping it generalized to both generations.

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
virtualization

This adds the following config options to annotations:

            CONFIG_ARM_SMMU_V3_IOMMUFD=y
            CONFIG_IOMMUFD_DRIVER_CORE=y
            CONFIG_IOMMUFD_VFIO_CONTAINER=y
            CONFIG_NVGRACE_GPU_VFIO_PCI=m
            CONFIG_VFIO_CONTAINER=n
            CONFIG_VFIO_IOMMU_TYPE1=-
            CONFIG_TEGRA241_CMDQV=n

For CMA size requirements, the 64K kernel configuration needs 640MB
in the worst-case scenario, while the 4K kernel configuration requires 40MB.
Due to the current CMA alignment requirement of 512MB on the 64k kernel and
128MB on the 4k kernel, use the following defaults:
            For 64k kernel, CONFIG_CMA_SIZE_MBYTES=1024
            For 4k kernel, CONFIG_CMA_SIZE_MBYTES=128
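
(That is, the worst-case 640MB rounds up to the next 512MB-aligned CMA
size, 1024MB, on the 64k kernel, while 40MB rounds up to the 128MB
alignment on the 4k kernel.)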

These config options have been defined in debian.master
            CONFIG_IOMMUFD=m
            CONFIG_IOMMU_IOPF=y

Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit 35a55f3 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>