
Add layer implementation of vkCreateDevice
solidpixel committed Dec 30, 2024
1 parent f795b71 commit a10ccbc
Showing 10 changed files with 274 additions and 576 deletions.
110 changes: 90 additions & 20 deletions layer_gpu_performance/README_LAYER.md
@@ -5,7 +5,7 @@ analyze the workloads that make up a single frame.

This layer supports two modes:

* Per workload time, read via Vulkan API queries
* Per workload performance counters, read via a non-API mechanism

## What devices are supported?
@@ -23,16 +23,16 @@ a way which is compatible with the way that a tile-based renderer schedules
render passes.

Under normal scheduling, tile-based renderers split render passes into two
pieces which are independently scheduled and that can overlap with other work
that is running on the GPU. Blindly timing render passes using timer queries
can result in confusing results because the reported time might include time
spent processing unrelated workloads that happen to be running in parallel.

The timing diagram below shows one possible arrangement of workloads scheduled
on the GPU hardware queues for an Arm 5th Generation architecture GPU. We are
trying to time render pass 1 indicated by the `1` characters in the diagram,
starting a timer query when this render pass starts (`S`) in the binning phase
queue, and stopping when it ends (`E`) in the main phase queue.

```
Compute: 222
```
@@ -41,16 +41,86 @@ stopping when it ends (`E`) in the main phase queue.

In this scenario the timer query correctly reflects the elapsed time of the
render pass, but does not give an accurate measure of its cost. The elapsed
time includes time where other workloads are running in parallel, indicated by
the `0`, `2`, and `3` characters. It also includes time between the two phases
where workload `1` is not running at all, because the binning phase work has
completed and the main phase work is stuck waiting for an earlier workload to
finish to free up the hardware.

To accurately cost workloads on a tile-based renderer, which will overlap and
run workloads in parallel if it is allowed to, the layer must inject additional
synchronization to serialize all workloads within a queue and across queues.
This ensures that timer query values reflect the cost of individual workloads,
however it also means that overall frame performance will be reduced due to
loss of workload parallelization.

# Design notes

## Dependencies

This layer uses timeline semaphores, so requires either Vulkan 1.2 or
the `VK_KHR_timeline_semaphore` extension.

## Implementing serialization

Cross-queue serialization is implemented using an injected timeline semaphore.
Each submit is assigned an incrementing `ID`, and will wait for `ID - 1` in the
timeline before starting, and set `ID` in the timeline when completing. This
allows us to implement serialization using a single sync primitive.
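The ID scheme above can be sketched as follows. This is an illustrative model only, not the layer's actual code; `TimelineSync` and its members are invented names:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the submit ID scheme. Submit N waits for the
// timeline to reach value N - 1 and signals value N on completion, so
// all submits complete strictly in ID order through one semaphore.
struct TimelineSync
{
    struct SyncValues
    {
        uint64_t waitValue;    // Wait for this timeline value before starting
        uint64_t signalValue;  // Signal this timeline value on completion
    };

    // Assign the next submit ID and derive its wait/signal values.
    SyncValues assignSubmit()
    {
        uint64_t id = nextID++;
        return { id - 1, id };
    }

    uint64_t nextID { 1 };  // Timeline starts at 0, so the first submit waits on 0
};
```

The first submit waits on value 0, which the timeline semaphore already holds at creation, so it starts immediately.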

Serialization within a queue is implemented by injecting a full pipeline
barrier before the pre-workload timer query, ensuring that all prior work has
completed before the time is sampled. Similarly we put a full pipeline barrier
after the post-workload timer query, ensuring that no following work starts
before the time is sampled.
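A hedged sketch of that barrier placement is shown below. The function name and parameters are invented for illustration, and a real implementation records the user workload between the two timestamps:

```cpp
#include <vulkan/vulkan.h>

// Illustrative recording sequence, not the layer's actual helper. A full
// pipeline barrier drains all earlier work before the pre-workload
// timestamp, and a matching barrier after the post-workload timestamp
// stops any later work from starting before the time is sampled.
static void recordTimedWorkloadBoundaries(VkCommandBuffer commandBuffer,
                                          VkQueryPool queryPool,
                                          uint32_t slot)
{
    VkMemoryBarrier barrier {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT;

    // Serialize: everything recorded earlier must finish first
    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);

    // Pre-workload timestamp
    vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                        queryPool, slot);

    // ... the user workload is recorded here ...

    // Post-workload timestamp
    vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                        queryPool, slot + 1);

    // Serialize: no later work may start until the timestamp lands
    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);
}
```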

## Implementing query lifetime tracking

Timer queries are implemented using query pools. The timer write commands are
recorded into each command buffer alongside the user commands. Each timer write
command targets specific counter slots in a specific query pool, so pool and
slot usage must be assigned when the command buffer is recorded.

Query pools in the layer are a managed resource. We allocate query pools on
demand, and maintain a free-list of query pools that have been freed and are
ready for reuse.

Query pools are allocated with enough space for 64 query results, which is, in
the best case, enough for 63 workloads (N+1 counters). This capacity is reduced
for render passes that use multi-view rendering, which allocate one counter
slot per view.
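A rough sketch of that slot arithmetic, under the assumption that a multi-view pass writes one slot per view at each of the N+1 workload boundaries (`slotsNeeded` and `fitsInOnePool` are invented helper names):

```cpp
#include <cassert>
#include <cstdint>

// POOL_SLOTS matches the 64-result pools described above.
constexpr uint32_t POOL_SLOTS = 64;

// Slots consumed by N serialized workloads, assuming one slot per view
// at each of the N + 1 workload boundaries.
constexpr uint32_t slotsNeeded(uint32_t workloads, uint32_t viewCount)
{
    return (workloads + 1) * viewCount;
}

constexpr bool fitsInOnePool(uint32_t workloads, uint32_t viewCount)
{
    return slotsNeeded(workloads, viewCount) <= POOL_SLOTS;
}
```

At one view per pass this gives the 63-workload best case; at two views a single pool covers at most 31 workloads.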

Query pools are assigned to a command buffer when recording, and multiple
query pools can be assigned to a single command buffer if more query result
space is needed. Query pools are fully reset on first use in the command
buffer. Query pools are returned to the layer free-list when the command buffer
is reset or destroyed.
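The recycling scheme can be sketched with integer handles standing in for real `VkQueryPool` objects; the class and method names are invented for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of on-demand allocation plus free-list reuse, with uint64_t
// handles standing in for VkQueryPool objects.
class QueryPoolAllocator
{
public:
    // Acquire a pool for a command buffer, reusing a freed one if possible.
    uint64_t acquire()
    {
        if (!freeList.empty())
        {
            uint64_t pool = freeList.back();
            freeList.pop_back();
            return pool;
        }
        return nextPool++;  // Stand-in for creating a new pool on demand
    }

    // Called when the owning command buffer is reset or destroyed.
    void release(uint64_t pool)
    {
        freeList.push_back(pool);
    }

private:
    std::vector<uint64_t> freeList;
    uint64_t nextPool { 0 };
};
```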

### Multi-submit command buffers

Reusable command buffers that are not one-time submit can be problematic for
this type of instrumentation.

A single primary command buffer could be submitted multiple times. This can be
managed by serializing the workloads and ensuring that the query results are
consumed between executions. This may impact performance due to additional
serialization, but it can be made to work.

**NOTE:** The impact of this case could be mitigated by having the layer
inject a command buffer after the user command buffer, which inserts a copy
command to copy the query results to a buffer. This buffer is owned by the
layer and can be N-buffered to avoid stalls.
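A minimal sketch of the N-buffering idea; the buffer depth of 3 is an assumed value, not from the source:

```cpp
#include <cassert>
#include <cstdint>

// Ring index for an N-buffered, layer-owned results buffer. Each submit
// copies its query results into the next region, so the CPU can read back
// completed regions without stalling the GPU.
constexpr uint32_t REGION_COUNT = 3;  // Assumed depth; tune to queue depth

constexpr uint32_t nextRegion(uint32_t current)
{
    return (current + 1) % REGION_COUNT;
}
```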

The more problematic case is where a single secondary command buffer is
executed multiple times from within the same primary. In this case there is
no opportunity to resolve the collision with CPU-side synchronization, and
relying only on CPU-side recording will capture only the last copy.

### Split command buffers

Vulkan 1.3 allows a dynamic render pass to be split over multiple command
buffers, although all parts must be part of the same queue submit call. The
layer will only emit timestamps for the final part of the render pass, and
will ignore the suspend/resume boundaries.
1 change: 1 addition & 0 deletions layer_gpu_performance/source/CMakeLists.txt
Expand Up @@ -53,6 +53,7 @@ add_library(
layer_device_functions_render_pass.cpp
layer_device_functions_trace_rays.cpp
layer_device_functions_transfer.cpp
layer_instance_functions_device.cpp
performance_comms.cpp)

target_include_directories(
14 changes: 7 additions & 7 deletions layer_gpu_performance/source/layer_device_functions.hpp
@@ -456,18 +456,18 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdCopyImageToBuffer2KHR<user_tag>(
// Functions for debug

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugMarkerMarkerInfoEXT* pMarkerInfo);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
VkCommandBuffer commandBuffer);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugUtilsLabelEXT* pLabelInfo);
@@ -480,29 +480,29 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdEndDebugUtilsLabelEXT<user_tag>(
// Functions for queues

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
VkQueue queue,
const VkPresentInfoKHR* pPresentInfo);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
VkQueue queue,
uint32_t submitCount,
const VkSubmitInfo* pSubmits,
VkFence fence);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
VkQueue queue,
uint32_t submitCount,
const VkSubmitInfo2* pSubmits,
VkFence fence);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2KHR<user_tag>(
VkQueue queue,
uint32_t submitCount,
8 changes: 4 additions & 4 deletions layer_gpu_performance/source/layer_device_functions_debug.cpp
@@ -31,7 +31,7 @@
extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugMarkerMarkerInfoEXT* pMarkerInfo
@@ -54,7 +54,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
VkCommandBuffer commandBuffer
) {
@@ -76,7 +76,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugUtilsLabelEXT* pLabelInfo
@@ -99,7 +99,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdEndDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer
) {
8 changes: 4 additions & 4 deletions layer_gpu_performance/source/layer_device_functions_queue.cpp
@@ -38,7 +38,7 @@ using namespace std::placeholders;
extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
VkQueue queue,
const VkPresentInfoKHR* pPresentInfo
@@ -67,7 +67,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
VkQueue queue,
uint32_t submitCount,
@@ -104,7 +104,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
VkQueue queue,
uint32_t submitCount,
@@ -141,7 +141,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2KHR<user_tag>(
VkQueue queue,
uint32_t submitCount,
38 changes: 38 additions & 0 deletions layer_gpu_performance/source/layer_instance_functions.hpp
@@ -0,0 +1,38 @@
/*
* SPDX-License-Identifier: MIT
* ----------------------------------------------------------------------------
* Copyright (c) 2024 Arm Limited
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
* deal in the Software without restriction, including without limitation the
* rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
* ----------------------------------------------------------------------------
*/

#pragma once

#include <vulkan/vulkan.h>

#include "framework/utils.hpp"

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkCreateDevice<user_tag>(
VkPhysicalDevice physicalDevice,
const VkDeviceCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkDevice* pDevice);
80 changes: 80 additions & 0 deletions layer_gpu_performance/source/layer_instance_functions_device.cpp
@@ -0,0 +1,80 @@
/*
* SPDX-License-Identifier: MIT
* ----------------------------------------------------------------------------
* Copyright (c) 2024 Arm Limited
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
* deal in the Software without restriction, including without limitation the
* rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
* ----------------------------------------------------------------------------
*/

#include <mutex>

#include "framework/manual_functions.hpp"

#include "device.hpp"
#include "layer_instance_functions.hpp"

extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkCreateDevice<user_tag>(
VkPhysicalDevice physicalDevice,
const VkDeviceCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkDevice* pDevice
) {
LAYER_TRACE(__func__);

// Hold the lock to access layer-wide global store
std::unique_lock<std::mutex> lock { g_vulkanLock };
auto* layer = Instance::retrieve(physicalDevice);

// Release the lock to call into the driver
lock.unlock();

auto* chainInfo = getChainInfo(pCreateInfo);
auto fpGetInstanceProcAddr = chainInfo->u.pLayerInfo->pfnNextGetInstanceProcAddr;
auto fpGetDeviceProcAddr = chainInfo->u.pLayerInfo->pfnNextGetDeviceProcAddr;

auto extensions = getDeviceExtensionList(
layer->instance, physicalDevice, pCreateInfo);

auto fpCreateDevice = reinterpret_cast<PFN_vkCreateDevice>(
fpGetInstanceProcAddr(layer->instance, "vkCreateDevice"));
if (!fpCreateDevice)
{
return VK_ERROR_INITIALIZATION_FAILED;
}

// Advance the link info for the next element on the chain
chainInfo->u.pLayerInfo = chainInfo->u.pLayerInfo->pNext;
auto res = fpCreateDevice(physicalDevice, pCreateInfo, pAllocator, pDevice);
if (res != VK_SUCCESS)
{
return res;
}

// Retake the lock to access layer-wide global store
lock.lock();
auto device = std::make_unique<Device>(layer, physicalDevice, *pDevice, fpGetDeviceProcAddr);
Device::store(*pDevice, std::move(device));

return VK_SUCCESS;
}