
Add layer implementation of vkCreateDevice
solidpixel committed Dec 30, 2024
1 parent f795b71 commit a10ccbc
Showing 10 changed files with 274 additions and 576 deletions.
110 changes: 90 additions & 20 deletions layer_gpu_performance/README_LAYER.md
@@ -5,7 +5,7 @@ analyze the workloads that make up a single frame.

This layer supports two modes:

* Per workload time, read via Vulkan API queries
* Per workload performance counters, read via a non-API mechanism

## What devices are supported?
@@ -23,16 +23,16 @@ a way which is compatible with the way that a tile-based renderer schedules
render passes.

Under normal scheduling, tile-based renderers split render passes into two
pieces which are independently scheduled and that can overlap with other work
that is running on the GPU. Blindly timing render passes using timer queries
can result in confusing results because the reported time might include time
spent processing unrelated workloads that happen to be running in parallel.

The timing diagram below shows one possible arrangement of workloads scheduled
on the GPU hardware queues for an Arm 5th Generation architecture GPU. We are
trying to time render pass 1 indicated by the `1` characters in the diagram,
starting a timer query when this render pass starts (`S`) in the binning phase
queue, and stopping when it ends (`E`) in the main phase queue.

```
Compute: 222
```
@@ -41,16 +41,86 @@ stopping when it ends (`E`) in the main phase queue.

In this scenario the timer query correctly reflects the elapsed time of the
render pass, but does not give an accurate measure of its cost. The elapsed
time includes time where other workloads are running in parallel, indicated by
the `0`, `2`, and `3` characters. It also includes time between the two phases
where workload `1` is not running at all, because the binning phase work has
completed and the main phase work is stuck waiting for an earlier workload to
finish to free up the hardware.

To accurately cost workloads on a tile-based renderer, which will overlap and
run workloads in parallel if it is allowed to, the layer must inject additional
synchronization to serialize all workloads within a queue and across queues.
This ensures that timer query values reflect the cost of individual workloads,
however it also means that overall frame performance will be reduced due to
loss of workload parallelization.

# Design notes

## Dependencies

This layer uses timeline semaphores, so requires either Vulkan 1.2 or
the `VK_KHR_timeline_semaphore` extension.

## Implementing serialization

Cross-queue serialization is implemented using an injected timeline semaphore.
Each submit is assigned an incrementing `ID`, and will wait for `ID - 1` in the
timeline before starting, and set `ID` in the timeline when completing. This
allows us to implement serialization using a single sync primitive.
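The ID scheme above can be sketched as follows. This is an illustrative model only, not the layer's actual code; `TimelineSync` and its members are invented names:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the submit ID scheme. Submit N waits for the
// timeline to reach value N - 1 and signals value N on completion, so
// all submits complete strictly in ID order through one semaphore.
struct TimelineSync
{
    struct SyncValues
    {
        uint64_t waitValue;    // Wait for this timeline value before starting
        uint64_t signalValue;  // Signal this timeline value on completion
    };

    // Assign the next submit ID and derive its wait/signal values.
    SyncValues assignSubmit()
    {
        uint64_t id = nextID++;
        return { id - 1, id };
    }

    uint64_t nextID { 1 };  // Timeline starts at 0, so the first submit waits on 0
};
```

The first submit waits on value 0, which the timeline semaphore already holds at creation, so it starts immediately.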

Serialization within a queue is implemented by injecting a full pipeline
barrier before the pre-workload timer query, ensuring that all prior work has
completed before the time is sampled. Similarly we put a full pipeline barrier
after the post-workload timer query, ensuring that no following work starts
before the time is sampled.
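A hedged sketch of that barrier placement is shown below. The function name and parameters are invented for illustration, and a real implementation records the user workload between the two timestamps:

```cpp
#include <vulkan/vulkan.h>

// Illustrative recording sequence, not the layer's actual helper. A full
// pipeline barrier drains all earlier work before the pre-workload
// timestamp, and a matching barrier after the post-workload timestamp
// stops any later work from starting before the time is sampled.
static void recordTimedWorkloadBoundaries(VkCommandBuffer commandBuffer,
                                          VkQueryPool queryPool,
                                          uint32_t slot)
{
    VkMemoryBarrier barrier {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT;

    // Serialize: everything recorded earlier must finish first
    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);

    // Pre-workload timestamp
    vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                        queryPool, slot);

    // ... the user workload is recorded here ...

    // Post-workload timestamp
    vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                        queryPool, slot + 1);

    // Serialize: no later work may start until the timestamp lands
    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);
}
```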

## Implementing query lifetime tracking

Timer queries are implemented using query pools. The timer write commands are
recorded into each command buffer alongside the user commands. Each timer write
command targets specific counter slots in a specific query pool, so pool and
slot usage must be assigned when the command buffer is recorded.

Query pools in the layer are a managed resource. We allocate query pools on
demand, and maintain a free-list of query pools that have been freed and are
ready for reuse.

Query pools are allocated with enough space for 64 query results, which is, in
the best case, enough for 63 workloads (N+1 counters). This capacity is reduced
for render passes that use multi-view rendering, which allocate one counter
slot per view.
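A rough sketch of that slot arithmetic, under the assumption that a multi-view pass writes one slot per view at each of the N+1 workload boundaries (`slotsNeeded` and `fitsInOnePool` are invented helper names):

```cpp
#include <cassert>
#include <cstdint>

// POOL_SLOTS matches the 64-result pools described above.
constexpr uint32_t POOL_SLOTS = 64;

// Slots consumed by N serialized workloads, assuming one slot per view
// at each of the N + 1 workload boundaries.
constexpr uint32_t slotsNeeded(uint32_t workloads, uint32_t viewCount)
{
    return (workloads + 1) * viewCount;
}

constexpr bool fitsInOnePool(uint32_t workloads, uint32_t viewCount)
{
    return slotsNeeded(workloads, viewCount) <= POOL_SLOTS;
}
```

At one view per pass this gives the 63-workload best case; at two views a single pool covers at most 31 workloads.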

Query pools are assigned to a command buffer when recording, and multiple
query pools can be assigned to a single command buffer if more query result
space is needed. Query pools are fully reset on first use in the command
buffer. Query pools are returned to the layer free-list when the command buffer
is reset or destroyed.
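The recycling scheme can be sketched with integer handles standing in for real `VkQueryPool` objects; the class and method names are invented for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of on-demand allocation plus free-list reuse, with uint64_t
// handles standing in for VkQueryPool objects.
class QueryPoolAllocator
{
public:
    // Acquire a pool for a command buffer, reusing a freed one if possible.
    uint64_t acquire()
    {
        if (!freeList.empty())
        {
            uint64_t pool = freeList.back();
            freeList.pop_back();
            return pool;
        }
        return nextPool++;  // Stand-in for creating a new pool on demand
    }

    // Called when the owning command buffer is reset or destroyed.
    void release(uint64_t pool)
    {
        freeList.push_back(pool);
    }

private:
    std::vector<uint64_t> freeList;
    uint64_t nextPool { 0 };
};
```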

### Multi-submit command buffers

Reusable command buffers that are not one-time submit can be problematic for
this type of instrumentation.

A single primary command buffer could be submitted multiple times. This can be
managed by serializing the workloads and ensuring that the query results are
consumed between executions. This may impact performance due to additional
serialization, but it can be made to work.

**NOTE:** The impact of this case could be mitigated by having the layer
inject a command buffer after the user command buffer, which inserts a copy
command to copy the query results to a buffer. This buffer is owned by the
layer and can be N-buffered to avoid stalls.
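A minimal sketch of the N-buffering idea; the buffer depth of 3 is an assumed value, not from the source:

```cpp
#include <cassert>
#include <cstdint>

// Ring index for an N-buffered, layer-owned results buffer. Each submit
// copies its query results into the next region, so the CPU can read back
// completed regions without stalling the GPU.
constexpr uint32_t REGION_COUNT = 3;  // Assumed depth; tune to queue depth

constexpr uint32_t nextRegion(uint32_t current)
{
    return (current + 1) % REGION_COUNT;
}
```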

The more problematic case is where a single secondary command buffer is
executed multiple times from within the same primary. In this case there is
no opportunity to resolve the collision with CPU-side synchronization, and
relying only on CPU-side recording will capture only the last copy.

### Split command buffers

Vulkan 1.3 allows a dynamic render pass to be split over multiple command
buffers, although all parts must be part of the same queue submit call. The
layer will only emit timestamps for the final part of the render pass, and
will ignore the suspend/resume boundaries.
1 change: 1 addition & 0 deletions layer_gpu_performance/source/CMakeLists.txt
Expand Up @@ -53,6 +53,7 @@ add_library(
layer_device_functions_render_pass.cpp
layer_device_functions_trace_rays.cpp
layer_device_functions_transfer.cpp
layer_instance_functions_device.cpp
performance_comms.cpp)

target_include_directories(
14 changes: 7 additions & 7 deletions layer_gpu_performance/source/layer_device_functions.hpp
@@ -456,18 +456,18 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdCopyImageToBuffer2KHR<user_tag>(
// Functions for debug

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugMarkerMarkerInfoEXT* pMarkerInfo);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
VkCommandBuffer commandBuffer);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugUtilsLabelEXT* pLabelInfo);
@@ -480,29 +480,29 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdEndDebugUtilsLabelEXT<user_tag>(
// Functions for queues

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
VkQueue queue,
const VkPresentInfoKHR* pPresentInfo);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
VkQueue queue,
uint32_t submitCount,
const VkSubmitInfo* pSubmits,
VkFence fence);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
VkQueue queue,
uint32_t submitCount,
const VkSubmitInfo2* pSubmits,
VkFence fence);

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2KHR<user_tag>(
VkQueue queue,
uint32_t submitCount,
8 changes: 4 additions & 4 deletions layer_gpu_performance/source/layer_device_functions_debug.cpp
@@ -31,7 +31,7 @@
extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugMarkerMarkerInfoEXT* pMarkerInfo
@@ -54,7 +54,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerBeginEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
VkCommandBuffer commandBuffer
) {
@@ -76,7 +76,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdDebugMarkerEndEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer,
const VkDebugUtilsLabelEXT* pLabelInfo
@@ -99,7 +99,7 @@ VKAPI_ATTR void VKAPI_CALL layer_vkCmdBeginDebugUtilsLabelEXT<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR void VKAPI_CALL layer_vkCmdEndDebugUtilsLabelEXT<user_tag>(
VkCommandBuffer commandBuffer
) {
8 changes: 4 additions & 4 deletions layer_gpu_performance/source/layer_device_functions_queue.cpp
@@ -38,7 +38,7 @@ using namespace std::placeholders;
extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
VkQueue queue,
const VkPresentInfoKHR* pPresentInfo
@@ -67,7 +67,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueuePresentKHR<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
VkQueue queue,
uint32_t submitCount,
@@ -104,7 +104,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
VkQueue queue,
uint32_t submitCount,
@@ -141,7 +141,7 @@ VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2<user_tag>(
}

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkQueueSubmit2KHR<user_tag>(
VkQueue queue,
uint32_t submitCount,
38 changes: 38 additions & 0 deletions layer_gpu_performance/source/layer_instance_functions.hpp
@@ -0,0 +1,38 @@
/*
* SPDX-License-Identifier: MIT
* ----------------------------------------------------------------------------
* Copyright (c) 2024 Arm Limited
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
* deal in the Software without restriction, including without limitation the
* rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
* ----------------------------------------------------------------------------
*/

#pragma once

#include <vulkan/vulkan.h>

#include "framework/utils.hpp"

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkCreateDevice<user_tag>(
VkPhysicalDevice physicalDevice,
const VkDeviceCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkDevice* pDevice);
80 changes: 80 additions & 0 deletions layer_gpu_performance/source/layer_instance_functions_device.cpp
@@ -0,0 +1,80 @@
/*
* SPDX-License-Identifier: MIT
* ----------------------------------------------------------------------------
* Copyright (c) 2024 Arm Limited
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
* deal in the Software without restriction, including without limitation the
* rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
* ----------------------------------------------------------------------------
*/

#include <mutex>

#include "framework/manual_functions.hpp"

#include "device.hpp"
#include "layer_instance_functions.hpp"

extern std::mutex g_vulkanLock;

/* See Vulkan API for documentation. */
template <>
VKAPI_ATTR VkResult VKAPI_CALL layer_vkCreateDevice<user_tag>(
VkPhysicalDevice physicalDevice,
const VkDeviceCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkDevice* pDevice
) {
LAYER_TRACE(__func__);

// Hold the lock to access layer-wide global store
std::unique_lock<std::mutex> lock { g_vulkanLock };
auto* layer = Instance::retrieve(physicalDevice);

// Release the lock to call into the driver
lock.unlock();

auto* chainInfo = getChainInfo(pCreateInfo);
auto fpGetInstanceProcAddr = chainInfo->u.pLayerInfo->pfnNextGetInstanceProcAddr;
auto fpGetDeviceProcAddr = chainInfo->u.pLayerInfo->pfnNextGetDeviceProcAddr;

auto extensions = getDeviceExtensionList(
layer->instance, physicalDevice, pCreateInfo);

auto fpCreateDevice = reinterpret_cast<PFN_vkCreateDevice>(
fpGetInstanceProcAddr(layer->instance, "vkCreateDevice"));
if (!fpCreateDevice)
{
return VK_ERROR_INITIALIZATION_FAILED;
}

// Advance the link info for the next element on the chain
chainInfo->u.pLayerInfo = chainInfo->u.pLayerInfo->pNext;
auto res = fpCreateDevice(physicalDevice, pCreateInfo, pAllocator, pDevice);
if (res != VK_SUCCESS)
{
return res;
}

// Retake the lock to access layer-wide global store
lock.lock();
auto device = std::make_unique<Device>(layer, physicalDevice, *pDevice, fpGetDeviceProcAddr);
Device::store(*pDevice, std::move(device));

return VK_SUCCESS;
}