
Commit

v24.10.260
- Base: cache info per CPU cluster.
- Base: fixed Unix file with OpenAppend.
- fixed GLFW compilation and Linux CI.
- ResEditor: add performance test for branching in shader.
- ResEditor: add load/stop op for attachment in script.
azhirnov committed Oct 19, 2024
1 parent 554f021 commit 2182035
Showing 70 changed files with 2,865 additions and 619 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/linux.yml
@@ -17,7 +17,7 @@ jobs:
- name: Configure dependencies
run: |
sudo apt install build-essential pkg-config libx11-dev libxcursor-dev \
libxinerama-dev libgl1-mesa-dev libglu-dev libasound2-dev libpulse-dev libudev-dev \
libxinerama-dev libasound2-dev libpulse-dev libudev-dev \
libxi-dev libxrandr-dev yasm liburing-dev libpng-dev libbz2-dev libwayland-dev \
libxkbcommon-dev
2 changes: 1 addition & 1 deletion AE/CMakeLists.txt
@@ -18,7 +18,7 @@ endif()
#----------------------------------------------------------

project( "AE"
VERSION 24.10.259 # year, month, version
VERSION 24.10.260 # year, month, version
LANGUAGES C CXX
DESCRIPTION "async game engine"
)
1 change: 1 addition & 0 deletions AE/docs/Papers.md
@@ -1,6 +1,7 @@

* [HDR Display](papers/HDR_Display.md)
* [GPU Benchmarks](papers/GPU_Benchmarks.md)
* [Collection of CPU and GPU architecture details](https://github.com/azhirnov/cpu-gpu-arch)

## rus

2 changes: 1 addition & 1 deletion AE/docs/engine/Build.md
@@ -11,7 +11,7 @@ Install [Android Studio](https://developer.android.com/studio) with NDK.
In terminal run:
```
sudo apt install build-essential pkg-config libx11-dev libxcursor-dev \
libxinerama-dev libgl1-mesa-dev libglu-dev libasound2-dev libpulse-dev libudev-dev \
libxinerama-dev libasound2-dev libpulse-dev libudev-dev \
libxi-dev libxrandr-dev yasm liburing-dev libpng-dev libbz2-dev libwayland-dev \
libxkbcommon-dev
```
95 changes: 76 additions & 19 deletions AE/docs/papers/GPU_Benchmarks.md
@@ -28,6 +28,8 @@ Other:
- [9](#9-Texture-cache)
- [10](#10-Shared-memory)
- [11](#11-NaN)
- [12](#12-Branching)
- [13](#13-Circle-geometry)


# Comparison of Results
@@ -44,7 +46,7 @@ Other:
| Adreno 5xx | ? | as large as possible | - | | - | - |
| Adreno 6xx | 64/128 | as large as possible | **yes** | | **yes** | no |
| AMD GCN4 | 64 | - | no | **yes** | no | ? |
| Apple M1 | 32 | 16x16 | ? | ? | ? | ? |
| Apple M1 | 32 | 16x16 | no | **yes** | no | no |
| ARM Mali Midgard gen4 | (4) | 16x16 | - | - | - | - |
| ARM Mali Valhall gen1 | 16 | 16x16 | **yes** | **yes** | **yes** (rare) | no |
| Intel UHD 6xx 9.5gen | 16 | - | no | no | no | ? |
@@ -54,10 +56,10 @@ Other:

## Shader instructions

* FMA and MAD has 2 instructions (Mul, Add) but can execute at 1 cycle.
* Some GPUs supports HFMA2 - FMA for half2 with 2x performance.
* FMA and MAD perform 2 operations (Mul, Add) but can execute in 1 cycle.
* Some GPUs support a 1-cycle HFMA2: FMA for half2 with 2x performance (2 operations for each of the 2 half values, 4 ops/cycle).
* Some GPUs support FAdd with 2x performance.
* FMA can be implemented only for fp32 and fp16 will lost performance to use this FMA, so MAD should be used instead.
* FMA may be implemented only for the fp32 type; fp16 loses performance when it goes through this F32FMA, so MAD should be used instead.
* GPUs have parallel datapaths for fp32 and i32, so the scheduler can execute an i32 instruction in parallel with fp32 without performance loss.
- NV Turing has a 1:1 fp32:i32 config.
- NV Ampere has 1 full fp32 and 1 fp32:i32, so it can **not** execute i32 in parallel without fp32 performance loss.
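
The throughput arithmetic in the bullets above can be sketched as follows. This is only an illustration of the counting, not measured data; the helper name `ops_per_cycle` and its parameters are assumptions introduced here.

```python
def ops_per_cycle(ops_per_instruction: int, lanes: int = 1, cycles: int = 1) -> float:
    """Arithmetic operations retired per cycle by one instruction (hypothetical helper)."""
    return ops_per_instruction * lanes / cycles


# fp32 FMA: Mul + Add (2 operations) on one value, issued in 1 cycle.
fma_f32 = ops_per_cycle(ops_per_instruction=2)

# HFMA2: the same Mul + Add applied to both halves of a half2, still in 1 cycle.
hfma2 = ops_per_cycle(ops_per_instruction=2, lanes=2)

print(f"fp32 FMA: {fma_f32} ops/cycle")
print(f"HFMA2:    {hfma2} ops/cycle ({hfma2 / fma_f32}x fp32 FMA)")
```
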
@@ -68,12 +70,34 @@ Other:
| Adreno 6xx | fma | mad | - | 1 | 2:1 |
| AMD GCN4 | fma | - | - | 1 | no |
| Apple M1 | fma | no | fma | 1 | 2:1 |
| ARM Mali Midgard gen4 | mad | no | mad | 1 | no |
| ARM Mali Midgard gen4 | **mad** | no | mad | 1 | no |
| ARM Mali Valhall gen1 | fma | mad | - | 1 | 2:1 |
| Intel UHD 6xx 9.5gen | fma | **fma** | - | **2** | 2:1 |
| NV RTX 20xx (Turing) | fma | **fma** | - | **2** | 1:1 **(specs)** |
| PowerVR B‑Series | fma | no | mad | 1 | 1:1 |


## Branching

How much slower the Mul and Matrix variants are than the uniform Branch variant. [[12](#12-Branching)]
* Uniform branching is faster on most GPUs.
* GPUs with a vector architecture run the Matrix version faster.
* If `Branch non-uniform < 2`, it indicates that the GPU cannot optimize short branches.
* If `Branch non-uniform` is much greater than `Mul non-uniform`, it indicates that branches have an additional cost.

| GPU | Mul uniform | Matrix uniform | Mul non-uniform | Branch non-uniform | Matrix non-uniform | Mul avg | Branch avg | Matrix avg |
|----------|---|---|---|---|---|---|---|---|
| Adreno 5xx | 1.6 | 0.88 | 1.9 | 2.1 | 2.7 | 1.72 | **1.54** | 1.78 |
| Adreno 6xx | 1.6 | 1.0 | 2.3 | 1.8 | 3.0 | 1.95 | **1.4** | 2.0 |
| AMD GCN4 | 1.7 | 0.94 | 2.3 | 1.6 | 2.6 | 2.0 | **1.3** | 1.8 |
| Apple M1 | 1.1 | 0.8 | 1.4 | 1.1 | 1.8 | 1.24 | **1.03** | 1.26 |
| ARM Mali Midgard gen4 | 1.5 | 0.7 | 1.8 | 1.3 | 2.4 | 1.64 | **1.1** | 1.57 |
| ARM Mali Valhall gen1 | 2.1 | 1.4 | 2.3 | 2.1 | 3.5 | 2.18 | **1.56** | 2.45 |
| Intel UHD 6xx 9.5gen | 1.3 | 0.87 | 1.9 | 1.2 | 2.6 | 1.59 | **1.07** | 1.71 |
| NV RTX 20xx (Turing) | 2.1 | 1.5 | 2.4 | 3.1 | 3.0 | 2.1 | 2.1 | 2.1 |
| PowerVR B‑Series | 2.3 | 1.5 | 2.6 | 3.5 | 3.1 | 2.46 | **2.25** | 2.33 |


## Subgroup threads order

| GPU | graphics (quads) | graphics (image) | compute wg:8x8 (threads) | compute (image) |
@@ -481,17 +505,37 @@ Other:

## Memory

| GPU | VRAM bandwidth from specs (GB/s) | VRAM bandwidth measured (GB/s) | RAM to VRAM bandwidth (GB/s) | GMem - part of L2 (KB) | L2 cache per SM (KB) | L2 bandwidth (GB/s) | L1 cache per SM (KB) | Texture cache - part of L1 (KB) | L1 bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|---|
| Adreno 505 | 6.4 | 5 | | 128 |
| Adreno 660 | 51.2 | 34 | | 1536 | 128 | ? | 4? | 2? | ? |
| AMD RX570 (GCN4) | 224.0 | 86 | | - |
| Apple M1 | 68.25 | | | ? |
| ARM Mali T830 (Midgard gen4) | 14.9 | 4 | | - |
| ARM Mali G57 (Valhall gen1) | 17.07 | 14.2 | | - | 512 | 49 | 32? | 32 | ? |
| Intel UHD 620 (9.5gen) | 29.8 | 23 | | - | 128 | 48? | 8? | 8? | 112? |
| NV RTX 2080 (Turing) | 448.0 | 403 | | - | 4096 | ? | 64 | 32 | ? |
| PowerVR BXM‑8‑256 | 51.2 | 14.2 | | ? | 1024 | ? | 256? | 256 | ? |
### RAM, VRAM

| GPU | VRAM bandwidth from specs (GB/s) | VRAM bandwidth measured (GB/s) | RAM to VRAM bandwidth from specs (GB/s) | RAM to VRAM bandwidth measured (GB/s) | VRAM to RAM bandwidth measured (GB/s) | RAM to RAM bandwidth measured (GB/s) |
|---|---|---|---|---|---|---|
| Adreno 505 | 6.4 | 5 | | | | |
| Adreno 660 | 51.2 | 34 | | | | |
| AMD RX570 (GCN4) | 224.0 | 86 | | | | |
| Apple M1 | 68.25 | | | | | |
| ARM Mali T830 (Midgard gen4) | 14.9 | 4 | | | | |
| ARM Mali G57 (Valhall gen1) | 17.07 | 14.2 | | | | |
| Intel UHD 620 (9.5gen) | 29.8 | 23 | | | | |
| NV RTX 2080 (Turing) | 448.0 | 403 | | | | |
| PowerVR BXM‑8‑256 | 51.2 | 14.2 | | | | |
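
One way to read the two filled columns is as a ratio of measured to specified bandwidth. The sketch below computes that ratio from the values in the table above; the name `efficiency` and the idea of reporting it as a percentage are assumptions made here, not part of the benchmark.

```python
# (spec GB/s, measured GB/s) copied from the table above.
# Apple M1 is omitted because it has no measured value.
VRAM_BANDWIDTH = {
    "Adreno 505": (6.4, 5),
    "Adreno 660": (51.2, 34),
    "AMD RX570 (GCN4)": (224.0, 86),
    "ARM Mali T830 (Midgard gen4)": (14.9, 4),
    "ARM Mali G57 (Valhall gen1)": (17.07, 14.2),
    "Intel UHD 620 (9.5gen)": (29.8, 23),
    "NV RTX 2080 (Turing)": (448.0, 403),
    "PowerVR BXM-8-256": (51.2, 14.2),
}


def efficiency(spec_gbps: float, measured_gbps: float) -> float:
    """Fraction of the specified bandwidth that was actually measured."""
    return measured_gbps / spec_gbps


for gpu, (spec, measured) in VRAM_BANDWIDTH.items():
    print(f"{gpu}: {efficiency(spec, measured):.0%} of spec")
```
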

### Cache

* GMem is the part of the L2 cache that is used to store attachments for TBDR.
- Adreno has dedicated memory.
- Mali uses L2, and sometimes an attachment can be evicted from L2 to RAM.

| GPU | GMem (KB) | L2 cache per SM (KB) | L2 bandwidth (GB/s) | L2 cache line (bytes) | L1 cache per SM (KB) | Texture cache - part of L1 (KB) | L1 bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|
| Adreno 505 | 128 | | | | | | |
| Adreno 660 | 1536 | 128 | ? | | 4? | 2? | ? |
| AMD RX570 (GCN4) | - | | | | | | |
| Apple M1 | ? | | | | | | |
| ARM Mali T830 (Midgard gen4) | 4 | | | 64 | | | |
| ARM Mali G57 (Valhall gen1) | 8 | 512 | 49 | 64 | 32? | 32 | ? |
| Intel UHD 620 (9.5gen) | - | 128 | 48? | | 8? | 8? | 112? |
| NV RTX 2080 (Turing) | - | 4096 | ? | | 64 | 32 | ? |
| PowerVR BXM‑8‑256 | ? | 1024 | ? | ? | 256? | 256 | ? |

## Render target compression

@@ -503,11 +547,12 @@ Other:
| Adreno 5xx | 4x4 | 2.5 | 2.7 | ? | ? | exec time |
| Adreno 6xx | 16x16 | 1.9 | 6.9 | ? | 3.3 | exec time |
| AMD GCN4 | 4x4 | 2.3 | 3 | 2.3 | 3 | exec time |
| Apple M1 |
| Apple M1 | 8x8 | 3.4 | 3.4 | 6.8 | 6.8 | exec time |
| Intel UHD 6xx 9.5gen | 8x8 | 1.6 | 1.8 | 1.8 | 1.85 | exec time |
| NV RTX 20xx | 4x4 | 3 | 3.2 | 4.1 | 4.1 | exec time |
| ARM Mali Valhall gen1 | 4x4 | 6.9 | 60 | - | - | **mem traffic** | only 32bit formats |
| PowerVR B‑Series | 8x8 | 23 | 134 | 24 | 134 | **mem traffic** |
| ARM Mali Valhall gen1 | 4x4 | 1.9 | 3.9 | 1.9 | 3.7 | exec time | only 32bit formats, **V2** |
| ARM Mali Valhall gen1 | 4x4 | 5.9 | 19 | 5.7 | 20 | **mem traffic** | used performance counters |
| PowerVR B‑Series | 8x8 | 23 | 134 | 24 | 134 | **mem traffic** | used performance counters |


# Test Sources
@@ -568,3 +613,15 @@ Expected hierarchy:
### 11. NaN

[code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/tests/NaN.as)

### 12. Branching

Transforms a 2D vector into a 3D cube face. The uniform version uses the same cube face for the whole warp; the non-uniform version uses a unique cube face per thread.

[code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/perf/Branching-1.as)
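
The three variants named in the comparison table (Branch, Mul, Matrix) can be sketched on the CPU as below. This is not the code from the linked script: the face numbering, the axis convention, and all function names are hypothetical and chosen only so that the three variants produce identical results.

```python
# One 3x3 matrix per cube face, applied to the column vector (u, v, 1).
# Face order (+X, -X, +Y, -Y, +Z, -Z) is an assumption made for this sketch.
FACE_MATRICES = [
    [[0, 0, 1], [0, 1, 0], [-1, 0, 0]],   # +X: ( 1, v, -u)
    [[0, 0, -1], [0, 1, 0], [1, 0, 0]],   # -X: (-1, v,  u)
    [[1, 0, 0], [0, 0, 1], [0, -1, 0]],   # +Y: ( u, 1, -v)
    [[1, 0, 0], [0, 0, -1], [0, 1, 0]],   # -Y: ( u, -1, v)
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],    # +Z: ( u, v,  1)
    [[-1, 0, 0], [0, 1, 0], [0, 0, -1]],  # -Z: (-u, v, -1)
]


def candidates(u, v):
    """All six face results, in the same order as FACE_MATRICES."""
    return [
        (1.0, v, -u), (-1.0, v, u),
        (u, 1.0, -v), (u, -1.0, v),
        (u, v, 1.0), (-u, v, -1.0),
    ]


def face_branch(u, v, face):
    """Branch variant: only the selected path is evaluated."""
    if face == 0:
        return (1.0, v, -u)
    elif face == 1:
        return (-1.0, v, u)
    elif face == 2:
        return (u, 1.0, -v)
    elif face == 3:
        return (u, -1.0, v)
    elif face == 4:
        return (u, v, 1.0)
    return (-u, v, -1.0)


def face_mul(u, v, face):
    """Mul variant: evaluate every path, then select with 0/1 weights."""
    result = [0.0, 0.0, 0.0]
    for i, c in enumerate(candidates(u, v)):
        weight = float(face == i)  # 1.0 for the selected face, else 0.0
        for k in range(3):
            result[k] += weight * c[k]
    return tuple(result)


def face_matrix(u, v, face):
    """Matrix variant: index a per-face matrix and multiply by (u, v, 1)."""
    m = FACE_MATRICES[face]
    return tuple(float(row[0] * u + row[1] * v + row[2]) for row in m)
```

In the uniform case every thread in a warp would call these with the same `face`; in the non-uniform case `face` differs per thread, which is where the Branch variant can diverge while the Mul and Matrix variants keep a single control-flow path.
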

### 13. Circle geometry

[Small circles](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/perf/CircleQuadOverdraw-1.as)<br/>
[Large circles blending](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/perf/CircleQuadOverdraw-2.as)

6 changes: 5 additions & 1 deletion AE/docs/papers/GraphicsNotes-ru.md
@@ -131,10 +131,14 @@ float3x3 ComputeTBNinFS (float2 uv, float3 worldPos)

## Branching in shaders

Strongly depends on how uniform the control flow is within a warp (uniform control flow).
If all threads take the same path, branching is faster than multiplication; if they take different paths, multiplication will be faster on some GPUs (NVidia).


## Shader micro-optimization

* The compiler replaces repeated divisions with a single reciprocal (1/x) and multiplications.
* Implementing `Sign` via `Step`, which returns -1 or 1, is much faster than `SignOrZero` (`sign` from GLSL), and `copysign` from MSL is faster than `Step`.
* Implementing `FastSign` via `Step`, which returns -1 or 1, is much faster than `SignOrZero` (`sign` from GLSL), and `copysign` from MSL is faster than `Step`.
* `FMA` on mobile runs through `fp32 FMA`, while on NV and Intel it uses `fp16 FMA x2`, which is 2x faster than fp32 for half2, half4.
* `[[unroll]]` greatly slows down pipeline compilation; in rare cases it gives a 2x speedup, but it often has little effect.
* On NV, mediump can run slower than highp; on mobile it behaves like fp16.
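
The difference between the sign variants mentioned above is in what they return for zero. The sketch below shows only those semantics on the CPU and says nothing about GPU performance. The function bodies are assumptions: the engine's actual `FastSign` and `SignOrZero` implementations are not shown in this diff.

```python
import math


def step(edge: float, x: float) -> float:
    """GLSL-style step: 0.0 if x < edge, else 1.0."""
    return 0.0 if x < edge else 1.0


def fast_sign(x: float) -> float:
    """Sign built from step: returns -1.0 or 1.0, never 0.0 (zero maps to 1.0)."""
    return step(0.0, x) * 2.0 - 1.0


def sign_or_zero(x: float) -> float:
    """GLSL-style sign: returns -1.0, 0.0 or 1.0."""
    return float((x > 0.0) - (x < 0.0))


def copysign_one(x: float) -> float:
    """copysign-based variant: copies the sign bit of x onto 1.0."""
    return math.copysign(1.0, x)


for value in (-3.0, 0.0, 2.5):
    print(value, fast_sign(value), sign_or_zero(value), copysign_one(value))
```

Note that the copysign variant distinguishes `-0.0` from `0.0`, while the step-based one maps both to `1.0`.
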