Commit

v24.10.259
- update GPU benchmarks and Spherical Cube papers
- fixed OpenAppend mode for WinFile
- ResEditor: update tools & samples
azhirnov committed Oct 13, 2024
1 parent b3ac9e2 commit 8ec70d3
Showing 59 changed files with 2,360 additions and 470 deletions.
2 changes: 1 addition & 1 deletion AE/CMakeLists.txt
@@ -18,7 +18,7 @@ endif()
#----------------------------------------------------------

project( "AE"
VERSION 24.9.258 # year, month, version
VERSION 24.10.259 # year, month, version
LANGUAGES C CXX
DESCRIPTION "async game engine"
)
365 changes: 216 additions & 149 deletions AE/docs/papers/GPU_Benchmarks.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions AE/docs/papers/GraphicsNotes-ru.md
@@ -139,9 +139,9 @@ float3x3 ComputeTBNinFS (float2 uv, float3 worldPos)
* `[[unroll]]` significantly slows down pipeline compilation; in rare cases it gives a 2x speedup, but it often has little effect.
* On NV, mediump can run slower than highp; on mobile it behaves like fp16.
* For uint, `FindMSB` is 2x faster than `FindLSB`; for int, `FindLSB` can be faster.
* On NV/AMD/Intel, FP32ADD executes 2x faster than FP32FMA and FP32MUL.
* On NV/Intel, FP32ADD executes 2x faster than FP32FMA and FP32MUL and matches the peak performance from the specification.
* On mobile, FP32ADD, FP32MUL and FP32FMA execute in one cycle.
* Specifications count FMA as 2 instructions and state a 2x higher performance.
* Specifications count FMA as 2 instructions and state a 2x higher performance in FLOPS.
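
The FLOPS convention from the last bullet can be sketched as a small conversion helper. This is only an illustration: `Op`, `FlopsPerOp` and `ToGFlops` are hypothetical names that are not part of the engine, and the only assumption is the one stated above, that an FMA (or MulAdd) counts as 2 floating-point operations while Add and Mul count as 1.

```cpp
#include <cassert>

// Hypothetical instruction classes, used only for this illustration.
enum class Op { Add, Mul, FMA };

// Floating-point operations counted per instruction:
// FMA = one multiply + one add = 2, everything else = 1.
inline double FlopsPerOp (Op op)
{
    return op == Op::FMA ? 2.0 : 1.0;
}

// Converts a measured instruction rate (GOp/s) into GFLOPS.
inline double ToGFlops (double gops, Op op)
{
    assert( gops >= 0.0 );
    return gops * FlopsPerOp( op );
}
```

With the Adreno 660 numbers from this commit, 364 GOp/s on FMA corresponds to 728 GFLOPS, while 420 GOp/s on Add stays at 420 GFLOPS.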

**SFU** pipe (special function unit) - executes less common operations such as reciprocal (1/x), sqrt, sin, cos, exp, log, fract, ceil, round, sign and so on.
Most often there are 1-2 SFUs per 4 threads of a warp, so all of the listed operations are relatively slow; some of them execute as a single instruction, while others are emulated and take even longer.
47 changes: 47 additions & 0 deletions AE/docs/papers/SphericalCube-ru.md
@@ -61,3 +61,50 @@ UV куба дает распределение, близкое к равном
Code:<br/>
[correction in a compute shader](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/SphericalCube-4.as).<br/>
[correction in a fragment shader](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/SphericalCube-5.as).


## Projection from 2D

There are 2 ways to write data into a cubemap:
1. Use UV coordinates for each face and process the cube faces separately.
2. Render to a texture, projecting geometry that carries either a texture or UVs for procedural generation in the fragment shader.

### Render to texture

Projecting a square onto the border between cubemap faces produces incorrect UVs.

![](img/SC_RenderToTex_UVBug3D.png)

The square is drawn into the texture with distortion so that its proportions are preserved after projection to 3D, but because of this the UV interpolation works incorrectly.<br/>
This is how the cubemap face looks.

![](img/SC_RenderToTex_UVBug2D.png)

The tangential projection significantly improves UV interpolation, but the border is still noticeable.

![](img/SC_RenderToTex_UVBug_Tang.png)
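
A minimal sketch of a tangential warp, assuming that "tangential projection" here means the commonly used mapping of a face coordinate `s` in [-1, 1] through `tan(s * pi/4)`, which makes equal steps in `s` correspond to equal angular steps on the sphere. The names `TangentWarp` and `TangentUnwarp` are hypothetical, and the linked script may implement the projection differently.

```cpp
#include <cassert>
#include <cmath>

constexpr double kQuarterPi = 0.78539816339744830962;

// Face coordinate in [-1, 1] -> warped coordinate in [-1, 1].
// Equal steps of 's' give equal angular steps on the sphere.
inline double TangentWarp (double s)
{
    assert( s >= -1.0 && s <= 1.0 );
    return std::tan( s * kQuarterPi );
}

// Inverse mapping: warped coordinate -> original face coordinate.
inline double TangentUnwarp (double u)
{
    return std::atan( u ) / kQuarterPi;
}
```

The warp keeps the face center and the face edges in place (0 maps to 0, ±1 maps to ±1) and only redistributes the points in between.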

The problem is solved by placing additional points on the border between the faces.

[Code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/UVSphere-2.as)


## Sphere without geometry

When geometry is worth using:
* With high detail and a displacement map.
* When other geometry is located nearby. For example, buildings on a planet.
* With significant deformation of the geometry. For example, a collision of spheres.

In all other cases a procedural sphere without geometry is the better choice.
The only geometry is a square or a hexagon; the fragment shader computes the sphere normal at the given point from the UV. With additional calculations the depth can also be obtained and written to `gl_FragDepth`.
After that comes the per-pixel projection (correction) of the texture coordinates.
For a perspective projection the sphere normal has to be projected, because different parts of the sphere are visible depending on the distance between the camera and the sphere.
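
The normal computation described above can be sketched on the CPU as follows. This is a hedged illustration of the orthographic case only: it assumes UV in [-1, 1] across the quad and +Z pointing toward the viewer, and `Vec3` and `SphereNormalFromUV` are hypothetical names that do not come from the linked script. The perspective correction mentioned in the text is not shown.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Computes the unit sphere normal for a point 'uv' on the quad.
// Returns false outside the sphere silhouette, where a fragment
// shader would discard the pixel; 'normal' is left unchanged then.
inline bool SphereNormalFromUV (double u, double v, Vec3 &normal)
{
    assert( std::isfinite( u ) && std::isfinite( v ));

    const double r2 = u*u + v*v;
    if ( r2 > 1.0 )
        return false;

    // On a unit sphere x^2 + y^2 + z^2 = 1, so z follows from uv.
    normal = Vec3{ u, v, std::sqrt( 1.0 - r2 )};
    return true;
}
```

The `z` component is also the offset of the surface toward the viewer, which is the value a depth calculation would start from.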

Advantages:
* Saves memory on geometry.
* Low-detail geometry shows noticeable corners along the silhouette, while a procedural sphere is always perfectly round and antialiased at the edges.
* High-detail geometry uses many more fragment shader threads, because helper threads are invoked along the triangle edges (quad overdraw). A procedural sphere drawn with hexagon geometry uses far fewer threads.
* Performance is 2x higher, even on mobile.

[Code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/UVSphere-1.as)
12 changes: 6 additions & 6 deletions AE/docs/papers/bench/AMD_RX570.md
@@ -4,11 +4,11 @@
## Specs

* FP16: **5.095** TFLOPS (not supported in HW)
* FP32: **5.095** TFLOPS
* FP32: **5.095** TFLOPS (4.4 on FMA from tests)
* FP64: **318.5** GFLOPS
* Clock base: 1168 MHz, boost: 1244 MHz.
* Memory: 4GB, GDDR5, 256 bit, 1750 MHz, **224.0** GB/s (86 GB/s from tests)
* Driver: 2.0.106
* Driver: 2.0.279


## Shader
@@ -38,7 +38,7 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad

### Instruction cost

* [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
* All instructions benchmark [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
* fp32 FMA is preferred over a single FMul or a separate FMulAdd
* fp32 has fastest Length, Normalize (x1.0), Distance (x1.5)
* fp32 has fastest Clamp, ClampSNorm (x1.0), ClampUNorm (x1.0)
@@ -52,10 +52,10 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
* fp32 FastASin is x1.4 faster than native ASin
* fp32 Pow17 equal to Pow8 - native function used instead of MUL loop

* [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
* FP32 instruction benchmark [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
- Benchmarking in compute shader is a bit faster.

| TOp/s | ops | max TFLOPS | comments |
| TOp/s | ops | max TFLOPS |
|---|---|---|
| **2.2** | Add, Mul | **2.2** |
| **2.2** | FMA | **4.4** |
@@ -64,7 +64,7 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad

### NaN / Inf

* FP32, Mediump
* FP32, Mediump. [[11](../GPU_Benchmarks.md#11-NaN)]

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|---|---|---|---|---|---|---|---|---|
58 changes: 41 additions & 17 deletions AE/docs/papers/bench/ARM_Mali_G57.md
@@ -41,7 +41,7 @@ Red - full quad, blue - only 1 thread per quad.<br/>

### Subgroups

* Subgroups in fragment shader reserve threads for helper invocations, even if they are not executed. [[6](../GPU_Benchmarks.md#6-Subgroups)]
* Helper invocations can be terminated early, but their threads are still allocated: the number of warps is the same with and without helper invocations (from performance counters). [[6](../GPU_Benchmarks.md#6-Subgroups)]

* Subgroup occupancy for single triangle with texturing. Helper invocations are executed and included as active thread. Red color - full subgroup. [[6](../GPU_Benchmarks.md#6-Subgroups)]<br/>
![](img/full-subgroup/valhall-1-tex.png)
@@ -61,8 +61,6 @@ Subgroup occupancy, red - full subgroup (16 threads), green: ~8 threads per subg
Triangles with different `gl_InstanceIndex` can be merged into a single subgroup but this is a rare case.<br/>
![](img/unique-subgroups/valhall-1-inst.png)

* Helper invocations can be terminated early, but their threads are still allocated: the number of warps is the same with and without helper invocations (from performance counters).


### Subgroup threads order

@@ -84,9 +82,18 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
* [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
- Only fp32 FMA - *(fp16 and mediump use same fp32 FMA)*.
- Fp32 FMA is preferred over FMul or FMulAdd.
- Fp32 and i32 datapaths can execute in parallel at a 2:1 rate.
- Fp16 and mediump are 2x faster than fp32 in FMul, FAdd.
- Length is a bit faster than Distance and Normalize.
- ClampUNorm and ClampSNorm are fast.
* fp16x2 FMA is used, scalar FMA doesn't have x2 performance
* fp32 FastACos is x2.3 faster than native ACos
* fp32 FastASin is x2.6 faster than native ASin
* fp16 FastATan is x1.8 faster than native ATan
* fp16 FastACos is x3.7 faster than native ACos
* fp16 FastASin is x4.2 faster than native ASin
* fp32 Pow uses MUL loop - performance depends on power
* fp32 SignOrZero is x2.3 faster than Sign

* Fp32 performance: [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
- Loop unrolling doesn't change performance.
@@ -95,18 +102,25 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
- Graphics and compute have the same performance.
- Compute dispatch on 128 - 2K grid is faster.
- The compiler can optimize only addition, so the test combines Add and Sub.
- **60.7** GOp/s at 950 MHz on Add, Mul, MulAdd, FMA.
- Equal to **120** GFLOPS on MulAdd and FMA.

| GOp/s | op | GFLOPS |
|---|---|---|
| **60.7** | Add, Mul | 60.7 |
| **60.7** | MulAdd, FMA | **121** |

* Fp16 (half float) performance: [[1](../GPU_Benchmarks.md#1-fp16-instruction-performance)]:
- **60** GOp/s at 950 MHz on FMA - equal to F32FMA.
- **121** GOp/s at 950 MHz on Add, Mul, MulAdd.
- Equal to **240** GFLOPS on MulAdd.
- Measured at 950 MHz

| GOp/s | op | GFLOPS | comments |
|---|---|---|---|
| **60** | FMA | 120 | equal to F32FMA |
| **121** | Add, Mul | 121 |
| **121** | MulAdd | **240** |


### NaN / Inf

* FP32
* FP32. [[11](../GPU_Benchmarks.md#11-NaN)]

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|---|---|---|---|---|---|---|---|---|
@@ -164,6 +178,16 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
| VoronoiContour3FBM, octaves=4 | 16K | 21.5 | **34** | 1344 |



## Blending

* Blend vs Discard in FS: TODO: use new test
- 1x opaque: 2.3ms
- 3.2x discard: 7.3ms
- 3.7x blend `src + dst * (1 - src.a)`: 8.5ms
- 6.5x blend `src * (1 - dst.a) + dst * (1 - src.a)`: 15ms - accessing `dst` is slow!
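
The multipliers in the list above are relative to the opaque pass (for example 7.3 ms / 2.3 ms ≈ 3.2x). The two blend expressions can be written out per pixel as in the sketch below. It is only an illustration: `RGBA`, `BlendSrcOver` and `BlendBothAlpha` are hypothetical names, and applying the same factor to all four channels is an assumption, since the benchmark's actual blend state is not shown here.

```cpp
#include <cassert>

struct RGBA { float r, g, b, a; };

inline RGBA Scale (RGBA c, float k)  { return { c.r * k, c.g * k, c.b * k, c.a * k }; }
inline RGBA Add (RGBA x, RGBA y)     { return { x.r + y.r, x.g + y.g, x.b + y.b, x.a + y.a }; }

// src + dst * (1 - src.a)
// 'dst' is used only as a color operand.
inline RGBA BlendSrcOver (RGBA src, RGBA dst)
{
    assert( src.a >= 0.0f && src.a <= 1.0f );
    return Add( src, Scale( dst, 1.0f - src.a ));
}

// src * (1 - dst.a) + dst * (1 - src.a)
// 'dst.a' is additionally needed as a blend factor for 'src'.
inline RGBA BlendBothAlpha (RGBA src, RGBA dst)
{
    assert( src.a >= 0.0f && src.a <= 1.0f );
    assert( dst.a >= 0.0f && dst.a <= 1.0f );
    return Add( Scale( src, 1.0f - dst.a ), Scale( dst, 1.0f - src.a ));
}
```

The second function uses the destination alpha as a factor for the source color, which is the `dst` access that the last bullet reports as slow.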


## Resource access


@@ -210,14 +234,14 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
## Texture cache

* RGBA8_UNorm texture with random access [[9](../GPU_Benchmarks.md#9-Texture-cache)]
- Measured cache size: 16 KB, 256 KB, 1 MB.
- Measured cache size: 32 KB, 512 KB.

| size (KB) | dimension (px) | L2 bandwidth (GB/s) | external bandwidth (GB/s) | comment |
|---|---|---|---|---|
| 16 | 64x64 | 0.009 | 0.004 | **used only texture cache** |
| 32 | 128x64 | 0.38 | 0.004 | |
| 64 | 128x128 | 45 | 0.004 | **used L2 cache** |
| 128 | 256x128 | 45 | 0.004 | |
| 256 | 256x256 | 49 | 4 | |
| 512 | 512x256 | 49 | 7.6 | **L2 cache with 15% miss** |
| 1024 | 512x512 | 24 | 12.5 | **30% L2 miss, bottleneck on external memory** |
| 16 | 64x64 | 0.009 | 0.004 | **used only texture cache** |
| **32** | 128x64 | 0.38 | 0.004 | |
| 64 | 128x128 | 45 | 0.004 | **used L2 cache** |
| 128 | 256x128 | 45 | 0.004 | |
| 256 | 256x256 | 49 | 4 | |
| **512** | 512x256 | 49 | 7.6 | **L2 cache with 15% miss** |
| 1024 | 512x512 | 24 | 12.5 | **30% L2 miss, bottleneck on external memory** |
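
The `size (KB)` column follows directly from the texture dimensions and the 4 bytes per texel of RGBA8_UNorm. A small sketch of that arithmetic, where `TextureSizeKB` is a hypothetical helper and the texture is assumed to be uncompressed and without mipmaps:

```cpp
#include <cassert>
#include <cstdint>

// Size in KB of an uncompressed 2D texture without mipmaps.
// RGBA8_UNorm stores 4 bytes per texel, which is the default here.
inline std::uint64_t TextureSizeKB (std::uint32_t width, std::uint32_t height,
                                    std::uint32_t bytesPerTexel = 4)
{
    assert( width > 0 && height > 0 && bytesPerTexel > 0 );
    return (std::uint64_t{width} * height * bytesPerTexel) / 1024;
}
```

For example, 64x64 texels give 16 KB and 512x256 texels give 512 KB, matching the rows of the table.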
8 changes: 4 additions & 4 deletions AE/docs/papers/bench/ARM_Mali_T830.md
@@ -9,8 +9,8 @@
* Clock: 1000 MHz
* Bus width: 128 bits
* Memory: 2GB, LPDDR3, DC 32bit, 933MHz, **14.9**GB/s (4GB/s from tests)
* FP16 GFLOPS: **56**
* FP32 GFLOPS: **32**
* FP16 GFLOPS: **56** (10.4 on MulAdd from tests)
* FP32 GFLOPS: **32** (10.4 on MulAdd from tests)
* Device: Samsung J7 Neo (Android 9, Driver 28.0.0)

## Shader
@@ -49,13 +49,13 @@ Doesn't support quad and subgroups.
|---|---|---|
| **7.7** | Add | 7.7 |
| **10.1** | Mul | 10.1 |
| **5.2** | MulAdd | 10.4 |
| **5.2** | MulAdd | **10.4** |
| **1.3** | FMA | 2.6 |


### NaN / Inf

* FP32, Mediump
* FP32, Mediump. [[11](../GPU_Benchmarks.md#11-NaN)]

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|---|---|---|---|---|---|---|---|---|
2 changes: 1 addition & 1 deletion AE/docs/papers/bench/Adreno_505.md
@@ -54,7 +54,7 @@

### NaN / Inf

* FP32
* FP32. [[11](../GPU_Benchmarks.md#11-NaN)]

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|---|---|---|---|---|---|---|---|---|
37 changes: 21 additions & 16 deletions AE/docs/papers/bench/Adreno_660.md
@@ -4,8 +4,8 @@
## Specs

* Clock: 840 MHz (790?)
* F16 GFLOPS: **3244** (680 GOp/s on MulAdd from tests)
* F32 GFLOPS: **1622** (364 GOp/s on FMA from tests)
* F16 GFLOPS: **3244** (1414 on MulAdd from tests)
* F32 GFLOPS: **1622** (728 on FMA from tests)
* F64 GFLOPS: **405**
* GMem size: 1.5 Mb (bandwidth?)
* L2: ? (bandwidth?)
@@ -53,6 +53,11 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad

### Instruction cost

* All instructions benchmark [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
* fp32 FMA is preferred over a single FMul or a separate FMulAdd
* fp32 SignOrZero is x3.9 faster than Sign
- fp32 & i32 datapaths can execute in parallel in 2:1 rate.

* FP32 instruction benchmark [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
- Loop unrolling is fast during pipeline creation if loop < 256.
- Loop unrolling is 1x - 1.4x faster, 2x slower on 1024, 1.1x slower on 256.
@@ -63,21 +68,21 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad

| GOp/s | exec time (ms) | ops | max GFLOPS |
|---|---|---|---|
| **420** | 10.2 | F32Add, F32Mul | 420 |
| **364** | 11.8 | F32FMA, F32MulAdd | **728** |
| **420** | 10.2 | Add, Mul | 420 |
| **364** | 11.8 | FMA, MulAdd | **728** |

* FP16 instruction benchmark [[1](../GPU_Benchmarks.md#1-fp16-instruction-performance)]:

| GOp/s | exec time (ms) | ops | max GFLOPS |
|---|---|---|---|
| **830** | 5.16 | F16Add, F16Mul | 830 |
| **707** | 6.06 | F16MulAdd | **1414** |
| **117** | 36.5 | F16FMA | 234 |
| **830** | 5.16 | Add, Mul | 830 |
| **707** | 6.06 | MulAdd | **1414** |
| **117** | 36.5 | FMA | 234 |


### NaN / Inf

* FP32, FP16
* FP32, FP16. [[11](../GPU_Benchmarks.md#11-NaN)]

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|---|---|---|---|---|---|---|---|---|
@@ -177,14 +182,14 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
## Texture cache

* RGBA8_UNorm texture with random access [[9](../GPU_Benchmarks.md#9-Texture-cache)]
- Measured cache size: 2 KB, 128 KB.
- Measured cache size: 2 KB, 4 KB (?), 128 KB.
- 8 texels per pixel, dim ???

| size (KB) | dimension (px) | exec time (ms) | diff | approx bandwidth (GB/s) |
| size (KB) | dimension (px) | exec time (ms) | diff | approx bandwidth (GB/s) | comments |
|---|---|---|---|---|---|
| 1 | 16x16 | TODO | | |
| 2 | 32x16 | 2.3 | | TODO |
| 4 | 32x32 | 7 | 3 | |
| 16 | 64x64 | 12.4 | 1.8 | |
| 128 | 256x128 | 14 | | |
| 256 | 256x256 | 44 | 3 | |
| 1 | 16x16 | TODO | | |
| **2** | 32x16 | 2.3 | | TODO | L1 cache |
| **4** | 32x32 | 7 | **3** | |
| 16 | 64x64 | 12.4 | **1.8** | |
| **128** | 256x128 | 14 | | | L2 cache |
| 256 | 256x256 | 44 | **3** | |