v24.10.259

- update GPU benchmarks and Spherical Cube papers
- fixed OpenAppend mode for WinFile
- ResEditor: update tools & samples
azhirnov · Oct 13, 2024 · commit 8ec70d3
1 parent b3ac9e2

Showing 59 changed files with 2,360 additions and 470 deletions.
diff --git a/AE/CMakeLists.txt b/AE/CMakeLists.txt
@@ -18,7 +18,7 @@ endif()
 #----------------------------------------------------------
 
 project( "AE"
-		 VERSION 24.9.258	# year, month, version
+		 VERSION 24.10.259	# year, month, version
 		 LANGUAGES C CXX
 		 DESCRIPTION "async game engine"
 		)

diff --git a/AE/docs/papers/GPU_Benchmarks.md b/AE/docs/papers/GPU_Benchmarks.md
diff --git a/AE/docs/papers/GraphicsNotes-ru.md b/AE/docs/papers/GraphicsNotes-ru.md
@@ -139,9 +139,9 @@ float3x3  ComputeTBNinFS (float2 uv, float3 worldPos)
 * `[[unroll]]` significantly slows down pipeline compilation; in rare cases it gives a 2x speedup, but often it has little effect.
 * On NV, mediump can run slower than highp; on mobile devices it behaves like fp16.
 * For uint, `FindMSB` is 2x faster than `FindLSB`; for int, `FindLSB` may be faster.
-* On NV/AMD/Intel, FP32ADD executes 2x faster than FP32FMA and FP32MUL.
+* On NV/Intel, FP32ADD executes 2x faster than FP32FMA and FP32MUL and matches the peak performance stated in the specification.
 * On mobile GPUs, FP32ADD, FP32MUL, and FP32FMA each execute in one cycle.
-* Specifications count FMA as 2 instructions and state 2x higher performance.
+* Specifications count FMA as 2 instructions and state 2x higher performance in FLOPS.
 
 **SFU** pipe (special function unit) executes less common operations such as reciprocal (1/x), sqrt, sin, cos, exp, log, fract, ceil, round, sign, etc.
 Most often there are 1-2 SFUs per 4 warp threads, so all the listed operations are relatively slow; some of them execute as a single instruction, while others are emulated and take even longer.
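
Note on the Op/s and FLOPS columns used in the benchmark tables of this commit: the hunk above says specifications count FMA as two operations. A minimal sketch of that conversion follows, assuming one FLOP for Add and Mul and two for MulAdd and FMA. The names `FLOPS_PER_OP` and `to_flops` are hypothetical and do not come from the repository.

```python
# Hypothetical helper: convert a measured instruction rate (Op/s) into FLOPS.
# Assumption: Add and Mul count as 1 FLOP, MulAdd and FMA count as 2 FLOPs,
# which is how specifications are said to count FMA.
FLOPS_PER_OP = {"Add": 1, "Mul": 1, "MulAdd": 2, "FMA": 2}


def to_flops(ops_per_second: float, op: str) -> float:
    """Return FLOPS in the same unit prefix (G, T) as ops_per_second."""
    return ops_per_second * FLOPS_PER_OP[op]


if __name__ == "__main__":
    # Rates taken from the tables in this commit (GOp/s for Adreno 660).
    for rate, op in [(420, "Add"), (364, "FMA"), (707, "MulAdd")]:
        print(f"{op}: {rate} GOp/s -> {to_flops(rate, op)} GFLOPS")
```

With the Adreno 660 figures from this commit, 364 GOp/s on FMA corresponds to the 728 GFLOPS listed in its Specs section.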

diff --git a/AE/docs/papers/SphericalCube-ru.md b/AE/docs/papers/SphericalCube-ru.md
@@ -61,3 +61,50 @@ UV куба дает распределение,  близкое к равном
 Code:<br/>
 [correction in a compute shader](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/SphericalCube-4.as).<br/>
 [correction in a fragment shader](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/SphericalCube-5.as).
+
+
+## Projection from 2D
+
+There are 2 ways to write data into a cubemap:
+1. Use UV coordinates for each face and process the cube faces separately.
+2. Render to texture, projecting geometry that carries a texture or UVs for procedural generation in the fragment shader.
+
+### Render to texture
+
+Projecting a square onto the boundary between cubemap faces produces incorrect UVs.
+
+![](img/SC_RenderToTex_UVBug3D.png)
+
+The square is drawn into the texture with distortion so that it keeps its proportions after projection to 3D, but because of this UV interpolation works incorrectly.<br/>
+This is how a cubemap face looks.
+
+![](img/SC_RenderToTex_UVBug2D.png)
+
+Tangential projection significantly improves UV interpolation, but the boundary is still noticeable.
+
+![](img/SC_RenderToTex_UVBug_Tang.png)
+
+The problem is solved by placing additional points on the boundary between faces.
+
+[Code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/UVSphere-2.as)
+
+
+## Sphere without geometry
+
+When geometry is worth using:
+* With high detail and a displacement map.
+* When other geometry is located nearby, for example buildings on a planet.
+* With significant deformation of the geometry, for example colliding spheres.
+
+In other cases a procedural sphere without geometry is more efficient.
+The only geometry is a square or a hexagon; the fragment shader computes the sphere normal at the given point from UV. With additional calculations the depth can also be obtained and written to `gl_FragDepth`.
+Next comes the per-pixel projection (correction) of the texture coordinates.
+For perspective projection the sphere normal must be projected, because different parts of the sphere are visible depending on the distance between the camera and the sphere.
+
+Advantages:
+* Saves memory on geometry.
+* Low-detail geometry shows visible corners along the edges, while a procedural sphere is always perfectly round and antialiased at the edges.
+* High-detail geometry uses many more fragment shader threads, because helper threads are invoked along triangle edges (quad overdraw). A procedural sphere drawn with hexagon geometry uses far fewer threads.
+* Performance is 2x higher even on mobile GPUs.
+
+[Code](https://github.com/azhirnov/as-en/blob/dev/AE/samples/res_editor/_data/scripts/sphere/UVSphere-1.as)
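
Note on the hunk above: the step "compute the sphere normal from UV" can be sketched as follows. This is a minimal illustration, not the code from `UVSphere-1.as`. It assumes an orthographic view, UV in the range [0, 1] across the quad, and a unit sphere centered in the quad. The perspective case needs the extra projection of the normal that the text mentions and is not covered here. The function name `sphere_normal_from_uv` is hypothetical.

```python
import math
from typing import Optional, Tuple


def sphere_normal_from_uv(u: float, v: float) -> Optional[Tuple[float, float, float]]:
    """Return the unit normal of a procedural sphere at quad coordinate (u, v).

    Assumptions (not from the source): orthographic view, u and v in [0, 1],
    sphere of radius 1 centered in the quad, +Z pointing toward the camera.
    Returns None where the fragment lies outside the sphere (a shader would discard).
    """
    # Remap UV from [0, 1] to [-1, 1] so that (0, 0) is the sphere center.
    x = 2.0 * u - 1.0
    y = 2.0 * v - 1.0
    r2 = x * x + y * y
    if r2 > 1.0:
        return None  # outside the silhouette
    # On a unit sphere x^2 + y^2 + z^2 = 1, so z follows from x and y.
    z = math.sqrt(1.0 - r2)
    return (x, y, z)


if __name__ == "__main__":
    print(sphere_normal_from_uv(0.5, 0.5))  # quad center, normal faces the camera
    print(sphere_normal_from_uv(0.0, 0.0))  # quad corner, outside the sphere
```

Under the same assumptions, the returned `z` is also the height of the surface above the quad plane, which is the value that would be turned into a depth for `gl_FragDepth`.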
diff --git a/AE/docs/papers/bench/AMD_RX570.md b/AE/docs/papers/bench/AMD_RX570.md
@@ -4,11 +4,11 @@
 ## Specs
 
 * FP16: **5.095** TFLOPS (not supported in HW)
-* FP32: **5.095** TFLOPS
+* FP32: **5.095** TFLOPS (4.4 on FMA from tests)
 * FP64: **318.5** GFLOPS
 * Clock base: 1168 MHz, boost: 1244 MHz.
 * Memory: 4GB, GDDR5, 256 bit, 1750 MHz, **224.0** GB/s (86 GB/s from tests)
-* Driver: 2.0.106
+* Driver: 2.0.279
 
 
 ## Shader
@@ -38,7 +38,7 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 
 ### Instruction cost
 
-* [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
+* All instructions benchmark [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
 	* fp32 FMA is preferred than single FMul or separate FMulAdd
 	* fp32 has fastest Length,  Normalize (x1.0),  Distance (x1.5)
 	* fp32 has fastest Clamp,  ClampSNorm (x1.0),  ClampUNorm (x1.0)
@@ -52,10 +52,10 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 	* fp32 FastASin is x1.4 faster than native ASin
 	* fp32 Pow17 equal to Pow8 - native function used instead of MUL loop
 
-* [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
+* FP32 instruction benchmark [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
 	- Benchmarking in compute shader is a bit faster.
 
-	| TOp/s | ops | max TFLOPS | comments |
+	| TOp/s | ops | max TFLOPS |
 	|---|---|---|
 	| **2.2** | Add, Mul | **2.2** |
 	| **2.2** | FMA      | **4.4** |
@@ -64,7 +64,7 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 
 ### NaN / Inf
 
-* FP32, Mediump
+* FP32, Mediump. [[11](../GPU_Benchmarks.md#11-NaN)]
 
 	| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
 	|---|---|---|---|---|---|---|---|---|

diff --git a/AE/docs/papers/bench/ARM_Mali_G57.md b/AE/docs/papers/bench/ARM_Mali_G57.md
@@ -41,7 +41,7 @@ Red - full quad, blue - only 1 thread per quad.<br/>
 
 ### Subgroups
 
-* Subgroups in fragment shader reserve threads for helper invocations, even if they are not executed. [[6](../GPU_Benchmarks.md#6-Subgroups)]
+* Helper invocation can be early terminated, but threads are allocated and number of warps with helper invocations and without are same (from performance counters). [[6](../GPU_Benchmarks.md#6-Subgroups)]
 
 * Subgroup occupancy for single triangle with texturing. Helper invocations are executed and included as active thread. Red color - full subgroup. [[6](../GPU_Benchmarks.md#6-Subgroups)]<br/>
 ![](img/full-subgroup/valhall-1-tex.png)
@@ -61,8 +61,6 @@ Subgroup occupancy, red - full subgroup (16 threads), green: ~8 threads per subg
 Triangles with different `gl_InstanceIndex` can be merged into a single subgroup but this is a rare case.<br/>
 ![](img/unique-subgroups/valhall-1-inst.png)
 
-* Helper invocation can be early terminated, but threads are allocated and number of warps with helper invocations and without are same (from performance counters).
-
 
 ### Subgroup threads order
 
@@ -84,9 +82,18 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 * [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
 	- Only fp32 FMA - *(fp16 and mediump use same fp32 FMA)*.
 	- Fp32 FMA is preferred than FMul or FMulAdd.
+	- Fp32 and i32 datapaths can execute in parallel in 2:1 rate
 	- Fp16 and mediump is 2x faster than fp32 in FMull, FAdd.
 	- Length is a bit faster than Distance and Normalize.
 	- ClampUNorm and ClampSNorm are fast.
+	* fp16x2 FMA is used, scalar FMA doesn't have x2 performance
+	* fp32 FastACos is x2.3 faster than native ACos
+	* fp32 FastASin is x2.6 faster than native ASin
+	* fp16 FastATan is x1.8 faster than native ATan
+	* fp16 FastACos is x3.7 faster than native ACos
+	* fp16 FastASin is x4.2 faster than native ASin
+	* fp32 Pow uses MUL loop - performance depends on power
+	* fp32 SignOrZero is x2.3 faster than Sign
 
 * Fp32 performance: [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
 	- Loop unrolling doesn't change performance.
@@ -95,18 +102,25 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 	- Graphics and compute has same performance.
 	- Compute dispatch on 128 - 2K grid is faster.
 	- Compiler can optimize only addition, so test combine Add and Sub.
-	- **60.7** GOp/s at 950 MHz on Add, Mul, MulAdd, FMA.
-	- Equal to **120** GFLOPS on MulAdd and FMA.
+
+	| Gop/s | op | GFLOPS |
+	|---|---|---|
+	| **60.7** | Add, Mul    | 60.7 |
+	| **60.7** | MulAdd, FMA | **121** |
 
 * Fp16 (half float) performance: [[1](../GPU_Benchmarks.md#1-fp16-instruction-performance)]:
-	-  **60** GOp/s at 950 MHz on FMA - equal to F32FMA.
-	- **121** GOp/s at 950 MHz on Add, Mul, MulAdd.
-	- Equal to **240** GFLOPS on MulAdd.
+	- Measured at 950 MHz
+
+	| Gop/s | op | GFLOPS | comments |
+	|---|---|---|---|
+	| **60**  | FMA      | 120 | equal to F32FMA |
+	| **121** | Add, Mul | 121 |
+	| **121** | MulAdd   | **240** |
 
 
 ### NaN / Inf
 
-* FP32
+* FP32. [[11](../GPU_Benchmarks.md#11-NaN)]
 
 	| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
 	|---|---|---|---|---|---|---|---|---|
@@ -164,6 +178,16 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 | VoronoiContour3FBM, octaves=4 | 16K   | 21.5 | **34** | 1344 |
 
 
+
+## Blending
+
+* Blend vs Discard in FS: TODO: use new test
+	- 1x   opaque: 2.3ms
+	- 3.2x discard: 7.3ms
+	- 3.7x blend `src + dst * (1 - src.a)`: 8.5ms
+	- 6.5x blend `src * (1 - dst.a) + dst * (1 - src.a)`: 15ms - accessing `dst` is slow!
+
+
 ## Resource access
 
 
@@ -210,14 +234,14 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 ## Texture cache
 
 * RGBA8_UNorm texture with random access [[9](../GPU_Benchmarks.md#9-Texture-cache)]
-	- Measured cache size: 16 KB, 256 KB, 1 MB.
+	- Measured cache size: 32 KB, 512 KB.
 
 	| size (KB) | dimension (px) | L2 bandwidth (GB/s) | external bandwidth (GB/s) | comment |
 	|---|---|---|---|
-	| 16   |  64x64  | 0.009 | 0.004 | **used only texture cache** |
-	| 32   | 128x64  | 0.38  | 0.004 | |
-	| 64   | 128x128 | 45    | 0.004 | **used L2 cache** |
-	| 128  | 256x128 | 45    | 0.004 | |
-	| 256  | 256x256 | 49    | 4     | |
-	| 512  | 512x256 | 49    | 7.6   | **L2 cache with 15% miss** |
-	| 1024 | 512x512 | 24    | 12.5  | **30% L2 miss, bottleneck on external memory** |
+	| 16      |  64x64  | 0.009 | 0.004 | **used only texture cache** |
+	| **32**  | 128x64  | 0.38  | 0.004 | |
+	| 64      | 128x128 | 45    | 0.004 | **used L2 cache** |
+	| 128     | 256x128 | 45    | 0.004 | |
+	| 256     | 256x256 | 49    | 4     | |
+	| **512** | 512x256 | 49    | 7.6   | **L2 cache with 15% miss** |
+	| 1024    | 512x512 | 24    | 12.5  | **30% L2 miss, bottleneck on external memory** |
diff --git a/AE/docs/papers/bench/ARM_Mali_T830.md b/AE/docs/papers/bench/ARM_Mali_T830.md
@@ -9,8 +9,8 @@
 * Clock: 1000 MHz
 * Bus width: 128 bits
 * Memory: 2GB, LPDDR3, DC 32bit, 933MHz, **14.9**GB/s (4GB/s from tests)
-* FP16 GFLOPS: **56**
-* FP32 GFLOPS: **32**
+* FP16 GFLOPS: **56** (10.4 on MulAdd from tests)
+* FP32 GFLOPS: **32** (10.4 on MulAdd from tests)
 * Device: Samsung J7 Neo (Android 9, Driver 28.0.0)
 
 ## Shader
@@ -49,13 +49,13 @@ Doesn't support quad and subgroups.
 	|---|---|---|
 	| **7.7**  | Add    | 7.7  |
 	| **10.1** | Mul    | 10.1 |
-	| **5.2**  | MulAdd | 10.4 |
+	| **5.2**  | MulAdd | **10.4** |
 	| **1.3**  | FMA    | 2.6  |
 
 
 ### NaN / Inf
 
-* FP32, Mediump
+* FP32, Mediump. [[11](../GPU_Benchmarks.md#11-NaN)]
 
 	| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
 	|---|---|---|---|---|---|---|---|---|

diff --git a/AE/docs/papers/bench/Adreno_505.md b/AE/docs/papers/bench/Adreno_505.md
@@ -54,7 +54,7 @@
 
 ### NaN / Inf
 
-* FP32
+* FP32. [[11](../GPU_Benchmarks.md#11-NaN)]
 
 	| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
 	|---|---|---|---|---|---|---|---|---|

diff --git a/AE/docs/papers/bench/Adreno_660.md b/AE/docs/papers/bench/Adreno_660.md
@@ -4,8 +4,8 @@
 ## Specs
 
 * Clock: 840 MHz (790?)
-* F16 GFLOPS: **3244** (680 GOp/s on MulAdd from tests)
-* F32 GFLOPS: **1622** (364 GOp/s on FMA from tests)
+* F16 GFLOPS: **3244** (1414 on MulAdd from tests)
+* F32 GFLOPS: **1622** (728 on FMA from tests)
 * F64 GFLOPS: **405**
 * GMem size: 1.5 Mb (bandwidth?)
 * L2: ? (bandwidth?)
@@ -53,6 +53,11 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 
 ### Instruction cost
 
+* All instructions benchmark [[4](../GPU_Benchmarks.md#4-Shader-instruction-benchmark)]:
+	* fp32 FMA is preferred than single FMul or separate FMulAdd
+	* fp32 SignOrZero is x3.9 faster than Sign
+	- fp32 & i32 datapaths can execute in parallel in 2:1 rate.
+
 * FP32 instruction benchmark [[2](../GPU_Benchmarks.md#2-fp32-instruction-performance)]:
 	- Loop unrolling is fast during pipeline creation if loop < 256.
 	- Loop unrolling is 1x - 1.4x faster, 2x slower on 1024, 1.1x slower on 256.
@@ -63,21 +68,21 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 
 	| GOp/s | exec time (ms) | ops | max GFLOPS |
 	|---|---|---|
-	| **420** | 10.2 | F32Add, F32Mul    | 420 |
-	| **364** | 11.8 | F32FMA, F32MulAdd | **728** |
+	| **420** | 10.2 | Add, Mul    | 420 |
+	| **364** | 11.8 | FMA, MulAdd | **728** |
 
 * FP16 instruction benchmark [[1](../GPU_Benchmarks.md#1-fp16-instruction-performance)]:
 
 	| GOp/s | exec time (ms) | ops | max GFLOPS |
 	|---|---|---|
-	| **830** | 5.16 | F16Add, F16Mul | 830 |
-	| **707** | 6.06 | F16MulAdd      | **1414** |
-	| **117** | 36.5 | F16FMA         | 234 |
+	| **830** | 5.16 | Add, Mul | 830 |
+	| **707** | 6.06 | MulAdd   | **1414** |
+	| **117** | 36.5 | FMA      | 234 |
 
 
 ### NaN / Inf
 
-* FP32, FP16
+* FP32, FP16. [[11](../GPU_Benchmarks.md#11-NaN)]
 
 	| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
 	|---|---|---|---|---|---|---|---|---|
@@ -177,14 +182,14 @@ Result of `Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize )` in compute shad
 ## Texture cache
 
 * RGBA8_UNorm texture with random access [[9](../GPU_Benchmarks.md#9-Texture-cache)]
-	- Measured cache size: 2 KB, 128 KB.
+	- Measured cache size: 2 KB, 4 KB (?), 128 KB.
 	- 8 texels per pixel, dim ???
 
-	| size (KB) | dimension (px) | exec time (ms) | diff | approx bandwidth (GB/s) |
+	| size (KB) | dimension (px) | exec time (ms) | diff | approx bandwidth (GB/s) | comments |
 	|---|---|---|---|
-	|   1 |  16x16  |  TODO |     |  |
-	|   2 |  32x16  |  2.3  |     | TODO |
-	|   4 |  32x32  |  7    | 3   | |
-	|  16 |  64x64  |  12.4 | 1.8 | |
-	| 128 | 256x128 |  14   |     | |
-	| 256 | 256x256 |  44   | 3   | |
+	|   1     |  16x16  |  TODO |         |  |
+	| **2**   |  32x16  |  2.3  |         | TODO | L1 cache |
+	| **4**   |  32x32  |  7    | **3**   | |
+	|  16     |  64x64  |  12.4 | **1.8** | |
+	| **128** | 256x128 |  14   |         | | L2 cache |
+	| 256     | 256x256 |  44   | **3**   | |