LoongArch64实验性向量调用约定实现 (Experimental implementation of LoongArch64 vector calling convention)

__For English version, please keep scrolling down.__

这是目前基于GCC对LA64实现的一个实验性的向量调用约定，代码位于dev/vecarg分支，如果存在问题和不完善的情况欢迎大家进行讨论。如果中英文版本之间存在描述模棱两可的情况也请提出，谢谢大家！

代码位于https://github.com/loongson/gcc/pull/113

## 关键字
- VR: 向量寄存器(Vector Register)
- VAR: 向量参数寄存器(Vector Argument Register)
- VRLEN: 向量寄存器位宽(Vector Register Length)

## 向量类型
向量的比特位宽可以是128bit或者256bit，并且总是包含多个元素。向量元素从最低比特起占据向量空间，并且拥有从0开始递增的index。
向量的元素类型遵循于LP64数据模型。

## 向量寄存器
LA64可以选择性的实现32个128位或者256位的向量寄存器硬件。如果实现向量寄存器，则必须实现双精度浮点硬件单元。
同编号的256位向量寄存器的低半部分与128位寄存器共用，同编号的128位向量寄存器低半部分和浮点寄存器共用。

以下为向量寄存器的使用约定：


| 名称                                         | 用途                    | 是否在过程间保存    |
|---------------------------------------------|-------------------------|------------------|
| $vr0 - $vr1 (128位) / $xr0 - $xr1 (256位）   | 参数寄存器/返回值寄存器     | 否               |
| $vr2 - $vr7 (128位) / $xr2 - $xr7 (256位）   | 参数寄存器                | 否               |
| $vr8 - $vr31 (128位) / $xr8 - $xr31 (256位)  | 临时寄存器                | 否               |


TODO：对于在过程间保存完整内容的寄存器（static register/callee-saved register)，目前尚无明确最终方案，需要有效的性能测试手段来辅助判断。
目前在配合sleef向量数学库（还未提交社区）对x264、libjpeg-turbo进行性能测试的过程中，不同s/t寄存器的分配对性能没有产生明显影响。

## 向量调用约定
向量调用约定扩展是叠加于LP64D之上、使用128/256位向量寄存器，对向量参数和返回值进行传递的调用约定扩展。
可以通过以下的方式启用该调用约定：

- 使用vecarg选项对编译模块进行编译。这会使编译模块内的所有使用了向量参数、向量返回值函数都遵循该调用约定。
- 使用vecarg属性在源码中标记特定函数。被标记的函数会启用该调用约定。

为了使向量调用约定在函数、编译模块之间的行为保持一致，需要遵循以下的要求：

- 如果使用vecarg选项构建一个编译模块，如果另一模块调用了该模块使用了向量参数、向量返回值的函数，该模块也应当使用vecarg选项进行编译。
- 如果使用vecarg属性标记了一个函数，该函数的所有声明、定义都应当使用vecarg属性进行标记。
- 对于所有利用了向量调用约定的编译模块，使用相同的向量长度指令集支持进行编译。

p.s.: 对于GCC当前的PoC实现，vecarg选项对应`-mvecarg`, vecarg属性对应于`__attribute__ ((vecarg))`。

## 子程序调用流程

在以下的向量调用约定描述中，对于128/256位向量的传递描述中，我们都认为编译器开启了对应位宽的向量指令支持。

### 寄存器
VAR：0-7号向量寄存器按照编号依次用于向量参数的传递。同时，0-1号向量寄存器用于向量返回值的传递。向量参数传递时，总是会选择VRLEN等于向量参数位宽的VAR进行传递。

### 参数传递
在启用向量调用约定时，参数可能的传递形式如下：

1. 一个参数寄存器。
2. 一对编号连续的参数寄存器。
3. 下面的任意一种不同类型参数寄存器的配对组合：
    - 一个GAR和一个FAR
    - 一个GAR和一个VAR
    - 一个FAR和一个VAR
4. 一个在栈区域连续的内存块，该内存块具有由子程序调用者的$sp计算的偏移常量
5. 1和4的组合。

#### 单个向量参数的传递

1. 128位向量
    - 如果存在至少1个VAR可用，则使用VAR进行传递。
    - 如果无VAR可用，至少2个编号相邻的GAR可用，则使用这一对GAR进行传递，低64位存储在编号靠前的GAR,高64位存储在编号靠后的GAR。
    - 其他情况，完全通过栈进行传递。

2. 256位向量
    - 如果存在至少1个VAR可用，则使用VAR进行传递。
    - 如果无VAR可用，至少一个GAR可用，则将256位向量存储在调用者的栈空间，并且将存放位置对应的内存地址存放在GAR。
    - 如果无VAR和GAR可用，则完全通过栈进行传递。

#### 带有向量成员的结构体的传递

无论何种场合，最多仅使用两个寄存器（所有使用的寄存器类型的数量之和）进行结构体的传递，否则从栈进行参数传递。

- 如果结构体仅存在一个成员，并且该成员是向量，则参数传递规则与单个向量参数的传递行为相同。
- 当结构体的成员为两个时：
    1. 如果两个成员均为向量，而且有至少两个编号连续的VAR可用，则使用这两个VAR进行传递。
    2. 如果结构体包含一个向量成员、一个浮点成员，在FAR、VAR有空闲的前提下，使用一个VAR、一个FAR对两个成员进行传递。
    3. 如果结构体包含一个向量成员、一个整型成员，在VAR、GAR有空闲的前提下，使用一个VAR、一个GAR对两个成员进行传递。
- 其他情况下，如果至少有一个空闲的GAR,则进行引用传参，否则从栈进行传递。

p.s.:如果结构体成员包含0长度位域、0长度数组、空结构体或空组合体等成员，其处理规则与基础ABI中Other structures中所描述的处理方式相同。

#### 可变长参数列表的传递

对于向量参数，不使用VAR/FAR进行传递。

对于128位向量，如果至少有两个GAR可用，并且首个GAR的编号为偶数，则使用这对GAR传递参数。

对于256位向量，根据向量位宽遵循现有基础ABI定义。

### 返回值
0-1号VAR用于返回值的传递，传递方式与参数列表中首个参数的传递逻辑相同。

---

This is an experimental vector calling convention implementation for LoongArch64 (LA64), based on GCC. The code lives on the dev/vecarg branch, and the ad-hoc implementation can be found in this pull request: https://github.com/loongson/gcc/pull/113.

Any discussion of problems or gaps in this prototype calling convention is welcome. Please also report any inconsistency or ambiguity between the Chinese and English versions. Thanks!

## Keywords
- VR: Vector Register
- VAR: Vector Argument Register
- VRLEN: Vector Register Length

## Vector Types
A vector is either 128 bits or 256 bits wide and always contains
multiple elements. Elements occupy the vector consecutively, starting
from the lowest bits, and are indexed in increasing order starting from zero.

The element types of a vector follow the LP64 data model.

## Vector Register
LoongArch machines that implement LA64 can optionally have 32 vector registers,
each either 128 or 256 bits wide depending on the hardware implementation. If vector
registers are implemented, a double-precision FPU must be implemented as well.
Floating-point registers and vector registers with the same register number follow the overlapping rules below:

- A floating-point register overlaps the lower 64 bits of the 128-bit and 256-bit vector registers with the same number.
- A 128-bit vector register overlaps the lower 128 bits of the 256-bit vector register with the same number.
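
The overlap rules can be pictured with a small C model. This is only a sketch: the `reg_slot` union and both helper functions are hypothetical names introduced for illustration, and the model says nothing about what a real write through a narrower view does to the upper bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of register number N: all three views start at the
   lowest-order byte (LoongArch is little-endian). */
typedef union {
    uint8_t xr[32]; /* $xrN: 256-bit view */
    uint8_t vr[16]; /* $vrN: low 128 bits of $xrN */
    uint8_t f[8];   /* $fN:  low 64 bits of $vrN */
} reg_slot;

static_assert(sizeof(reg_slot) == 32, "slot is as wide as the 256-bit view");

/* Store a 64-bit pattern through the $fN view, lowest byte first. */
static void write_f(reg_slot *r, uint64_t bits) {
    for (int i = 0; i < 8; i++)
        r->f[i] = (uint8_t)(bits >> (8 * i));
}

/* Read the low 64 bits back through the $vrN view. */
static uint64_t low64_of_vr(const reg_slot *r) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)r->vr[i] << (8 * i);
    return v;
}
```

In this model, a value stored through the `f` view is visible unchanged in the low 64 bits of both the `vr` and `xr` views.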


The vector registers are used as follows:

| Name                                             | Usage                                       | Preserved across calls |
|--------------------------------------------------|---------------------------------------------|------------------------|
| $vr0 - $vr1 (128-bit) / $xr0 - $xr1 (256-bit)    | Argument registers / return value registers | No                     |
| $vr2 - $vr7 (128-bit) / $xr2 - $xr7 (256-bit)    | Argument registers                          | No                     |
| $vr8 - $vr31 (128-bit) / $xr8 - $xr31 (256-bit)  | Temporary registers                         | No                     |

TODO: No final decision has been made yet on registers whose full contents are preserved across calls (static registers / callee-saved registers). Effective performance measurements are needed to guide that decision.

In the current performance tests, which run x264 and libjpeg-turbo together with the sleef vector math library (LoongArch support not yet submitted upstream), different allocations of static/temporary registers have shown no significant effect on performance.

## Vector Calling Conventions
The vector calling convention is an extension layered on top of LP64D. It uses 128-bit/256-bit vector registers to pass vector arguments and return values. It can be enabled in either of the following ways:

- Compile an object with the "vecarg option". Every function in that object that takes vector arguments or returns a vector value then follows this calling convention.
- Mark a specific function with the "vecarg attribute". The marked function then follows this calling convention.

To keep the behavior of the vector calling convention consistent across functions and objects, the following rules must be followed:

- If object A is compiled with the "vecarg option" and object B calls functions from object A that take vector arguments or return a vector value, object B must also be compiled with the "vecarg option".
- If a function is marked with the "vecarg attribute", all declarations and definitions of that function must be marked with the "vecarg attribute".
- All objects that use the vector calling convention must be compiled with the same SIMD instruction set option, that is, with the same maximum vector length.

p.s.: For the current GCC PoC implementation, the "vecarg option" is `-mvecarg` and the "vecarg attribute" is `__attribute__ ((vecarg))`.
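
As an illustration of the attribute rule, here is a minimal sketch of how a function might be marked. Only `__attribute__ ((vecarg))` comes from the PoC; the `VECARG` wrapper macro, the `USE_VECARG` guard, the `v4i32` typedef and `add_v4i32` are hypothetical names chosen for this example, and the guard is there only so the file still compiles with compilers that lack the PoC patch.

```c
#include <assert.h>

/* VECARG expands to the PoC attribute only when the build opts in by
   defining the hypothetical macro USE_VECARG on a LoongArch target;
   otherwise it expands to nothing. */
#if defined(__loongarch__) && defined(USE_VECARG)
#define VECARG __attribute__((vecarg))
#else
#define VECARG
#endif

/* 128-bit vector of four 32-bit integers (GCC vector extension). */
typedef int v4i32 __attribute__((vector_size(16)));

static_assert(sizeof(v4i32) == 16, "v4i32 is a 128-bit vector");

/* The declaration and the definition both carry the attribute. */
VECARG v4i32 add_v4i32(v4i32 a, v4i32 b);

VECARG v4i32 add_v4i32(v4i32 a, v4i32 b) {
    return a + b; /* element-wise addition */
}
```

Routing every declaration through one macro is a design choice that makes it harder to mark a definition while forgetting a prototype in a header.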

## Subroutine Calling Sequence

In the following description of the vector calling convention, whenever the passing of a 128-bit or 256-bit vector is described, we assume that the compiler has vector instruction support of the corresponding width enabled.

### Registers
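VAR: Vector registers 0 - 7 are used, in order of register number, to pass vector arguments; vector registers 0 - 1 are also used to pass vector return values. A vector argument is always passed in a VAR whose VRLEN equals the bit width of the argument.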

### Argument Passing
When the vector calling convention is enabled, an argument is passed in one of the following ways:

1. An argument register.
2. A pair of argument registers with adjacent numbers.
3. A pair of argument registers of different types, in any of the following combinations:
    - a GAR and a FAR.
    - a GAR and a VAR.
    - a FAR and a VAR.
4. A contiguous block of memory in the stack arguments region, with a constant offset from the caller's outgoing `$sp`.
5. A combination of 1 and 4.

#### Passing Single Vector Argument

1. 128-bit vector argument
    - If at least 1 VAR is available, pass the argument in a single VAR.
    - If no VAR is available and at least 2 GARs with adjacent numbers are available, pass the argument in that pair of GARs; the low 64 bits of the vector are stored in the lower-numbered GAR and the high 64 bits in the higher-numbered GAR.
    - Otherwise, pass the argument entirely on the stack.

2. 256-bit vector argument
    - If at least 1 VAR is available, pass the argument in a single VAR.
    - If no VAR is available and at least 1 GAR is available, store the 256-bit vector in the caller's stack space and pass its address in the GAR.
    - If neither a VAR nor a GAR is available, pass the argument entirely on the stack.
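
The decision order above can be sketched as a small classifier. This is an illustrative model, not the GCC implementation: `arg_state`, `pass_kind` and `classify_vector_arg` are hypothetical names, and the model assumes that GARs are handed out in increasing order, so any two free GARs are adjacent. Whether a single leftover GAR stays free after a 128-bit vector goes to the stack is not stated in the text; the model simply leaves it untouched.

```c
#include <assert.h>

/* Hypothetical bookkeeping: how many argument registers are still free. */
typedef struct {
    int free_var; /* vector argument registers left */
    int free_gar; /* general argument registers left */
} arg_state;

typedef enum {
    PASS_IN_VAR,       /* one VAR */
    PASS_IN_GAR_PAIR,  /* two adjacent GARs (128-bit only) */
    PASS_BY_REFERENCE, /* address in one GAR (256-bit only) */
    PASS_ON_STACK
} pass_kind;

/* Classify one non-variadic vector argument of the given bit width. */
static pass_kind classify_vector_arg(arg_state *s, int bits) {
    assert(bits == 128 || bits == 256);
    if (s->free_var >= 1) {
        s->free_var -= 1;
        return PASS_IN_VAR;
    }
    if (bits == 128 && s->free_gar >= 2) {
        s->free_gar -= 2;
        return PASS_IN_GAR_PAIR;
    }
    if (bits == 256 && s->free_gar >= 1) {
        s->free_gar -= 1;
        return PASS_BY_REFERENCE;
    }
    return PASS_ON_STACK;
}
```

Calling the classifier repeatedly on the same `arg_state` mimics walking an argument list from left to right.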

#### Passing Struct with Vector Member

In all cases, at most 2 registers (counted across all register types) are used to pass a struct with a vector member; otherwise the struct is passed on the stack.

- If the struct has only 1 member and that member is a vector, the struct is passed by the same rules as a single vector argument.
- If the struct has 2 members:
    1. If both members are vectors and at least 2 VARs with adjacent numbers are available, pass the struct in those VARs; the first vector member is stored in the first VAR and the second vector member in the second VAR.
    2. If the struct contains 1 vector member and 1 floating-point member, and a VAR and a FAR are both available, pass the struct in 1 VAR and 1 FAR.
    3. If the struct contains 1 vector member and 1 integer member, and a VAR and a GAR are both available, pass the struct in 1 VAR and 1 GAR.
    4. Otherwise, if at least 1 GAR is available, pass the struct by reference; if not, pass it on the stack.

p.s.: If the struct contains zero-width bit-fields, zero-length arrays, empty structs or empty unions, it is handled as described under "Other structures" in the base ABI document.
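
The two-member rules can be modelled in the same style. Again this is only a sketch under assumptions: `member_kind`, `reg_state`, `struct_pass` and `classify_two_member_struct` are hypothetical names, free VARs are assumed to be adjacent, and the special members mentioned in the note above are not modelled.

```c
#include <assert.h>

typedef enum { M_VECTOR, M_FLOAT, M_INTEGER } member_kind;

/* Hypothetical bookkeeping of free argument registers per class. */
typedef struct {
    int free_var;
    int free_far;
    int free_gar;
} reg_state;

typedef enum {
    S_TWO_VARS,
    S_VAR_AND_FAR,
    S_VAR_AND_GAR,
    S_BY_REFERENCE, /* address of the struct in one GAR */
    S_ON_STACK
} struct_pass;

/* Classify a struct with exactly two members, at least one a vector. */
static struct_pass classify_two_member_struct(reg_state *s,
                                              member_kind a, member_kind b) {
    int vectors = (a == M_VECTOR) + (b == M_VECTOR);
    int floats = (a == M_FLOAT) + (b == M_FLOAT);
    int ints = (a == M_INTEGER) + (b == M_INTEGER);
    assert(vectors >= 1);

    if (vectors == 2 && s->free_var >= 2) {
        s->free_var -= 2;
        return S_TWO_VARS;
    }
    if (floats == 1 && s->free_var >= 1 && s->free_far >= 1) {
        s->free_var -= 1;
        s->free_far -= 1;
        return S_VAR_AND_FAR;
    }
    if (ints == 1 && s->free_var >= 1 && s->free_gar >= 1) {
        s->free_var -= 1;
        s->free_gar -= 1;
        return S_VAR_AND_GAR;
    }
    if (s->free_gar >= 1) { /* fall back to passing by reference */
        s->free_gar -= 1;
        return S_BY_REFERENCE;
    }
    return S_ON_STACK;
}
```

The fallback branch shows why a struct can end up passed by reference even when its members would individually fit in registers: the needed register classes must be free at the same time.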

#### Variadic Arguments

Vector arguments in a variadic argument list are never passed in VARs or FARs.

For a 128-bit vector argument, if at least 2 GARs are available and the first of them has an even number, the argument is passed in that pair of GARs.

For a 256-bit vector argument, the existing base ABI rules for an argument of that bit width (256 bits) apply.

### Return Value

VARs 0 - 1 are used to pass the return value. A return value is passed by the same rules as the first argument of an argument list.


