# Implementation of Vectorized Execution
This section introduces the implementation details of the TiDB vectorized execution model.
## Understanding Vectorized Execution
Vectorized execution, also known as batch processing, is a method of processing data in batches rather than row by row. Traditional row-based processing handles one row at a time, which incurs significant per-row overhead and reduced efficiency, especially when dealing with large datasets. Vectorized execution, on the other hand, processes data in chunks, or vectors, allowing for better utilization of CPU and memory resources.
## Key Benefits of Vectorized Execution
1. **Improved CPU Utilization**: Processing data in batches minimizes the overhead associated with instruction fetching and decoding, leading to better CPU utilization and improved performance.
2. **Reduced Memory Access**: Data processed in vectors is more likely to be present in the CPU cache, reducing the need for memory access, which is often a performance bottleneck.
3. **Reduced Branching**: Traditional row-based processing often involves conditional statements and branching, which can hinder performance. Vectorized execution minimizes branching, leading to more predictable and faster execution.
## Implementing Vectorized Execution in TiDB
TiDB leverages a memory layout similar to Apache Arrow to process a batch of data at a time. This approach allows for efficient data processing, reducing per-row overhead and improving performance.
### Columnar Memory Layout Implementation in TiDB
In TiDB, the columnar memory layout is defined as a `Column`, and a batch of these `Columns` is referred to as a `Chunk`. The implementation of `Column` draws inspiration from Apache Arrow, ensuring efficient data processing. Depending on the data type they store, TiDB employs two types of `Columns`:
- **Column with Fixed Length**: These Columns store data of a fixed length, such as Double, Bigint, Decimal, and similar data types. This structure is optimized for predictable and consistent data sizes, facilitating efficient processing.
- **Column with Variable Length**: These Columns accommodate variable-length data types, including Char, Varchar, and others. Variable-length data types can hold a range of character lengths, and the Column structure adapts to handle such variability.
In TiDB, the Column and Chunk types are defined as follows:
```go
type Column struct {
	length     int     // the number of elements in the column
	nullBitmap []byte  // bitmap indicating null values
	offsets    []int64 // used for varLen column; row i starts from data[offsets[i]]
	data       []byte  // the actual data
	elemBuf    []byte  // a scratch buffer of one element's size, used for fixed-length appends
	// ... (additional fields)
}

type Chunk struct {
	columns  []*Column
	sel      []int // sel indicates which rows are selected. If it is nil, all rows are selected.
	capacity int   // capacity indicates the maximum number of rows this chunk can hold.
	// ... (additional fields)
}
```

### Column and Chunk Data Manipulation

TiDB supports various data manipulation operations on `Column` and `Chunk`:

**Appending a Fixed-Length Non-NULL Value to a Column**:
- To append an element, a specific `append` method tailored to the data type is called (e.g., [AppendInt64](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/column.go#L246-L248)).
- The data to be appended is shallow copied to the `elemBuf` using an `unsafe.Pointer`.
- The data in `elemBuf` is then appended to the `data`.
- A `1` is appended to the `nullBitmap`.
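
A minimal sketch of this fixed-length append path, assuming the `Column` fields shown earlier (the method name and the bitmap handling are illustrative, not TiDB's exact code):

```go
import "unsafe"

// appendInt64Sketch appends one non-NULL int64, following the steps above.
func (c *Column) appendInt64Sketch(v int64) {
	// Shallow-copy the value into elemBuf via an unsafe.Pointer.
	*(*int64)(unsafe.Pointer(&c.elemBuf[0])) = v
	// Append the bytes of elemBuf to the data buffer.
	c.data = append(c.data, c.elemBuf...)
	// Set the new row's bit to 1 (non-NULL) in the null bitmap.
	if c.length%8 == 0 {
		c.nullBitmap = append(c.nullBitmap, 0)
	}
	c.nullBitmap[c.length/8] |= 1 << uint(c.length%8)
	c.length++
}
```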

**Appending a Non-Fixed-Length Non-NULL Value to a Column**:
- To append a variable-length data value, such as a string, it is directly appended to the `data`.
- A `1` is appended to the `nullBitmap`.
- The starting point of the newly appended data in the `data` is recorded in the `offsets`.
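
A matching sketch for the variable-length path (same assumptions; `offsets` is assumed to start with a single `0` entry so that `offsets[i]` and `offsets[i+1]` always delimit row `i`):

```go
// appendStringSketch appends one non-NULL string, following the steps above.
func (c *Column) appendStringSketch(s string) {
	// Variable-length data is appended to the data buffer directly.
	c.data = append(c.data, s...)
	// Set the new row's bit to 1 (non-NULL) in the null bitmap.
	if c.length%8 == 0 {
		c.nullBitmap = append(c.nullBitmap, 0)
	}
	c.nullBitmap[c.length/8] |= 1 << uint(c.length%8)
	c.length++
	// Record where the next element will start in data.
	c.offsets = append(c.offsets, int64(len(c.data)))
}
```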

**Appending a NULL Value to a Column**:
- To append a NULL value, the [AppendNull](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/column.go#L229-L237) function is used.
- A `0` is appended to the `nullBitmap`.
- If it's a fixed-length column, placeholder data of the same size as `elemBuf` is appended to the `data`.
- If it's a variable-length column, no data is appended to the `data`, but the starting point of the next element is recorded in the `offsets`.
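
A sketch of the NULL path under the same assumptions (a fixed-length column is identified here by a non-nil `elemBuf`, which holds exactly one element's worth of bytes):

```go
// appendNullSketch appends one NULL value, following the steps above.
func (c *Column) appendNullSketch() {
	// Leave the new row's bit as 0 (NULL) in the null bitmap.
	if c.length%8 == 0 {
		c.nullBitmap = append(c.nullBitmap, 0)
	}
	if c.elemBuf != nil {
		// Fixed-length column: append placeholder bytes of the element
		// size so that rowIdx*elemSize arithmetic still locates every row.
		c.data = append(c.data, c.elemBuf...)
	} else {
		// Variable-length column: append no data; recording the start of
		// the next element maps this row to a zero-length slice of data.
		c.offsets = append(c.offsets, int64(len(c.data)))
	}
	c.length++
}
```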

**Reading a Value from a Column**:
- Values in a `Column` can be read using functions like [GetInt64(rowIdx)](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/column.go#L551-L553) and [GetString(rowIdx)](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/column.go#L576-L578). Reading mirrors the appending mechanism described above: the element is located in `data` by its row index, using the fixed element size for fixed-length columns or the `offsets` array for variable-length ones.
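
For example, a fixed-length read is plain offset arithmetic, and a variable-length read slices `data` between two adjacent offsets (sketches under the same assumptions as the append sketches, not TiDB's exact code):

```go
import "unsafe"

// getInt64Sketch reads row rowIdx of an int64 column: every element
// occupies 8 bytes, so the value lives at data[rowIdx*8:].
func (c *Column) getInt64Sketch(rowIdx int) int64 {
	return *(*int64)(unsafe.Pointer(&c.data[rowIdx*8]))
}

// getStringSketch reads row rowIdx of a string column: offsets[rowIdx]
// and offsets[rowIdx+1] delimit the element's bytes in data.
func (c *Column) getStringSketch(rowIdx int) string {
	return string(c.data[c.offsets[rowIdx]:c.offsets[rowIdx+1]])
}
```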

**Reading a Row from a Chunk**:
- Within a `Chunk`, a [Row](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/row.go#L25-L28) is a logical concept: the data of a row is spread across the different `Columns` of the `Chunk` and is not necessarily stored consecutively in memory. Obtaining a `Row` object therefore requires no data copying, since the row's data already resides in the corresponding `Columns`.
- The concept of a `Row` is useful because, during the operation of operators, data is often accessed and processed on a per-row basis. For example, operations like aggregation, sorting, and similar tasks work with data at the row level.
- You can retrieve a row from a `Chunk` using the [GetRow(rowIdx)](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/chunk.go#L356-L368) function. Once you have a `Row` object, you can further access the data in specific columns within that row using functions like [Row::GetInt64(colIdx)](https://github.com/pingcap/tidb/blob/ecaa1c518cc9367844ebb5206f2e970461c8bf28/pkg/util/chunk/row.go#L51-L53), which allows you to retrieve the data corresponding to the specified column index for that row.
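
A typical usage pattern, assuming `chk` is a `*Chunk` whose first column is a BIGINT column and `process` is a stand-in for the per-row work:

```go
// Iterate all rows of the chunk and read column 0 as int64.
for i := 0; i < chk.NumRows(); i++ {
	row := chk.GetRow(i)
	if row.IsNull(0) {
		continue // skip NULL values
	}
	process(row.GetInt64(0))
}
```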

### Examples
#### How expressions are evaluated

In this section, we’ll use the TiDB expression `colA * 0.8 + colB` to demonstrate how expression evaluation works using vectorized execution and to highlight the performance gap between row-based execution and vectorized execution.

**Expression Tree Representation**

The TiDB expression `colA * 0.8 + colB` is parsed into an expression evaluation tree, where each non-leaf node represents an arithmetic operator and each leaf node represents a data source. A leaf node can be either a constant (like `0.8`) or a field (like `colA`) in the table. The parent-child relationship between nodes indicates a computational dependency: the evaluation result of the child node is the input data for the parent node.

```
                            ┌─┐
                        ┌───┤+├───┐
                        │   └─┘   │
                       ┌┴┐       ┌┴┐
colA*0.8+colB ───►  ┌──┤*├───┐   │B│
                    │  └─┘   │   └─┘
                   ┌┴┐     ┌─┴─┐
                   │A│     │0.8│
                   └─┘     └───┘
```
**Non-Vectorized Execution**

In a non-vectorized execution model, the computing logic of each node can be abstracted using the following evaluation interface:
```go
type Node interface {
	evalReal(row Row) (val float64, isNull bool)
}
```

Taking the `*`, `0.8`, and column nodes as examples, all three node types implement the interface above. Their pseudocode is as follows:

```go
func (node *multiplyRealNode) evalReal(row Row) (float64, bool) {
	v1, null1 := node.leftChild.evalReal(row)
	v2, null2 := node.rightChild.evalReal(row)
	return v1 * v2, null1 || null2
}

func (node *constantNode) evalReal(row Row) (float64, bool) {
	return node.someConstantValue, node.isNull // 0.8 in this case
}

func (node *columnNode) evalReal(row Row) (float64, bool) {
	return row.GetReal(node.colIdx)
}
```

In non-vectorized execution, the expression tree is interpreted once per row. Each time this function performs a multiplication, only a few instructions carry out the "real" multiplication, compared to the total number of assembly instructions needed to execute the function call.
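
For example, a row-at-a-time driver walks the whole tree once per row, paying the dispatch and interpretation cost on every iteration (`root`, `chk`, and `output` are illustrative):

```go
// Row-based evaluation: the full expression tree is interpreted once
// per row, so each multiplication is buried under call overhead.
for i := 0; i < chk.NumRows(); i++ {
	val, isNull := root.evalReal(chk.GetRow(i))
	output(val, isNull)
}
```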

**Vectorized Execution**

In vectorized execution, the interface to evaluate an expression in a batch manner in TiDB looks like this:


```go
type VecNode interface {
	vecEvalReal(input *Chunk, result *Column)
}
```

Taking `multiplyRealNode` as an example:

```go
func (node *multiplyRealNode) vecEvalReal(input *Chunk, result *Column) {
	buf := pool.AllocColumnBuffer(TypeReal, input.NumRows())
	defer pool.ReleaseColumnBuffer(buf)
	// Evaluate both children over the whole batch at once.
	node.leftChild.vecEvalReal(input, result)
	node.rightChild.vecEvalReal(input, buf)

	f64s1 := result.Float64s()
	f64s2 := buf.Float64s()
	// A result row is NULL if either operand is NULL.
	result.MergeNulls(buf)
	for i := range f64s1 {
		if result.IsNull(i) {
			continue
		}
		f64s1[i] *= f64s2[i]
	}
}
```

This implementation reduces the interpretation overhead by batch processing, which is more beneficial for modern CPUs:

- A vector of data is sequentially accessed. This reduces CPU cache misses.
- Most of the computational work is within a simple loop. This facilitates CPU branch prediction and instruction pipelining.

We use the same dataset (1024 rows with two columns of floating-point numbers) to compute `colA * 0.8 + colB` in two ways: row-based execution and vectorized execution. The results are as follows:


```
BenchmarkVec-12    152166     7056 ns/op    0 B/op    0 allocs/op
BenchmarkRow-12     28944    38461 ns/op    0 B/op    0 allocs/op
```

The results above show that vectorized execution is about five times faster than row-based execution (7056 ns/op versus 38461 ns/op).
For more details about vectorized expression evaluation, see [this blog post](https://www.pingcap.com/blog/10x-performance-improvement-for-expression-evaluation-made-possible-by-vectorized-execution/).


#### How operators are evaluated

In this section, we'll dive deeper into the evaluation of operators, focusing on HashJoin as an example.

HashJoin in vectorized execution consists of the following steps:

**Hashing Phase**

Consider `t` as the table used for constructing the hash table. The data of table `t` is read into `Chunk`s in batches. First, the data in each Chunk is filtered column by column according to the predicates on table `t`, and the per-column filter results are combined into a `selected` array, in which `true` marks a valid row. The relevant code is in [VectorizedFilter](https://github.com/pingcap/tidb/blob/fd3b2cc571a23ec5169ffe428a7b1232c8ccab96/pkg/executor/join.go#L1252).

Next, the hash values of the remaining valid data in the Chunk are calculated column-wise. If multiple columns participate in the hash key, their values are concatenated to form the final hash key of each row. Further details can be found in [HashChunkSelected](https://github.com/pingcap/tidb/blob/fd3b2cc571a23ec5169ffe428a7b1232c8ccab96/pkg/executor/hash_table.go#L467).

Finally, guided by the `selected` array, the hash keys of the valid rows, together with their row pointers, are used to construct the hash table.
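
The build phase for one `Chunk` can be sketched as follows; `rowPtr`, `selected`, and `hashKeys` are illustrative stand-ins for TiDB's internal structures, not the actual implementation:

```go
// rowPtr locates a build-side row by chunk index and row index.
type rowPtr struct {
	chkIdx, rowIdx int
}

// buildHashTable inserts the valid rows of one chunk into the hash table.
// selected[i] is the combined predicate result for row i, and hashKeys[i]
// is its hash key; both are assumed to be computed column-wise as above.
func buildHashTable(numRows, chkIdx int, selected []bool, hashKeys []uint64, ht map[uint64][]rowPtr) {
	for i := 0; i < numRows; i++ {
		if !selected[i] {
			continue // row filtered out by the predicates on t
		}
		ht[hashKeys[i]] = append(ht[hashKeys[i]], rowPtr{chkIdx: chkIdx, rowIdx: i})
	}
}
```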

**Probe Phase**

The probe phase in HashJoin mirrors the build phase. Data from the probe table is read into `Chunk`s in batches, predicates are applied column by column to produce a `selected` array marking the valid rows, and hash keys are then calculated for each valid row.

For the valid rows in the Chunk, the calculated hash key is used to probe the hash table constructed during the build phase, identifying candidate matching rows. The implementation can be explored in [join2Chunk](https://github.com/pingcap/tidb/blob/fd3b2cc571a23ec5169ffe428a7b1232c8ccab96/pkg/executor/join.go#L987).
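
A simplified sketch of the probe loop, using the same illustrative types as the build sketch above (`onMatch` stands in for the join-output logic):

```go
// probeHashTable looks up each valid probe row's hash key in the hash
// table built earlier and hands candidate matches to onMatch.
func probeHashTable(numRows int, selected []bool, hashKeys []uint64, ht map[uint64][]rowPtr, onMatch func(probeRow int, candidates []rowPtr)) {
	for i := 0; i < numRows; i++ {
		if !selected[i] {
			continue
		}
		if candidates, ok := ht[hashKeys[i]]; ok {
			// Candidates only share a hash key; the actual join keys must
			// still be compared to rule out hash collisions.
			onMatch(i, candidates)
		}
	}
}
```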

**Matching and Output**

When matching rows are found in the hash table, they are output as joined rows into a new `Chunk`. See [joinMatchedProbeSideRow2Chunk](https://github.com/pingcap/tidb/blob/fd3b2cc571a23ec5169ffe428a7b1232c8ccab96/pkg/executor/join.go#L925).

Vectorized computation in HashJoin offers considerable performance advantages over row-based computation. Batch processing minimizes instruction fetch and decode overhead, improves CPU utilization, reduces memory accesses and conditional branches, and increases parallelism, making vectorized HashJoin efficient when processing large datasets.

## Conclusion

In conclusion, TiDB's columnar `Column` and `Chunk` layout, inspired by the Apache Arrow memory format, combined with vectorized execution, improves CPU utilization, reduces memory access overhead, and curbs branching, resulting in significantly faster and more efficient query execution.
