[cartesian] Loop blocking in vertical direction #1618

xyuan · 2024-08-20T17:46:41Z

In the current implementation, the library node expansion pattern doesn't support loop blocking in vertical direction, see below for the generated C++ code,

    {
        #pragma omp parallel for collapse(2)
        for (auto __tile_j = 0; __tile_j < __J; __tile_j += 8) {
            for (auto __tile_i = 0; __tile_i < __I; __tile_i += 8) {
                {
                    for (auto __i = (__tile_i - 1); __i < ((__tile_i + Min(8, (__I - __tile_i))) + 1); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            for (auto __k = 0; __k < __K; __k += 1) {
                                nested_state_0_0_0_0_11(__state, &dxa[((__dxa_I_stride * __i) + (__dxa_J_stride * __j))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 1)))], &__0_al[(((__K * (__j - __tile_j)) + ((8 * __K) * ((__i - __tile_i) + 1))) + __k)], __I, __K, __dxa_I_stride, __dxa_J_stride, __i, __q_I_stride, __q_J_stride, __q_K_stride);
                            }
                        }
                    }
                }
                {
                    for (auto __i = __tile_i; __i < (__tile_i + Min(8, (__I - __tile_i))); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            for (auto __k = 0; __k < __K; __k += 1) {
                                nested_state_0_0_0_0_12(__state, &__0_al[((((8 * __K) * (__i - __tile_i)) + (__K * (__j - __tile_j))) + __k)], &courant[(((__courant_I_stride * __i) + (__courant_J_stride * __j)) + (__courant_K_stride * __k))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 2)))], &xflux[(((__i * __xflux_I_stride) + (__j * __xflux_J_stride)) + (__k * __xflux_K_stride))], __K, __courant_I_stride, __courant_J_stride, __courant_K_stride, __q_I_stride, __q_J_stride, __q_K_stride, __xflux_I_stride, __xflux_J_stride, __xflux_K_stride);
                            }
                        }
                    }
                }
            }
        }
    }

}

This is fine when vertical loop size is small, however, when we run high resolution with larger dimensional size in vertical direction, this vertical loop is no longer cache friendly, we will need to add new library node expansion pattern to support loop blocking in vertical direction, with which more cache friendly code can be generated as fellowing,

    {
        #pragma omp parallel for collapse(2)
        for (auto __tile_j = 0; __tile_j < __J; __tile_j += 8) {
            for (auto __tile_i = 0; __tile_i < __I; __tile_i += 8) {
                 **for (auto __tile_k = 0; __tile_k < __K; __tile_k += 8) {**
                {
                    for (auto __i = (__tile_i - 1); __i < ((__tile_i + Min(8, (__I - __tile_i))) + 1); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            **for (auto __k = __tile_k; __k < (__tile_k + Min(8, (K__-__tile_k))); __k += 1) {**
                                nested_state_0_0_0_0_11(__state, &dxa[((__dxa_I_stride * __i) + (__dxa_J_stride * __j))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 1)))], &__0_al[(((__K * (__j - __tile_j)) + ((8 * __K) * ((__i - __tile_i) + 1))) + __k)], __I, __K, __dxa_I_stride, __dxa_J_stride, __i, __q_I_stride, __q_J_stride, __q_K_stride);
                            }
                        }
                    }
                  }
                }
                {
                    for (auto __i = __tile_i; __i < (__tile_i + Min(8, (__I - __tile_i))); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            **for (auto __k = __tile_k; __k < (__tile_k + Min(8, (__K - __tile_k))); __k += 1) {**
                                nested_state_0_0_0_0_12(__state, &__0_al[((((8 * __K) * (__i - __tile_i)) + (__K * (__j - __tile_j))) + __k)], &courant[(((__courant_I_stride * __i) + (__courant_J_stride * __j)) + (__courant_K_stride * __k))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 2)))], &xflux[(((__i * __xflux_I_stride) + (__j * __xflux_J_stride)) + (__k * __xflux_K_stride))], __K, __courant_I_stride, __courant_J_stride, __courant_K_stride, __q_I_stride, __q_J_stride, __q_K_stride, __xflux_I_stride, __xflux_J_stride, __xflux_K_stride);
                            }
                        }
                    }
                }
              }
            }
        }
    }

}

The bold code is the new code that added with the new library node expansion pattern. The corresponding PR will be added and update here for further review.

The text was updated successfully, but these errors were encountered:

FlorianDeconinck · 2024-08-20T17:58:17Z

Ping @twicki & @romanc to track for NASA

FlorianDeconinck changed the title ~~loop blocking in vertical direction~~ [cartesian] Loop blocking in vertical direction Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cartesian] Loop blocking in vertical direction #1618

[cartesian] Loop blocking in vertical direction #1618

xyuan commented Aug 20, 2024

FlorianDeconinck commented Aug 20, 2024

[cartesian] Loop blocking in vertical direction #1618

[cartesian] Loop blocking in vertical direction #1618

Comments

xyuan commented Aug 20, 2024

FlorianDeconinck commented Aug 20, 2024