Currently, when we write a set of nested loops to ensure 16-byte vectorized access, the code might look like this:
However, our current legalization pass transforms this into the following form:
While this transformation achieves functional correctness, it introduces additional complexity in the indexing expressions and splits the vectorized loop into smaller chunks (e.g., breaking the 16-element vectorized access into two 8-element accesses). This reduces the efficiency of vectorized memory operations and complicates the generated code.
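To make the cost of the split concrete, here is a minimal plain-Python model of the two loop shapes. It is not TileLang or TVM code and does not reproduce the original listings. The names `copy_single_loop`, `copy_split_loops`, `VECTOR_EXTENT`, and `SPLIT_FACTOR` are hypothetical, and the `accesses` counter simply assumes that each vectorized loop lowers to one vector memory operation.

```python
VECTOR_EXTENT = 16  # extent of the original vectorized loop
SPLIT_FACTOR = 8    # width of each chunk after the split (assumed from the issue text)


def copy_single_loop(src):
    """Model of the original form: one 16-wide vectorized loop."""
    dst = [None] * VECTOR_EXTENT
    accesses = 1  # assumption: the whole loop maps to a single vector access
    for v in range(VECTOR_EXTENT):
        dst[v] = src[v]
    return dst, accesses


def copy_split_loops(src):
    """Model of the legalized form: two 8-wide vectorized loops."""
    dst = [None] * VECTOR_EXTENT
    accesses = 0
    for outer in range(VECTOR_EXTENT // SPLIT_FACTOR):
        accesses += 1  # assumption: each inner loop is its own vector access
        for vec in range(SPLIT_FACTOR):
            idx = outer * SPLIT_FACTOR + vec  # compound index introduced by the split
            dst[idx] = src[idx]
    return dst, accesses
```

Both functions copy the same 16 elements in the same order, so the split is functionally correct. The only differences are the extra index arithmetic and the doubled access count in the split version.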
Proposed Enhancement:
To address this, the legalization pass should be enhanced to maintain the original vectorized structure and ensure that the indexing expressions remain as simple as possible. Specifically:
1. Preserve Single-Level Vectorization: Instead of breaking the 16-element vectorized loop into smaller subloops (e.g., two 8-element loops), the pass should retain the original T.vectorized(16) loop where possible.
2. Simplify Index Calculations: The pass should avoid introducing compound expressions like (v_3 * 8 + vec) for computing indices. Instead, it should keep a single loop variable v_3 over the full 16-element extent and derive any multi-dimensional indices directly from it (e.g., v_3 // 8 and v_3 % 8).
3. Optimize Performance: By preserving the larger vectorized loop and avoiding unnecessary transformations, the pass can generate more efficient, hardware-friendly code that takes better advantage of vectorized memory access.
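The index mapping in item 2 can be sketched in plain Python as follows. This is an illustration only, assuming a destination whose innermost dimension has extent 8. The helpers `to_2d` and `to_flat` and the constant `INNER_EXTENT` are hypothetical names, not part of any pass or library.

```python
INNER_EXTENT = 8  # assumed extent of the innermost buffer dimension


def to_2d(v):
    """Map a flat vectorized index (0..15) to (row, col) with one div and one mod."""
    return v // INNER_EXTENT, v % INNER_EXTENT


def to_flat(row, col):
    """Inverse mapping: the compound form a split loop nest would compute."""
    return row * INNER_EXTENT + col


# Every index of a 16-wide loop round-trips through the 2D form unchanged.
round_trip_ok = all(to_flat(*to_2d(v)) == v for v in range(16))
```

Because the mapping is a bijection over the 16 indices, a single loop variable carries the same information as the (outer, vec) pair, so the vectorized loop does not need to be split just to express a two-dimensional access.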