[Feature Request] LayoutInference pass should be enhanced to analyze the vectorize factor across indices #266

Closed
LeiWang1999 opened this issue Dec 12, 2024 · 1 comment
Labels
enhancement New feature or request


@LeiWang1999
Contributor

Currently, when we write a set of nested loops to ensure 16-byte vectorized access, the code might look like this:

for i in range(1):
    for v_3 in T.vectorized(16):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8]

However, our current legalization pass transforms this into the following form:

for i, v_3 in T.grid(1, 2):
    for vec in T.vectorized(8):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8]

While this transformation is functionally correct, it complicates the indexing expressions and splits the vectorized loop into smaller chunks: the 16-element vectorized access becomes two 8-element accesses. This halves the width of each vectorized memory operation and complicates the generated code.
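As a sanity check on the claim of functional correctness, the two loop nests above can be compared in plain Python (no TVM or TileLang needed). This is only an illustrative sketch: the helper names are hypothetical, only the two trailing indices are modelled because the leading ones do not depend on `v_3`, and the thread count of 128 is an assumption inferred from `tx // 16` feeding a dimension of extent 8.

```python
def original_indices(tx, v_3):
    # Trailing two indices of B_shared inside T.vectorized(16).
    return (tx % 8 * 2 + v_3 // 8, v_3 % 8)


def split_indices(tx, v_3, vec):
    # Trailing two indices after the loop is split into T.grid(1, 2) x T.vectorized(8).
    fused = v_3 * 8 + vec
    return (tx % 8 * 2 + fused // 8, fused % 8)


def check_equivalence(num_threads=128):
    # num_threads is an assumed launch size, not stated in the issue.
    for tx in range(num_threads):
        before = [original_indices(tx, v) for v in range(16)]
        after = [split_indices(tx, v_3, vec)
                 for v_3 in range(2) for vec in range(8)]
        if before != after:
            return False
    return True
```

Both forms visit the same elements in the same order, so the split changes only the width of each vectorized access, not which elements are copied.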

Proposed Enhancement:
To address this, the legalization pass should be enhanced to maintain the original vectorized structure and ensure that the indexing expressions remain as simple as possible. Specifically:
1. Preserve Single-Level Vectorization: Instead of breaking the 16-element vectorized loop into smaller subloops (e.g., two 8-element loops), the pass should retain the original T.vectorized(16) loop where possible.
2. Simplify Index Calculations: The pass should avoid introducing complex expressions like (v_3 * 8 + vec) for computing indices. Instead, it should aim to directly map the v_3 indices to the original structure (e.g., v_3 // 8 and v_3 % 8).
3. Optimize Performance: By preserving the larger vectorized loop and avoiding unnecessary transformations, the pass can generate more efficient, hardware-friendly code that takes better advantage of vectorized memory access.
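One way to picture what analyzing the vectorize factor across indices could mean: flatten all indices into a single row-major offset and ask for the widest aligned run of loop iterations that maps to consecutive offsets. The sketch below does this by brute-force enumeration in plain Python. It is not the LayoutInference implementation nor the change in any particular PR; `flat_offset` and `max_vector_factor` are hypothetical helpers, and the trailing shape `(16, 8)` and the 128 threads are assumptions inferred from the index expressions.

```python
def flat_offset(indices, shape):
    # Row-major offset of a multi-dimensional index.
    offset = 0
    for idx, extent in zip(indices, shape):
        offset = offset * extent + idx
    return offset


def max_vector_factor(index_fn, shape, loop_extent, outer_values):
    # Try every divisor of loop_extent, largest first, and return the first
    # one for which each aligned group of `factor` iterations touches
    # `factor` consecutive, factor-aligned flat offsets.
    candidates = [f for f in range(loop_extent, 0, -1) if loop_extent % f == 0]
    for factor in candidates:
        ok = True
        for outer in outer_values:
            for base in range(0, loop_extent, factor):
                start = flat_offset(index_fn(outer, base), shape)
                if start % factor != 0:
                    ok = False
                for k in range(factor):
                    if flat_offset(index_fn(outer, base + k), shape) != start + k:
                        ok = False
        if ok:
            return factor
    return 1


threads = range(128)  # assumed thread count

# Looking at the last index alone: v_3 % 8 wraps around after 8 elements.
last_only = max_vector_factor(lambda tx, v: (v % 8,), (8,), 16, threads)

# Looking at the last two indices jointly: the offset is tx % 8 * 16 + v_3.
joint = max_vector_factor(
    lambda tx, v: (tx % 8 * 2 + v // 8, v % 8), (16, 8), 16, threads)

print(last_only, joint)  # 8 16
```

Under these assumptions, inspecting the last index in isolation caps the factor at 8, while the joint view shows that all 16 elements are contiguous, which is why the original `T.vectorized(16)` loop can be kept.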

@LeiWang1999
Contributor Author

LeiWang1999 commented Dec 16, 2024

Closed, as this has been implemented by PR #268.
