-
Notifications
You must be signed in to change notification settings - Fork 269
[Parallel] Support T.Parallel
with dynamic extents
#990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
WalkthroughAdds symbolic-shape-aware detection in Layout::Inverse (choosing Bijective vs NoCheck and warning on fallback). Reworks loop partitioning to handle dynamic extents with per-iteration guards, thread-offset normalization, and guarded emits; avoids unrolling non-constant extents. Vectorize extent now simplified via arith::Analyzer. Adds CUDA TileLang parallel tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant C as Caller
participant L as Layout::Inverse
participant S as SymbolicCollector
participant D as DetectIterMap
C->>L: Inverse(input_size_)
L->>S: collect_symbolic(input_size_)
L->>S: collect_symbolic(OutputShape())
Note over L: Merge symbolic dims and re-collect to classify\nstatic vs dynamic shapes
alt Static (no symbolic dims)
L->>D: DetectIterMap(level=Bijective)
else Dynamic (symbolic dims present)
L->>L: Log warning: fallback to NoCheck
L->>D: DetectIterMap(level=NoCheck)
end
D-->>L: mapping result + level
L-->>C: Inverse mapping (and level via InverseWithLevel)
sequenceDiagram
autonumber
participant P as LoopPartitioner
participant E as ExtentResolver
participant A as arith::Analyzer
participant B as TransformedBody
participant O as OutputEmitter
P->>E: Compute vector_extent & thread_extent (dynamic)
P->>P: Build flattened index (using extents)
loop For each inner fragment/iteration
P->>A: Derive/collect per-iteration bounds checks (mins/extents)
alt Bounds provable
A-->>P: No guard needed
else Unknown
P->>P: Accumulate guard expressions (with optional thread-offset substitute)
end
end
alt Guard not provable
P->>B: Simplify indices (BufferIndiceSimplify)
P->>O: Emit IfThenElse(guard, B, skip)
else
P->>O: Emit B directly
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/layout/layout.cc
(1 hunks)src/transform/loop_partition.cc
(3 hunks)testing/python/language/test_tilelang_language_parallel.py
(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/loop_partition.cc (2)
src/transform/lower_opaque_block.cc (4)
body
(42-46)body
(42-42)min
(148-162)min
(148-149)src/op/parallel.cc (4)
VisitStmt_
(130-146)VisitStmt_
(130-130)VisitStmt_
(148-160)VisitStmt_
(148-148)
testing/python/language/test_tilelang_language_parallel.py (4)
tilelang/testing/__init__.py (1)
set_random_seed
(30-35)tilelang/jit/__init__.py (1)
jit
(244-317)tilelang/language/kernel.py (1)
threads
(215-219)tilelang/language/parallel.py (1)
Parallel
(8-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-metal
- GitHub Check: build-test-amd
src/transform/loop_partition.cc
Outdated
PrimExpr min = loop->min; | ||
PrimExpr extent = loop->extent; | ||
PrimExpr lower_cond = guard_analyzer.Simplify(GE(indices[i], min)); | ||
if (!guard_analyzer.CanProve(lower_cond)) { | ||
guard = guard_analyzer.Simplify(logical_and(guard, lower_cond)); | ||
need_guard = true; | ||
} | ||
PrimExpr upper_cond = guard_analyzer.Simplify(LT(indices[i], min + extent)); | ||
if (!guard_analyzer.CanProve(upper_cond)) { | ||
guard = guard_analyzer.Simplify(logical_and(guard, upper_cond)); | ||
need_guard = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substitute loop bounds before building the guard
When a parallel nest has inner loops whose min
/extent
depend on outer iterators (e.g. triangular tiles), the expressions in Lines 97‑108 still contain the original loop vars because we never apply Substitute(…, vmap)
before generating lower_cond
/upper_cond
. After we drop the old loops, those vars become free symbols, so the emitted IfThenElse
guard is ill-formed and rejects valid iterations at runtime. Please substitute loop->min
and loop->extent
with the accumulated vmap
prior to guard construction (and use the substituted versions in both the guard synthesis and the later body substitution) so the guard is expressed solely in terms of the new iterator variables.
🤖 Prompt for AI Agents
In src/transform/loop_partition.cc around lines 97 to 108, the code constructs
lower_cond and upper_cond directly from loop->min and loop->extent which can
still reference outer loop variables; substitute those with the accumulated vmap
first so the guard contains only new iterator vars. Fix by computing PrimExpr
sub_min = Substitute(loop->min, vmap) and PrimExpr sub_extent =
Substitute(loop->extent, vmap) before building lower_cond/upper_cond, use
sub_min/sub_extent when calling guard_analyzer.Simplify and when composing
logical_and into guard, and ensure the later body substitution also uses these
substituted values (replace uses of loop->min/extent in guard synthesis and
subsequent substitutions with the substituted versions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/transform/loop_partition.cc (2)
100-112
: Simplify the comment (verbose with duplicate example).The comment spans 13 lines and repeats the same
layout([i, j]) = floor((i * 16 + j) / 32)
example twice (lines 105 and 108). Consider consolidating into a single, concise explanation.Apply this diff to streamline the comment:
- // Guard executes the recovered loop body only if each inverse-mapped iterator - // falls back into the original For ranges. We first check every axis from the - // old loop nest (old_loop_depth) and then the extra index produced by inverse - // layouts that carry a replicate/thread component (`inv_output_shape`). Both - // must stay within bounds to ensure correctness. Example: layout([i, j]) = - // floor((i * 16 + j) / 32) may generate extra points when the new loop - // enumerates 0..31; the guard drops iterations whose inverse-mapped (i, j) - // or replicate index fall outside their original extents. - // Example: layout([i, j]) = floor((i * 16 + j) / 32) may produce extra points - // when the new loop enumerates 0..31; this guard skips iterations where the - // inverse i, j land outside the original extents. This protects - // non-surjective loop_layout mappings that otherwise over-cover the parallel - // space. + // Guard the loop body so that inverse-mapped iterators stay within original + // For ranges (old_loop_depth axes) and any replicate/thread index from + // inv_output_shape remains in bounds. Non-surjective layouts (e.g., + // layout([i,j]) = floor((i*16+j)/32)) may over-cover the parallel space, + // so the guard skips out-of-range iterations.
164-167
: Consider reusing analyzer (minor inefficiency).Creating a new
arith::Analyzer
at line 164 for eachForNode
visit is slightly inefficient. While the logic is correct (skipping unroll for non-constant extents), consider making the analyzer a class member initialized once in the constructor to reduce allocations.Apply this diff to reuse the analyzer:
class LoopPramaUnroller : public StmtExprMutator { public: - LoopPramaUnroller() = default; + LoopPramaUnroller() : analyzer_(std::make_shared<arith::Analyzer>()) {} private: Stmt VisitStmt_(const ForNode *node) final { if (node->kind == ForKind::kSerial) { - auto analyzer = std::make_shared<arith::Analyzer>(); - if (as_const_int(analyzer->Simplify(node->extent)) == nullptr) { + if (as_const_int(analyzer_->Simplify(node->extent)) == nullptr) { return StmtExprMutator::VisitStmt_(node); } For new_for = GetRef<For>(node); auto for_ptr = new_for.CopyOnWrite(); for_ptr->annotations.Set(tir::attr::pragma_unroll_explicit, Bool(false)); for_ptr->kind = ForKind::kUnrolled; return new_for; } return StmtExprMutator::VisitStmt_(node); } + std::shared_ptr<arith::Analyzer> analyzer_; };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/loop_partition.cc
(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_partition.cc (1)
src/layout/layout.h (1)
Array
(34-66)
🔇 Additional comments (5)
src/transform/loop_partition.cc (5)
71-71
: LGTM: Dynamic extent binding.Correctly binds the loop variable with a potentially symbolic extent from the layout's output shape.
78-79
: LGTM: Tracking loop bounds for guards.These arrays are correctly introduced to track dynamic loop bounds for later guard generation.
137-140
: LGTM: Conditional guard emission.Correctly emits an
IfThenElse
guard only when the analyzer cannot statically prove the guard, avoiding unnecessary runtime overhead for provably-safe cases.
149-153
: LGTM: Thread offset substitution and unrolling.Thread offset normalization is correctly applied to the final body before unrolling, ensuring thread indices are zero-based.
184-196
: LGTM: Dynamic extent support in partitioner.The flattening and index computation logic correctly handles dynamic extents by operating on
PrimExpr
values rather than assuming constant loop sizes.
// Normalize thread var once so we can reuse the same substitution later. | ||
Map<Var, PrimExpr> thread_offset_map; | ||
bool has_thread_offset = false; | ||
if (loop_layout->ThreadRange().defined()) { | ||
auto range = loop_layout->ThreadRange(); | ||
thread_offset_map.Set(thread_var, thread_var - range->min); | ||
has_thread_offset = true; | ||
} | ||
for (int i = 0; i < old_loop_depth; i++) { | ||
const ForNode *loop = body.as<ForNode>(); | ||
ICHECK(loop != nullptr); | ||
vmap.Set(loop->loop_var, indices[i]); | ||
loop_mins.push_back(loop->min); | ||
loop_extents.push_back(loop->extent); | ||
body = loop->body; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substitute loop bounds incrementally before collecting.
The past review comment remains unaddressed. When an inner loop's min
or extent
depends on an outer iterator (e.g., triangular loops for j in [i, N)
), lines 94–95 capture those bounds without substituting the accumulated vmap
. After the new loops are constructed, the old loop variables become free symbols, rendering the guard at lines 119–121 ill‑formed and incorrectly rejecting valid iterations at runtime.
At iteration i
of the extraction loop (lines 90–96), vmap
already holds substitutions for loops 0..i-1
. Before appending to loop_mins
and loop_extents
, apply Substitute(loop->min, vmap)
and Substitute(loop->extent, vmap)
so that the collected bounds are expressed solely in terms of the new iterator variables.
Apply this diff to fix the substitution:
vmap.Set(loop->loop_var, indices[i]);
- loop_mins.push_back(loop->min);
- loop_extents.push_back(loop->extent);
+ loop_mins.push_back(Substitute(loop->min, vmap));
+ loop_extents.push_back(Substitute(loop->extent, vmap));
body = loop->body;
🤖 Prompt for AI Agents
In src/transform/loop_partition.cc around lines 82 to 96, the loop mins and
extents are being collected without applying the current vmap, which leaves
inner-loop bounds depending on outer-loop variables and later produces
ill-formed guards; fix this by substituting accumulated mappings into the bounds
before storing them: replace pushing loop->min and loop->extent with pushing
Substitute(loop->min, vmap) and Substitute(loop->extent, vmap) (also ensure any
thread_offset_map substitution is applied consistently if needed) so the
collected bounds are expressed only in terms of new iterator variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/transform/loop_partition.cc (1)
94-100
: Critical: Substitute loop bounds incrementally before collecting them.This issue was flagged in the past review and remains unaddressed. When an inner loop's
min
orextent
depends on an outer iterator (e.g., triangular loopsfor j in [i, N)
), lines 98-99 capture those bounds without applying the accumulatedvmap
. After the body substitution at line 103, the old loop variables become free symbols, rendering the guard at lines 125-127 ill-formed and incorrectly rejecting valid iterations at runtime.At each iteration
i
of the extraction loop,vmap
already holds substitutions for loops0..i-1
. Before appending toloop_mins
andloop_extents
, applySubstitute(loop->min, vmap)
andSubstitute(loop->extent, vmap)
so that the collected bounds are expressed solely in terms of the new iterator variables.Apply this diff to fix the substitution:
vmap.Set(loop->loop_var, indices[i]); - loop_mins.push_back(loop->min); - loop_extents.push_back(loop->extent); + loop_mins.push_back(Substitute(loop->min, vmap)); + loop_extents.push_back(Substitute(loop->extent, vmap)); body = loop->body;
🧹 Nitpick comments (2)
src/layout/layout.cc (2)
258-259
: Consider documenting the level parameter's impact.The change from a fixed
arith::IterMapLevel::Bijective
to a computedlevel
is critical for dynamic shape support. Consider adding a brief inline comment explaining thatlevel
varies based on whether the shape is static or symbolic.Apply this diff:
+ // Use determined level: Bijective for static shapes, NoCheck for dynamic arith::IterMapResult res = arith::DetectIterMap(forward_index_, getVarMap(), 1, level, &analyzer);
283-286
: Redundant std::move on temporary object.The
std::move
oninverse_result.first
is unnecessary here. The compiler will automatically apply RVO (Return Value Optimization) on the returned pair element. While harmless, removing it makes the intent clearer.Apply this diff:
Layout LayoutNode::Inverse() const { auto inverse_result = InverseWithLevel(); - return std::move(inverse_result.first); + return inverse_result.first; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/layout/layout.cc
(3 hunks)src/layout/layout.h
(3 hunks)src/transform/loop_partition.cc
(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/layout/layout.h (1)
src/layout/layout.cc (6)
Layout
(57-70)Layout
(72-75)InverseWithLevel
(232-281)InverseWithLevel
(232-232)InverseWithLevel
(400-409)InverseWithLevel
(400-400)
src/transform/loop_partition.cc (2)
src/layout/layout.h (1)
Array
(36-69)src/op/parallel.cc (4)
VisitStmt_
(131-147)VisitStmt_
(131-131)VisitStmt_
(149-161)VisitStmt_
(149-149)
src/layout/layout.cc (2)
tilelang/layout/layout.py (1)
Layout
(13-148)src/layout/swizzle.cc (2)
Inverse
(66-69)Inverse
(66-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (12)
src/layout/layout.h (3)
10-11
: LGTM! Necessary includes for the new interface.The includes for
iter_affine_map.h
andutility
support the newInverseWithLevel()
return typestd::pair<Layout, arith::IterMapLevel>
.
41-41
: LGTM! New interface method properly declared.The new
InverseWithLevel()
method cleanly extends the Layout API to expose both the inverse layout and its iterator map level, enabling callers to make dynamic vs. static shape decisions.
82-82
: LGTM! FragmentNode override properly declared.The
final
specifier is appropriate for FragmentNode's override ofInverseWithLevel()
.src/layout/layout.cc (2)
232-257
: LGTM! Symbolic shape detection correctly implemented.The
collect_symbolic
lambda filters non-constant dimensions, and the logic correctly combines symbolic dimensions from bothinput_size_
andOutputShape()
. TheIterMapLevel
selection (Bijective for static, NoCheck for dynamic) aligns with the PR's goal of supporting dynamic extents with runtime guards.
395-409
: LGTM! FragmentNode correctly delegates to base implementation.The implementation properly augments the input with replicate extent and thread information before delegating to the base
Layout::InverseWithLevel()
. The updatedInverse()
correctly wraps the newInverseWithLevel()
method.src/transform/loop_partition.cc (7)
71-73
: LGTM! Dynamic extent binding correctly implemented.The analyzer now binds the loop variable to a range with dynamic extent from
loop_layout->OutputShape()[i]
, which properly supports symbolic shapes.
79-84
: LGTM! Guard necessity determined from layout bijectivity.The use of
InverseWithLevel()
to obtain both the inverse layout and theIterMapLevel
enables the code to conditionally generate guards only when the layout cannot be proven bijective. This aligns with the PR's goal of supporting dynamic extents safely.
86-93
: LGTM! Thread offset normalization properly isolated.Creating
thread_offset_map
upfront and trackinghas_thread_offset
enables consistent substitution later at lines 122-124, 133-135, and 157-159.
104-147
: LGTM! Comprehensive guard generation for dynamic extents.The guard logic correctly:
- Checks each inverse-mapped index against the original loop bounds (lines 120-129)
- Handles the replicate dimension if present (lines 131-142)
- Applies thread offset normalization when needed (lines 122-124, 133-135)
- Only emits the
IfThenElse
wrapper if the guard cannot be statically proven (lines 144-146)The detailed comments clearly explain the purpose and example scenarios. Once the loop bounds substitution issue (lines 98-99) is fixed, this guard generation will be correct.
157-162
: LGTM! Thread offset applied consistently before unrolling.The thread offset substitution at lines 157-159 ensures the body uses normalized thread indices before the loop is unrolled at line 161, maintaining consistency with the earlier guard generation.
172-175
: LGTM! Unrolling correctly skipped for non-constant extents.Creating a local analyzer and checking for constant extent before unrolling prevents incorrect optimization on dynamic loops. This is essential for the PR's dynamic extent support.
192-204
: LGTM! Partition logic correctly handles dynamic extents.The flattened index computation now uses
loop_vars_[i]->dom->extent
directly instead of requiring compile-time constants. The expressions foraccess_idx
,thd
, andidx
correctly work with dynamic extents.
if (!is_static_shape) { | ||
// Runtime guards keep dynamic tails safe, so we allow NoCheck here and | ||
// warn. | ||
LOG(WARNING) << "Layout::Inverse on symbolic layout, falling back to " | ||
"NoCheck; symbolic dims: " | ||
<< symbolic_dims; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify the NoCheck fallback doesn't mask correctness issues.
The warning states "Runtime guards keep dynamic tails safe," but ensure that all call sites that obtain InverseWithLevel()
actually use the returned level to generate appropriate guards. If any caller ignores the level and assumes Bijective, dynamic shapes could produce incorrect results.
Run the following script to check which code paths use InverseWithLevel()
and whether they respect the returned level:
🏁 Script executed:
#!/bin/bash
# Description: Find all usages of InverseWithLevel and verify they handle the returned level
# Search for InverseWithLevel calls
echo "=== InverseWithLevel call sites ==="
rg -nP 'InverseWithLevel\s*\(' --type=cpp -A 5
# Search for uses of the level from the pair
echo -e "\n=== Checking if returned level is used ==="
rg -nP '(auto|std::pair).*InverseWithLevel' --type=cpp -A 10
Length of output: 7817
Need to dive deeper to understand the full picture. Let me search for the wrapper methods and their callers:Now let me search for more specific information about the context of this code:
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the broader context around the warning in layout.cc
echo "=== Context of warning in layout.cc (lines 245-260) ==="
sed -n '245,260p' src/layout/layout.cc
# Find the complete Inverse() method and check what it does
echo -e "\n=== Inverse() method around line 284 ==="
sed -n '280,290p' src/layout/layout.cc
# Find FragmentNode's wrapper Inverse() around line 396
echo -e "\n=== FragmentNode Inverse() around line 396 ==="
sed -n '390,400p' src/layout/layout.cc
# Check for any guards or assertions when Inverse() is called
echo -e "\n=== Search for usage patterns of Inverse() method ==="
rg -nP 'Inverse\(\)' --type=cpp -B 2 -A 3 -g '!test' -g '!*_test.cc'
Length of output: 5676
Let me check if the callers of Inverse()
have mechanisms to detect and handle symbolic shapes:
---
🏁 Script executed:
```shell
#!/bin/bash
# Check the context of Inverse() calls to see if they guard against symbolic shapes
echo "=== Checking if parallel.cc:65 handles symbolic shapes ==="
sed -n '60,75p' src/op/parallel.cc
echo -e "\n=== Checking if parallel.cc:540 handles symbolic shapes ==="
sed -n '535,550p' src/op/parallel.cc
echo -e "\n=== Checking if copy.cc:1025 handles symbolic shapes ==="
sed -n '1020,1035p' src/op/copy.cc
# Check function signatures to understand context
echo -e "\n=== Function containing parallel.cc:65 ==="
rg -nB 20 'large_frag->Inverse' src/op/parallel.cc | head -40
echo -e "\n=== Check if there's any documented assumption about Inverse() ==="
rg -nP 'Inverse' src/layout/layout.h -B 3 -A 1
Length of output: 3891
Verify the NoCheck fallback doesn't mask correctness issues — CONCERN CONFIRMED.
The review comment identifies a real design flaw. Investigation reveals:
-
InverseWithLevel() properly detects symbolic dimensions (lines 232-280) and returns
NoCheck
level when symbolic dims exist, with a warning logged. -
However, the public Inverse() wrapper method discards the returned level (line 283 and line 395), returning only the Layout object.
-
Multiple call sites use Inverse() without guard detection:
src/op/parallel.cc:65
(ProveFragmentContains)src/op/parallel.cc:540
andsrc/op/parallel.cc:614
src/op/copy.cc:1025
These callers cannot detect whether the inverse layout requires runtime guards.
-
Contrast: loop_partition.cc:81 properly uses InverseWithLevel() and checks the level before proceeding:
bool need_guard = inverse_info.second != arith::IterMapLevel::Bijective;
The issue: The warning states guards keep dynamic tails safe, but Inverse() callers have no mechanism to know when guards are needed. If any of these callers operates on symbolic layouts, they could silently produce incorrect results without guards.
The fix requires either: (a) documenting that Inverse() assumes static/bijective layouts only, or (b) modifying callers to either detect symbolic shapes independently or switch to InverseWithLevel().
🤖 Prompt for AI Agents
In src/layout/layout.cc around lines 251-257 (and note the Inverse wrapper at
~lines 283 and 395), the current Inverse() discards the IterMapLevel returned by
InverseWithLevel() causing callers to be unable to detect when the inverse is
only valid with runtime guards; update the implementation and call sites so
callers don't silently miss this: either (1) change/overload Inverse() to
propagate the level (e.g., return a pair or provide an InverseChecked API) so
callers can inspect the IterMapLevel, or (2) update each caller that currently
calls Inverse() (src/op/parallel.cc lines ~65, ~540, ~614; src/op/copy.cc line
~1025) to call InverseWithLevel() directly and handle non-bijective results by
inserting the appropriate runtime guard or failing with a clear error, and add a
DCHECK or documentation in Inverse() that it must only be used for
static/bijective layouts if you keep the simpler API.
This pull request introduces several improvements to the handling of dynamic and symbolic loop extents and parallelism in the layout and loop partitioning code, along with new tests for parallel execution with both static and dynamic extents. The main changes focus on supporting symbolic (non-constant) shapes, improving guard generation for dynamic loops, and enhancing test coverage for parallel computation.
Support for symbolic and dynamic loop extents:
LayoutNode::Inverse()
to correctly handle symbolic dimensions by detecting and warning about non-static shapes, and adjusting the bijectivity check accordingly.Guard generation and correctness:
IfThenElse
when needed.Pragma unrolling improvements:
Test coverage for parallelism:
test_tilelang_language_parallel.py
for parallel elementwise operations with both static and dynamic extents, including CUDA-based validation for correctness.Summary by CodeRabbit
New Features
Bug Fixes
Tests