
Conversation


@LeiWang1999 LeiWang1999 commented Oct 13, 2025

This pull request introduces several improvements to the handling of dynamic and symbolic loop extents and parallelism in the layout and loop partitioning code, along with new tests for parallel execution with both static and dynamic extents. The main changes focus on supporting symbolic (non-constant) shapes, improving guard generation for dynamic loops, and enhancing test coverage for parallel computation.

Support for symbolic and dynamic loop extents:

  • Enhanced LayoutNode::Inverse() to correctly handle symbolic dimensions by detecting and warning about non-static shapes, and adjusting the bijectivity check accordingly.
  • Updated loop partitioning logic to remove the requirement for constant loop sizes, allowing for symbolic extents and using runtime guards to ensure correct execution when bounds cannot be statically proven. [1] [2]

Guard generation and correctness:

  • Added runtime guard generation in loop partitioning to ensure that index bounds are respected when loop extents are dynamic or symbolic, wrapping the loop body in an IfThenElse when needed.
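The guard pattern can be pictured with a scalar Python model (the names and the ceiling-split partitioning strategy here are illustrative, not the actual pass):

```python
def guarded_parallel_add(dst, src, n, num_threads=4):
    """Model of a partitioned parallel loop over a runtime extent n.

    Each thread covers ceil(n / num_threads) slots, so the partition may
    over-cover when n is not a multiple of num_threads; the `idx < n`
    check plays the role of the IfThenElse guard emitted when the bound
    cannot be proven statically.
    """
    per_thread = -(-n // num_threads)  # ceiling division
    for tid in range(num_threads):
        for v in range(per_thread):
            idx = tid * per_thread + v
            if idx < n:  # runtime guard
                dst[idx] = src[idx] + 1.0
    return dst
```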

Pragma unrolling improvements:

  • Modified the loop unrolling logic to skip explicit unrolling pragmas for loops with non-constant extents, preventing incorrect optimization on dynamic loops.
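In spirit, the unrolling change reduces to a constant-extent check after simplification; a hypothetical helper (not the C++ pass, which mutates For nodes) might look like:

```python
def unroll_kind(simplified_extent):
    """Pick a loop kind for a serial loop.

    `simplified_extent` models the analyzer-simplified extent: an int when
    it folds to a constant, a string naming a symbol otherwise. Only
    constant extents receive the explicit-unroll treatment.
    """
    return "unrolled" if isinstance(simplified_extent, int) else "serial"
```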

Test coverage for parallelism:

  • Added new tests in test_tilelang_language_parallel.py for parallel elementwise operations with both static and dynamic extents, including CUDA-based validation for correctness.

Summary by CodeRabbit

  • New Features

    • Improved handling of symbolic and dynamic shapes with automatic selection of stricter or relaxed mapping checks.
    • Loop partitioning and vectorization now handle non-constant extents and apply thread-offset normalization when needed.
  • Bug Fixes

    • Added runtime guards to keep inverse-mapped indices within valid bounds; skip unsafe unrolling when extents are dynamic.
    • Emit warnings and fall back to relaxed checks for dynamic symbolic dimensions.
  • Tests

    • Added CUDA tests validating parallel kernels for static and dynamic extents with runtime clipping.
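A plain-Python reference for the dynamic case reads naturally as a clipped map (assuming, per the summary above, that the kernel applies x − 1.0 to the first `length` elements and leaves the tail untouched; the authoritative semantics are in the new test file):

```python
def dynamic_minus_one_ref(buf, length):
    """Reference output for the dynamic-extent kernel: subtract 1.0 from
    the first `length` elements; elements at or beyond `length` are
    returned unchanged (the "runtime clipping")."""
    return [x - 1.0 if i < length else x for i, x in enumerate(buf)]
```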


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Oct 13, 2025

Walkthrough

Adds symbolic-shape-aware detection in Layout::Inverse (choosing Bijective vs NoCheck and warning on fallback). Reworks loop partitioning to handle dynamic extents with per-iteration guards, thread-offset normalization, and guarded emits; avoids unrolling non-constant extents. Vectorize extent now simplified via arith::Analyzer. Adds CUDA TileLang parallel tests.

Changes

  • Layout inverse, symbolic shape handling (src/layout/layout.cc, src/layout/layout.h): Introduces InverseWithLevel() returning (Layout, arith::IterMapLevel), uses collect_symbolic to gather symbolic dims from input_size_, OutputShape(), and existing symbols, reclassifies static vs dynamic dims, selects the DetectIterMap level (Bijective for static, NoCheck otherwise), and logs a warning on fallback.
  • Fragment node inverse update (src/layout/layout.cc FragmentNode methods, src/layout/layout.h): Adds FragmentNode::InverseWithLevel() (final) and updates FragmentNode::Inverse() to wrap and return the first element from InverseWithLevel().
  • Loop partitioning & guarded unrolling (src/transform/loop_partition.cc): Replaces constant-size assumptions with dynamic extents: collects per-iteration loop mins/extents, computes flattened indices using dynamic vector/thread extents, builds guards for inverse-mapped indices, applies thread-offset normalization, uses arith::Analyzer to prove guards and wraps the body in IfThenElse when needed, substitutes thread offsets before finalization, and skips unrolling when extents are not constant.
  • Vectorize planner, analyzer-driven extent simplification (src/transform/loop_vectorize.cc): Simplifies the extent expression with arith::Analyzer before converting to const int; behavior is unchanged if the extent remains non-constant.
  • TileLang CUDA parallel tests (testing/python/language/test_tilelang_language_parallel.py): New tests adding JIT TileLang CUDA kernels for static (+1.0) and dynamic (clipped −1.0) parallel elementwise operations; includes a CUDA-availability guard and checks across varied valid lengths.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Caller
  participant L as Layout::Inverse
  participant S as SymbolicCollector
  participant D as DetectIterMap

  C->>L: Inverse(input_size_)
  L->>S: collect_symbolic(input_size_)
  L->>S: collect_symbolic(OutputShape())
  Note over L: Merge symbolic dims and re-collect to classify\nstatic vs dynamic shapes
  alt Static (no symbolic dims)
    L->>D: DetectIterMap(level=Bijective)
  else Dynamic (symbolic dims present)
    L->>L: Log warning: fallback to NoCheck
    L->>D: DetectIterMap(level=NoCheck)
  end
  D-->>L: mapping result + level
  L-->>C: Inverse mapping (and level via InverseWithLevel)
sequenceDiagram
  autonumber
  participant P as LoopPartitioner
  participant E as ExtentResolver
  participant A as arith::Analyzer
  participant B as TransformedBody
  participant O as OutputEmitter

  P->>E: Compute vector_extent & thread_extent (dynamic)
  P->>P: Build flattened index (using extents)
  loop For each inner fragment/iteration
    P->>A: Derive/collect per-iteration bounds checks (mins/extents)
    alt Bounds provable
      A-->>P: No guard needed
    else Unknown
      P->>P: Accumulate guard expressions (with optional thread-offset substitute)
    end
  end
  alt Guard not provable
    P->>B: Simplify indices (BufferIndiceSimplify)
    P->>O: Emit IfThenElse(guard, B, skip)
  else
    P->>O: Emit B directly
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through loops with whiskered care,
Sniffing shapes that shift in air.
If dims are fixed I bind the key,
Else NoCheck lets bounds roam free.
CUDA hums — carrots counted neatly. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 6.67%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title "[Parallel] Support T.Parallel with dynamic extents" directly aligns with the primary objective of the changeset. The raw summary and PR objectives confirm that the main changes enable parallel loop execution (T.Parallel) to work with symbolic and dynamic loop extents rather than requiring constant extents. The title is concise, specific, and clearly communicates the core enhancement. A developer scanning the commit history would immediately understand that this PR extends the T.Parallel primitive to support dynamic extent scenarios, which is precisely what the layout, loop partitioning, vectorization, and test changes collectively achieve.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a229dd and e540e54.

📒 Files selected for processing (3)
  • src/layout/layout.cc (1 hunks)
  • src/transform/loop_partition.cc (3 hunks)
  • testing/python/language/test_tilelang_language_parallel.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/loop_partition.cc (2)
src/transform/lower_opaque_block.cc (4)
  • body (42-46)
  • body (42-42)
  • min (148-162)
  • min (148-149)
src/op/parallel.cc (4)
  • VisitStmt_ (130-146)
  • VisitStmt_ (130-130)
  • VisitStmt_ (148-160)
  • VisitStmt_ (148-148)
testing/python/language/test_tilelang_language_parallel.py (4)
tilelang/testing/__init__.py (1)
  • set_random_seed (30-35)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/language/kernel.py (1)
  • threads (215-219)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-metal
  • GitHub Check: build-test-amd

Comment on lines 97 to 108
PrimExpr min = loop->min;
PrimExpr extent = loop->extent;
PrimExpr lower_cond = guard_analyzer.Simplify(GE(indices[i], min));
if (!guard_analyzer.CanProve(lower_cond)) {
  guard = guard_analyzer.Simplify(logical_and(guard, lower_cond));
  need_guard = true;
}
PrimExpr upper_cond = guard_analyzer.Simplify(LT(indices[i], min + extent));
if (!guard_analyzer.CanProve(upper_cond)) {
  guard = guard_analyzer.Simplify(logical_and(guard, upper_cond));
  need_guard = true;
}

⚠️ Potential issue | 🔴 Critical

Substitute loop bounds before building the guard
When a parallel nest has inner loops whose min/extent depend on outer iterators (e.g. triangular tiles), the expressions in Lines 97‑108 still contain the original loop vars because we never apply Substitute(…, vmap) before generating lower_cond/upper_cond. After we drop the old loops, those vars become free symbols, so the emitted IfThenElse guard is ill-formed and rejects valid iterations at runtime. Please substitute loop->min and loop->extent with the accumulated vmap prior to guard construction (and use the substituted versions in both the guard synthesis and the later body substitution) so the guard is expressed solely in terms of the new iterator variables.

🤖 Prompt for AI Agents
In src/transform/loop_partition.cc around lines 97 to 108, the code constructs
lower_cond and upper_cond directly from loop->min and loop->extent which can
still reference outer loop variables; substitute those with the accumulated vmap
first so the guard contains only new iterator vars. Fix by computing PrimExpr
sub_min = Substitute(loop->min, vmap) and PrimExpr sub_extent =
Substitute(loop->extent, vmap) before building lower_cond/upper_cond, use
sub_min/sub_extent when calling guard_analyzer.Simplify and when composing
logical_and into guard, and ensure the later body substitution also uses these
substituted values (replace uses of loop->min/extent in guard synthesis and
subsequent substitutions with the substituted versions).
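The requested fix is ordinary substitution; a toy Python model (tuples stand in for PrimExpr, and `vmap` plays the role of the accumulated mapping from the diff, everything else is illustrative) shows why collecting bounds verbatim goes wrong for a triangular nest:

```python
def substitute(expr, vmap):
    """Rewrite variables in a toy expression: expressions are ints,
    variable-name strings, or (op, a, b) tuples."""
    if isinstance(expr, str):
        return vmap.get(expr, expr)
    if isinstance(expr, tuple):
        op, a, b = expr
        return (op, substitute(a, vmap), substitute(b, vmap))
    return expr

# Triangular nest: for i in [0, N): for j in [i, N). The inner min is the
# outer iterator "i"; once the old loops are dropped, "i" would be a free
# symbol, so it must be rewritten through vmap before guard synthesis.
vmap = {"i": ("+", "tid", "v")}        # i := new flattened index (illustrative)
collected_min = substitute("i", vmap)  # ("+", "tid", "v"), not a free "i"
```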

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/transform/loop_partition.cc (2)

100-112: Simplify the comment (verbose with duplicate example).

The comment spans 13 lines and repeats the same layout([i, j]) = floor((i * 16 + j) / 32) example twice (lines 105 and 108). Consider consolidating into a single, concise explanation.

Apply this diff to streamline the comment:

-  // Guard executes the recovered loop body only if each inverse-mapped iterator
-  // falls back into the original For ranges. We first check every axis from the
-  // old loop nest (old_loop_depth) and then the extra index produced by inverse
-  // layouts that carry a replicate/thread component (`inv_output_shape`). Both
-  // must stay within bounds to ensure correctness. Example: layout([i, j]) =
-  // floor((i * 16 + j) / 32) may generate extra points when the new loop
-  // enumerates 0..31; the guard drops iterations whose inverse-mapped (i, j)
-  // or replicate index fall outside their original extents.
-  // Example: layout([i, j]) = floor((i * 16 + j) / 32) may produce extra points
-  // when the new loop enumerates 0..31; this guard skips iterations where the
-  // inverse i, j land outside the original extents. This protects
-  // non-surjective loop_layout mappings that otherwise over-cover the parallel
-  // space.
+  // Guard the loop body so that inverse-mapped iterators stay within original
+  // For ranges (old_loop_depth axes) and any replicate/thread index from
+  // inv_output_shape remains in bounds. Non-surjective layouts (e.g.,
+  // layout([i,j]) = floor((i*16+j)/32)) may over-cover the parallel space,
+  // so the guard skips out-of-range iterations.

164-167: Consider reusing analyzer (minor inefficiency).

Creating a new arith::Analyzer at line 164 for each ForNode visit is slightly inefficient. While the logic is correct (skipping unroll for non-constant extents), consider making the analyzer a class member initialized once in the constructor to reduce allocations.

Apply this diff to reuse the analyzer:

 class LoopPramaUnroller : public StmtExprMutator {
 public:
-  LoopPramaUnroller() = default;
+  LoopPramaUnroller() : analyzer_(std::make_shared<arith::Analyzer>()) {}
 
 private:
   Stmt VisitStmt_(const ForNode *node) final {
     if (node->kind == ForKind::kSerial) {
-      auto analyzer = std::make_shared<arith::Analyzer>();
-      if (as_const_int(analyzer->Simplify(node->extent)) == nullptr) {
+      if (as_const_int(analyzer_->Simplify(node->extent)) == nullptr) {
         return StmtExprMutator::VisitStmt_(node);
       }
       For new_for = GetRef<For>(node);
       auto for_ptr = new_for.CopyOnWrite();
       for_ptr->annotations.Set(tir::attr::pragma_unroll_explicit, Bool(false));
       for_ptr->kind = ForKind::kUnrolled;
       return new_for;
     }
     return StmtExprMutator::VisitStmt_(node);
   }
+  std::shared_ptr<arith::Analyzer> analyzer_;
 };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7b2a1d and f806710.

📒 Files selected for processing (1)
  • src/transform/loop_partition.cc (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_partition.cc (1)
src/layout/layout.h (1)
  • Array (34-66)
🔇 Additional comments (5)
src/transform/loop_partition.cc (5)

71-71: LGTM: Dynamic extent binding.

Correctly binds the loop variable with a potentially symbolic extent from the layout's output shape.


78-79: LGTM: Tracking loop bounds for guards.

These arrays are correctly introduced to track dynamic loop bounds for later guard generation.


137-140: LGTM: Conditional guard emission.

Correctly emits an IfThenElse guard only when the analyzer cannot statically prove the guard, avoiding unnecessary runtime overhead for provably-safe cases.


149-153: LGTM: Thread offset substitution and unrolling.

Thread offset normalization is correctly applied to the final body before unrolling, ensuring thread indices are zero-based.


184-196: LGTM: Dynamic extent support in partitioner.

The flattening and index computation logic correctly handles dynamic extents by operating on PrimExpr values rather than assuming constant loop sizes.

Comment on lines +82 to 96
// Normalize thread var once so we can reuse the same substitution later.
Map<Var, PrimExpr> thread_offset_map;
bool has_thread_offset = false;
if (loop_layout->ThreadRange().defined()) {
  auto range = loop_layout->ThreadRange();
  thread_offset_map.Set(thread_var, thread_var - range->min);
  has_thread_offset = true;
}
for (int i = 0; i < old_loop_depth; i++) {
  const ForNode *loop = body.as<ForNode>();
  ICHECK(loop != nullptr);
  vmap.Set(loop->loop_var, indices[i]);
  loop_mins.push_back(loop->min);
  loop_extents.push_back(loop->extent);
  body = loop->body;

⚠️ Potential issue | 🔴 Critical

Substitute loop bounds incrementally before collecting.

The past review comment remains unaddressed. When an inner loop's min or extent depends on an outer iterator (e.g., triangular loops for j in [i, N)), lines 94–95 capture those bounds without substituting the accumulated vmap. After the new loops are constructed, the old loop variables become free symbols, rendering the guard at lines 119–121 ill‑formed and incorrectly rejecting valid iterations at runtime.

At iteration i of the extraction loop (lines 90–96), vmap already holds substitutions for loops 0..i-1. Before appending to loop_mins and loop_extents, apply Substitute(loop->min, vmap) and Substitute(loop->extent, vmap) so that the collected bounds are expressed solely in terms of the new iterator variables.

Apply this diff to fix the substitution:

     vmap.Set(loop->loop_var, indices[i]);
-    loop_mins.push_back(loop->min);
-    loop_extents.push_back(loop->extent);
+    loop_mins.push_back(Substitute(loop->min, vmap));
+    loop_extents.push_back(Substitute(loop->extent, vmap));
     body = loop->body;
🤖 Prompt for AI Agents
In src/transform/loop_partition.cc around lines 82 to 96, the loop mins and
extents are being collected without applying the current vmap, which leaves
inner-loop bounds depending on outer-loop variables and later produces
ill-formed guards; fix this by substituting accumulated mappings into the bounds
before storing them: replace pushing loop->min and loop->extent with pushing
Substitute(loop->min, vmap) and Substitute(loop->extent, vmap) (also ensure any
thread_offset_map substitution is applied consistently if needed) so the
collected bounds are expressed only in terms of new iterator variables.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/transform/loop_partition.cc (1)

94-100: Critical: Substitute loop bounds incrementally before collecting them.

This issue was flagged in the past review and remains unaddressed. When an inner loop's min or extent depends on an outer iterator (e.g., triangular loops for j in [i, N)), lines 98-99 capture those bounds without applying the accumulated vmap. After the body substitution at line 103, the old loop variables become free symbols, rendering the guard at lines 125-127 ill-formed and incorrectly rejecting valid iterations at runtime.

At each iteration i of the extraction loop, vmap already holds substitutions for loops 0..i-1. Before appending to loop_mins and loop_extents, apply Substitute(loop->min, vmap) and Substitute(loop->extent, vmap) so that the collected bounds are expressed solely in terms of the new iterator variables.

Apply this diff to fix the substitution:

     vmap.Set(loop->loop_var, indices[i]);
-    loop_mins.push_back(loop->min);
-    loop_extents.push_back(loop->extent);
+    loop_mins.push_back(Substitute(loop->min, vmap));
+    loop_extents.push_back(Substitute(loop->extent, vmap));
     body = loop->body;
🧹 Nitpick comments (2)
src/layout/layout.cc (2)

258-259: Consider documenting the level parameter's impact.

The change from a fixed arith::IterMapLevel::Bijective to a computed level is critical for dynamic shape support. Consider adding a brief inline comment explaining that level varies based on whether the shape is static or symbolic.

Apply this diff:

+  // Use determined level: Bijective for static shapes, NoCheck for dynamic
   arith::IterMapResult res =
       arith::DetectIterMap(forward_index_, getVarMap(), 1, level, &analyzer);

283-286: Redundant std::move on temporary object.

The std::move on inverse_result.first is unnecessary here. The compiler will automatically apply RVO (Return Value Optimization) on the returned pair element. While harmless, removing it makes the intent clearer.

Apply this diff:

 Layout LayoutNode::Inverse() const {
   auto inverse_result = InverseWithLevel();
-  return std::move(inverse_result.first);
+  return inverse_result.first;
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f806710 and 071699c.

📒 Files selected for processing (3)
  • src/layout/layout.cc (3 hunks)
  • src/layout/layout.h (3 hunks)
  • src/transform/loop_partition.cc (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/layout/layout.h (1)
src/layout/layout.cc (6)
  • Layout (57-70)
  • Layout (72-75)
  • InverseWithLevel (232-281)
  • InverseWithLevel (232-232)
  • InverseWithLevel (400-409)
  • InverseWithLevel (400-400)
src/transform/loop_partition.cc (2)
src/layout/layout.h (1)
  • Array (36-69)
src/op/parallel.cc (4)
  • VisitStmt_ (131-147)
  • VisitStmt_ (131-131)
  • VisitStmt_ (149-161)
  • VisitStmt_ (149-149)
src/layout/layout.cc (2)
tilelang/layout/layout.py (1)
  • Layout (13-148)
src/layout/swizzle.cc (2)
  • Inverse (66-69)
  • Inverse (66-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (12)
src/layout/layout.h (3)

10-11: LGTM! Necessary includes for the new interface.

The includes for iter_affine_map.h and utility support the new InverseWithLevel() return type std::pair<Layout, arith::IterMapLevel>.


41-41: LGTM! New interface method properly declared.

The new InverseWithLevel() method cleanly extends the Layout API to expose both the inverse layout and its iterator map level, enabling callers to make dynamic vs. static shape decisions.


82-82: LGTM! FragmentNode override properly declared.

The final specifier is appropriate for FragmentNode's override of InverseWithLevel().

src/layout/layout.cc (2)

232-257: LGTM! Symbolic shape detection correctly implemented.

The collect_symbolic lambda filters non-constant dimensions, and the logic correctly combines symbolic dimensions from both input_size_ and OutputShape(). The IterMapLevel selection (Bijective for static, NoCheck for dynamic) aligns with the PR's goal of supporting dynamic extents with runtime guards.


395-409: LGTM! FragmentNode correctly delegates to base implementation.

The implementation properly augments the input with replicate extent and thread information before delegating to the base Layout::InverseWithLevel(). The updated Inverse() correctly wraps the new InverseWithLevel() method.

src/transform/loop_partition.cc (7)

71-73: LGTM! Dynamic extent binding correctly implemented.

The analyzer now binds the loop variable to a range with dynamic extent from loop_layout->OutputShape()[i], which properly supports symbolic shapes.


79-84: LGTM! Guard necessity determined from layout bijectivity.

The use of InverseWithLevel() to obtain both the inverse layout and the IterMapLevel enables the code to conditionally generate guards only when the layout cannot be proven bijective. This aligns with the PR's goal of supporting dynamic extents safely.


86-93: LGTM! Thread offset normalization properly isolated.

Creating thread_offset_map upfront and tracking has_thread_offset enables consistent substitution later at lines 122-124, 133-135, and 157-159.


104-147: LGTM! Comprehensive guard generation for dynamic extents.

The guard logic correctly:

  1. Checks each inverse-mapped index against the original loop bounds (lines 120-129)
  2. Handles the replicate dimension if present (lines 131-142)
  3. Applies thread offset normalization when needed (lines 122-124, 133-135)
  4. Only emits the IfThenElse wrapper if the guard cannot be statically proven (lines 144-146)

The detailed comments clearly explain the purpose and example scenarios. Once the loop bounds substitution issue (lines 98-99) is fixed, this guard generation will be correct.


157-162: LGTM! Thread offset applied consistently before unrolling.

The thread offset substitution at lines 157-159 ensures the body uses normalized thread indices before the loop is unrolled at line 161, maintaining consistency with the earlier guard generation.


172-175: LGTM! Unrolling correctly skipped for non-constant extents.

Creating a local analyzer and checking for constant extent before unrolling prevents incorrect optimization on dynamic loops. This is essential for the PR's dynamic extent support.


192-204: LGTM! Partition logic correctly handles dynamic extents.

The flattened index computation now uses loop_vars_[i]->dom->extent directly instead of requiring compile-time constants. The expressions for access_idx, thd, and idx correctly work with dynamic extents.

Comment on lines +251 to +257
if (!is_static_shape) {
  // Runtime guards keep dynamic tails safe, so we allow NoCheck here and
  // warn.
  LOG(WARNING) << "Layout::Inverse on symbolic layout, falling back to "
                  "NoCheck; symbolic dims: "
               << symbolic_dims;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the NoCheck fallback doesn't mask correctness issues.

The warning states "Runtime guards keep dynamic tails safe," but ensure that all call sites that obtain InverseWithLevel() actually use the returned level to generate appropriate guards. If any caller ignores the level and assumes Bijective, dynamic shapes could produce incorrect results.

Run the following script to check which code paths use InverseWithLevel() and whether they respect the returned level:


🏁 Script executed:

```shell
#!/bin/bash
# Description: Find all usages of InverseWithLevel and verify they handle the returned level

# Search for InverseWithLevel calls
echo "=== InverseWithLevel call sites ==="
rg -nP 'InverseWithLevel\s*\(' --type=cpp -A 5

# Search for uses of the level from the pair
echo -e "\n=== Checking if returned level is used ==="
rg -nP '(auto|std::pair).*InverseWithLevel' --type=cpp -A 10
```

Length of output: 7817


Need to dive deeper to understand the full picture. Let me search for the wrapper methods and their callers, and for more specific context around this code:


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the broader context around the warning in layout.cc
echo "=== Context of warning in layout.cc (lines 245-260) ==="
sed -n '245,260p' src/layout/layout.cc

# Find the complete Inverse() method and check what it does
echo -e "\n=== Inverse() method around line 284 ==="
sed -n '280,290p' src/layout/layout.cc

# Find FragmentNode's wrapper Inverse() around line 396
echo -e "\n=== FragmentNode Inverse() around line 396 ==="
sed -n '390,400p' src/layout/layout.cc

# Check for any guards or assertions when Inverse() is called
echo -e "\n=== Search for usage patterns of Inverse() method ==="
rg -nP 'Inverse\(\)' --type=cpp -B 2 -A 3 -g '!test' -g '!*_test.cc'
```

Length of output: 5676


Let me check if the callers of Inverse() have mechanisms to detect and handle symbolic shapes:


---

🏁 Script executed:

```shell
#!/bin/bash
# Check the context of Inverse() calls to see if they guard against symbolic shapes
echo "=== Checking if parallel.cc:65 handles symbolic shapes ==="
sed -n '60,75p' src/op/parallel.cc

echo -e "\n=== Checking if parallel.cc:540 handles symbolic shapes ==="
sed -n '535,550p' src/op/parallel.cc

echo -e "\n=== Checking if copy.cc:1025 handles symbolic shapes ==="
sed -n '1020,1035p' src/op/copy.cc

# Check function signatures to understand context
echo -e "\n=== Function containing parallel.cc:65 ==="
rg -nB 20 'large_frag->Inverse' src/op/parallel.cc | head -40

echo -e "\n=== Check if there's any documented assumption about Inverse() ==="
rg -nP 'Inverse' src/layout/layout.h -B 3 -A 1
```

Length of output: 3891


Verify the NoCheck fallback doesn't mask correctness issues — CONCERN CONFIRMED.

The review comment identifies a real design flaw. Investigation reveals:

  1. InverseWithLevel() properly detects symbolic dimensions (lines 232-280) and returns NoCheck level when symbolic dims exist, with a warning logged.

  2. However, the public Inverse() wrapper method discards the returned level (line 283 and line 395), returning only the Layout object.

  3. Multiple call sites use Inverse() without guard detection:

    • src/op/parallel.cc:65 (ProveFragmentContains)
    • src/op/parallel.cc:540 and src/op/parallel.cc:614
    • src/op/copy.cc:1025

    These callers cannot detect whether the inverse layout requires runtime guards.

  4. Contrast: loop_partition.cc:81 properly uses InverseWithLevel() and checks the level before proceeding: bool need_guard = inverse_info.second != arith::IterMapLevel::Bijective;

The issue: The warning states guards keep dynamic tails safe, but Inverse() callers have no mechanism to know when guards are needed. If any of these callers operates on symbolic layouts, they could silently produce incorrect results without guards.

The fix requires either: (a) documenting that Inverse() assumes static/bijective layouts only, or (b) modifying callers to either detect symbolic shapes independently or switch to InverseWithLevel().

🤖 Prompt for AI Agents
In src/layout/layout.cc around lines 251-257 (and note the Inverse wrapper at
~lines 283 and 395), the current Inverse() discards the IterMapLevel returned by
InverseWithLevel() causing callers to be unable to detect when the inverse is
only valid with runtime guards; update the implementation and call sites so
callers don't silently miss this: either (1) change/overload Inverse() to
propagate the level (e.g., return a pair or provide an InverseChecked API) so
callers can inspect the IterMapLevel, or (2) update each caller that currently
calls Inverse() (src/op/parallel.cc lines ~65, ~540, ~614; src/op/copy.cc line
~1025) to call InverseWithLevel() directly and handle non-bijective results by
inserting the appropriate runtime guard or failing with a clear error, and add a
DCHECK or documentation in Inverse() that it must only be used for
static/bijective layouts if you keep the simpler API.
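The shape of option (b) can be sketched in Python (IterMapLevel, InverseWithLevel, and Inverse are the names from the PR; the enum values, the string placeholder, and the shape-based static check are illustrative only):

```python
from enum import Enum

class IterMapLevel(Enum):
    Bijective = 0
    NoCheck = 1

def inverse_with_level(shape):
    """Return (inverse, level): Bijective when every dim is a constant,
    NoCheck when any dim is symbolic (modeled here as a string)."""
    is_static = all(isinstance(d, int) for d in shape)
    level = IterMapLevel.Bijective if is_static else IterMapLevel.NoCheck
    return ("<inverse layout>", level)

def inverse(shape):
    # Thin wrapper that drops the level; per the review, safe only when the
    # caller can assume a static, bijective layout.
    return inverse_with_level(shape)[0]
```

Callers that may see symbolic shapes would call inverse_with_level and branch on the returned level, as loop_partition.cc already does.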

@LeiWang1999 LeiWang1999 merged commit 27701c3 into tile-ai:main Oct 20, 2025
5 checks passed