Iterator contexts prototype (chapel-lang#24488)

This PR is the initial step towards implementing features primarily designed and discussed in:
- chapel-lang#16405
- Cray/chapel-private#4349
- Cray/chapel-private#5216
- and the aspirations of chapel-lang#9529 and chapel-lang#6334

It enables nightly GPU testing on the newly-added `test/users/engin/context`.

### ADDED FEATURES -- these are not user-facing at all

There are two umbrella features introduced here:
- querying things like number of tasks and task IDs from inside the body of a `forall` loop
- hoisting variable declarations from the loop body into upper contexts that typically represent `coforall`s and `foreach`es in the iterator implementation. This allows, for example, locale-private variables to be declared inside the loop body.

These features are enabled when compiling with `--iterator-contexts`, which is off by default. The flag `--report-context-adjustments` enables debugging printouts for compiler developers.

First, the iterators to be used in parallel loops are augmented with "handles." These are instances of the `Context` type defined in `test/users/engin/context/ChapelContextSupport.chpl`. For example:

    var onCtx = new Context(rank=rank, taskId=locDomIdx, numTasks=locDoms.domain.shape);

The `taskId` indicates the position of the current task within the rank-dimensional space of the shape `numTasks`. The innermost foreach loop serves as a handle automatically.

Second, to use the above in a forall loop, the loop body is augmented with something like this:

    const context = new Context();
    const vectorContext = __primitive("outer context", ctx1, context);
    const localTaskContext = __primitive("outer context", ctx1, vectorContext);
    const localeContext = __primitive("outer context", ctx2, localTaskContext);
    // where
    type ctx1 = Context(1, int(64));  // 1-d space of tasks
    type ctx2 = Context(2, (int(64), int(64)));  // 2-d space of tasks

These variables will be mapped to the iterator(s)' handles, starting with the (dynamically) innermost handle.
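
As a rough mental model of the handle chain described above, the following C++ sketch treats each handle as a node that knows its own `taskId`, the shape `numTasks`, and the handle that encloses it. Everything here is hypothetical and for illustration only: `Handle`, `outerContext`, and `linearTaskId` are not names from the compiler or from `ChapelContextSupport.chpl`, and the row-major flattening is an assumption, not something the PR specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a `Context` handle: a task's position (taskId)
// within a rank-dimensional space of tasks whose shape is numTasks.
struct Handle {
  std::vector<long> taskId;
  std::vector<long> numTasks;
  const Handle* outer;  // next enclosing handle, or nullptr at the outermost
};

// Models the "outer context" primitive: step from one handle to the one
// that dynamically encloses it.
const Handle* outerContext(const Handle* inner) {
  assert(inner != nullptr);
  return inner->outer;
}

// Flattens a rank-dimensional taskId into a single row-major index
// (an illustrative assumption, not part of the PR).
long linearTaskId(const Handle& h) {
  assert(h.taskId.size() == h.numTasks.size());
  long idx = 0;
  for (std::size_t d = 0; d < h.taskId.size(); ++d) {
    assert(h.taskId[d] >= 0 && h.taskId[d] < h.numTasks[d]);
    idx = idx * h.numTasks[d] + h.taskId[d];
  }
  return idx;
}
```

In this model, starting at the innermost (vector) handle and calling `outerContext` twice would reach the locale-level handle, mirroring the `vectorContext`, `localTaskContext`, `localeContext` sequence in the loop body.
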
This mapping enables querying of the current task's position, e.g., `localeContext.taskId`. Currently it is the user's responsibility to match the type of the variable and the type of the corresponding handle.

Finally, to hoist a variable in the loop body to the corresponding context, declare it using split-init as follows:

    var localTile;
    {
      const ref locSubDom = Dom.localSubdomain();
      localTile = Input[{locSubDom.dim(1), locSubDom.dim(0)}];
    }
    __primitive("hoist to context", localeContext, localTile);

The contents of the block will be hoisted together with the variable being declared, here `localTile`, to the context associated with the variable in the primitive, here `localeContext`. Array, c_array, and barrier variables are currently supported. See the code in `test/users/engin/context` for examples.

### IMPLEMENTATION OVERVIEW

The core of the implementation is in a new file called `lowerLoopContexts` and fires at the end of iterator lowering. The new primitives PRIM_INNERMOST_CONTEXT, PRIM_OUTER_CONTEXT, and PRIM_HOIST_TO_CONTEXT, together with a special `Context` type, are central to the implementation. Users can use `Context` variables to "find" outer contexts, and in turn hoist variable declarations into such contexts. Currently arrays, barriers and c_arrays can be moved around in this manner.

The current design requires iterators to be mildly modified to make use of these features. This PR uses its own iterators that are pretty much copied directly from DR and Block distribution except for the few added lines in support of this PR. `test/users/engin/context` has those iterators, module support that will eventually turn into an internal module, and finally some tests that demonstrate how these features can be used.

The `.good` files in `test/users/engin/context` adjust to the tests behaving differently for locale model = flat vs. gpu.
There, our start_test framework chooses:
* `testname.comm-none.lm-gpu.good` for comm==none and lm==gpu (obviously)
* `testname.comm-none.good` for comm==none and lm!=gpu
* `testname.good` for comm!=none, whether lm==gpu or not

The tests behave differently for comm==none and lm==flat because only in this configuration does the compiler remove on-statements early, and the implementation has not been adjusted to handle this properly. This is a todo item.

While there:
* Tidy up CHPL_NIGHTLY_TEST_DIRS in GPU-related scripts in util/cron.
* Remove the non-portable sed option `-i` from test/gpu/native/noGpu/basicMem.prediff.

### NEXT STEPS

We would like to implement the user-facing syntax proposed in Cray/chapel-private#5216 to facilitate writing more code using these features and help with the final user-facing design.

Implementation-wise, the immediate next steps are:
* Enable `test/users/engin/context/transpose.chpl` for GPUs.
* Implement detection of handles properly when comm==none and lm==flat, i.e., when `on`-statements are removed from the AST early and so multiple handles can end up in a single block.
* Is `_ddata_allocate_noinit_gpu_shared()` newly-added to ChapelBase.chpl needed?
* Resolve the compiler crash with --verify and lm==gpu observable in `test/gpu/native/distArray/blockUseInFunction.chpl`, a few tests under release/examples, etc.
* Improve the prototype syntax; perhaps switch to block-based syntax.
* Revisit how barriers should be hoisted w.r.t. automatic adjustment of the number of tasks.

Some steps for productization:
* Implement hoisting as part of lowerForallStmtsInline().
* Add the creation of `Context` handles into our standard iterators, including DefaultRectangular, BlockDist, etc.
* ... and ensure they are removed when unused.

The branch has been developed by @e-kayrakli, @DanilaFe and @vasslitvinov. Earlier dev history: 98f9c70..eccaf0d and b09bbfc..5250416.

Reviewed by: @e-kayrakli.
Merged by: @vasslitvinov.
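
The `.good` file choice that the commit message describes for `test/users/engin/context` can be restated as a small decision function. This is only a sketch of the three cases listed in the message; `goodFileFor` is a hypothetical name, `comm` and `lm` are passed as plain strings for illustration, and the real start_test logic is more general than this.

```cpp
#include <cassert>
#include <string>

// Returns the expected-output file name for a test, following the three
// cases listed in the commit message. Hypothetical helper, not start_test.
std::string goodFileFor(const std::string& testname,
                        const std::string& comm,
                        const std::string& lm) {
  if (comm == "none" && lm == "gpu") {
    return testname + ".comm-none.lm-gpu.good";
  }
  if (comm == "none") {
    return testname + ".comm-none.good";  // comm==none and lm!=gpu
  }
  return testname + ".good";  // comm!=none, whether lm==gpu or not
}
```

The most specific match is tested first, so a comm==none, lm==gpu run never falls through to the more general file.
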
arezaii · Feb 24, 2024 · 74482f5
2 parents 2f4febd + 73171d1
commit 74482f5
Showing 80 changed files with 2,055 additions and 19 deletions.
diff --git a/compiler/AST/primitive.cpp b/compiler/AST/primitive.cpp
@@ -175,6 +175,11 @@ returnInfoFirst(CallExpr* call) {
   return call->get(1)->qualType();
 }
 
+static QualifiedType
+returnInfoFirstAsValue(CallExpr* call) {
+  return QualifiedType(Qualifier::QUAL_CONST_VAL, call->get(1)->qualType().type());
+}
+
 static QualifiedType
 returnInfoFirstDeref(CallExpr* call) {
   QualifiedType tmp = call->get(1)->qualType();
@@ -711,6 +716,10 @@ initPrimitive() {
   // use for any primitives not in this list
   primitives[PRIM_UNKNOWN] = NULL;
 
+  prim_def(PRIM_INNERMOST_CONTEXT, "innermost context", returnInfoFirstAsValue);
+  prim_def(PRIM_OUTER_CONTEXT, "outer context", returnInfoFirst);
+  prim_def(PRIM_HOIST_TO_CONTEXT, "hoist to context", returnInfoVoid);
+
   prim_def(PRIM_ACTUALS_LIST, "actuals list", returnInfoVoid);
   prim_def(PRIM_NOOP, "noop", returnInfoVoid);
   // dst, src. PRIM_MOVE can set a reference.
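
The difference between the two return-info helpers used for the new primitives in this hunk can be illustrated with simplified stand-in types. This is a sketch under stated assumptions: the `Qualifier` enum and `QualifiedType` struct below are hypothetical reductions of the compiler's real classes (which hold a `Type*`, not a name), and `firstAsIs` / `firstAsValue` are illustrative names for the behavior of `returnInfoFirst` and `returnInfoFirstAsValue`.

```cpp
#include <cassert>
#include <string>

// Simplified, hypothetical stand-ins for the compiler's types.
enum class Qualifier { REF, CONST_REF, VAL, CONST_VAL };

struct QualifiedType {
  Qualifier qual;
  std::string type;  // the real compiler stores a Type*, not a name
};

// Like returnInfoFirst: forward the first actual's qualifier and type as-is.
QualifiedType firstAsIs(const QualifiedType& firstActual) {
  return firstActual;
}

// Like returnInfoFirstAsValue: keep the type, force a const-value qualifier.
QualifiedType firstAsValue(const QualifiedType& firstActual) {
  return QualifiedType{Qualifier::CONST_VAL, firstActual.type};
}
```

Under this reading, "innermost context" always yields a const value of its first actual's type, while "outer context" keeps whatever qualifier the first actual carries.
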

diff --git a/compiler/include/driver.h b/compiler/include/driver.h
@@ -253,6 +253,8 @@ extern bool fReportOptimizedOn;
 extern bool fReportPromotion;
 extern bool fReportScalarReplace;
 extern bool fReportGpu;
+extern bool fIteratorContexts;
+extern bool fReportContextAdj;
 extern bool fReportDeadBlocks;
 extern bool fReportDeadModules;
 extern bool fReportGpuTransformTime;

diff --git a/compiler/include/lowerLoopContexts.h b/compiler/include/lowerLoopContexts.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright 2020-2024 Hewlett Packard Enterprise Development LP
+ * Copyright 2004-2019 Cray Inc.
+ * Other additional copyright holders may be indicated within.
+ *
+ * The entirety of this work is licensed under the Apache License,
+ * Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef _LOWER_LOOP_CONTEXTS_H_
+#define _LOWER_LOOP_CONTEXTS_H_
+
+void lowerContexts();
+
+#endif
diff --git a/compiler/main/driver.cpp b/compiler/main/driver.cpp
@@ -295,6 +295,8 @@ bool fReportOptimizeForallUnordered = false;
 bool fReportPromotion = false;
 bool fReportScalarReplace = false;
 bool fReportGpu = false;
+bool fIteratorContexts = false;
+bool fReportContextAdj = false;
 bool fReportDeadBlocks = false;
 bool fReportDeadModules = false;
 bool fReportGpuTransformTime = false;
@@ -1470,6 +1472,8 @@ static ArgumentDescription arg_desc[] = {
  {"report-promotion", ' ', NULL, "Print information about scalar promotion", "F", &fReportPromotion, NULL, NULL},
  {"report-scalar-replace", ' ', NULL, "Print scalar replacement stats", "F", &fReportScalarReplace, NULL, NULL},
  {"report-gpu", ' ', NULL, "Print information about what loops are and are not GPU eligible", "F", &fReportGpu, NULL, NULL},
+ {"iterator-contexts", ' ', NULL, "Handle iterator contexts", "F", &fIteratorContexts, NULL, NULL},
+ {"report-context-adjustments", ' ', NULL, "Print debugging information while handling iterator contexts", "F", &fReportContextAdj, NULL, NULL},
 
  {"", ' ', NULL, "Developer Flags -- Miscellaneous", NULL, NULL, NULL, NULL},
  {"allow-noinit-array-not-pod", ' ', NULL, "Allow noinit for arrays of records", "N", &fAllowNoinitArrayNotPod, "CHPL_BREAK_ON_CODEGEN", NULL},

diff --git a/compiler/optimizations/gpuTransforms.cpp b/compiler/optimizations/gpuTransforms.cpp
@@ -616,6 +616,7 @@ GpuizableLoop::GpuizableLoop(BlockStmt *blk) {
   INT_ASSERT(blk->getFunction());
 
   this->loop_ = toCForLoop(blk);
+
   this->parentFn_ = toFnSymbol(blk->getFunction());
   this->assertionReporter_.noteGpuizableAssertion(findCompileTimeGpuAssertions());
   this->isEligible_ = evaluateLoop();
@@ -1024,6 +1025,7 @@ class GpuKernel {
   static bool isCallToPrimitiveWeShouldNotCopyIntoKernel(CallExpr *call);
   void populateBody(FnSymbol *outlinedFunction);
   void normalizeOutlinedFunction();
+  void setLateGpuizationFailure(bool flag);
   void finalize();
 
   void generateIndexComputation();
@@ -1306,7 +1308,7 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
                 addKernelArgument(sym);
               }
               else {
-                INT_FATAL("Malformed PRIM_GET_MEMBER_*");
+                this->setLateGpuizationFailure(true);
               }
             }
             else if (parent->isPrimitive()) {
@@ -1322,15 +1324,15 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
               }
             }
             else {
-              INT_FATAL("Unexpected call expression");
+              this->setLateGpuizationFailure(true);
             }
           } else if (CondStmt* cond = toCondStmt(symExpr->parentExpr)) {
             // Parent is a conditional statement.
             if (symExpr == cond->condExpr) {
               addKernelArgument(sym);
             }
           } else {
-            INT_FATAL("Unexpected symbol expression");
+            this->setLateGpuizationFailure(true);
           }
         }
       }
@@ -1344,6 +1346,9 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
   update_symbols(outlinedFunction->body, &copyMap_);
 }
 
+void GpuKernel::setLateGpuizationFailure(bool flag) {
+  this->lateGpuizationFailure_ = flag;
+}
 
 void GpuKernel::normalizeOutlinedFunction() {
   normalize(fn_);
@@ -1355,7 +1360,7 @@ void GpuKernel::normalizeOutlinedFunction() {
   collectDefExprs(fn_, defExprsInBody);
   for_vector (DefExpr, def, defExprsInBody) {
     if(def->sym->type == dtUnknown) {
-      this->lateGpuizationFailure_ = true;
+      this->setLateGpuizationFailure(true);
     }
   }
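
The hunks above replace several `INT_FATAL` calls with `setLateGpuizationFailure(true)`. The general pattern, recording a failure and letting the caller decide what to do instead of aborting on the spot, might look like the sketch below. `KernelSketch`, `addUse`, and the string "kinds" are hypothetical simplifications; what the compiler actually does once the flag is set is not visible in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical kernel builder: unsupported inputs mark the kernel as failed
// instead of aborting, so the caller can choose how to recover.
class KernelSketch {
 public:
  void addUse(const std::string& kind) {
    if (kind == "member" || kind == "primitive" || kind == "cond") {
      args_.push_back(kind);  // supported shape: becomes a kernel argument
    } else {
      setLateFailure(true);   // previously this would have been a fatal error
    }
  }

  void setLateFailure(bool flag) { lateFailure_ = flag; }
  bool failed() const { return lateFailure_; }
  std::size_t numArgs() const { return args_.size(); }

 private:
  std::vector<std::string> args_;
  bool lateFailure_ = false;
};
```

Routing every write through one setter, as the commit does, also gives a single place to add logging or a breakpoint later.
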
 

diff --git a/compiler/passes/checkResolved.cpp b/compiler/passes/checkResolved.cpp
@@ -585,6 +585,14 @@ checkReturnPaths(FnSymbol* fn) {
   }
 }
 
+static void checkIteratorContextPrimitives(CallExpr* call) {
+  if (call->isPrimitive(PRIM_INNERMOST_CONTEXT) ||
+      call->isPrimitive(PRIM_OUTER_CONTEXT)     ||
+      call->isPrimitive(PRIM_HOIST_TO_CONTEXT)  )
+    USR_FATAL_CONT(call,
+      "use of this feature requires compiling with --iterator-contexts");
+}
+
 static void
 checkBadAddrOf(CallExpr* call)
 {
@@ -633,8 +641,11 @@ checkBadAddrOf(CallExpr* call)
 static void
 checkCalls()
 {
-  forv_Vec(CallExpr, call, gCallExprs)
+  forv_Vec(CallExpr, call, gCallExprs) {
     checkBadAddrOf(call);
+    if (! fIteratorContexts)
+      checkIteratorContextPrimitives(call);
+  }
 }
 
 // This function checks that the passed type is an acceptable
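
The gating added to `checkCalls()` reports an error for each use of the new primitives when `--iterator-contexts` is off, and keeps scanning rather than stopping at the first one. A minimal model of that behavior, assuming primitives are represented by their string names and errors are simply collected into a vector (both simplifications; `checkCallsSketch` is a hypothetical name), could be:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the check: collect one error per gated primitive
// and keep going, similar in spirit to USR_FATAL_CONT.
std::vector<std::string> checkCallsSketch(const std::vector<std::string>& prims,
                                          bool iteratorContextsEnabled) {
  std::vector<std::string> errors;
  for (const std::string& p : prims) {
    const bool gated = p == "innermost context" ||
                       p == "outer context" ||
                       p == "hoist to context";
    if (!iteratorContextsEnabled && gated) {
      errors.push_back(p + ": requires compiling with --iterator-contexts");
    }
  }
  return errors;
}
```

With the flag enabled the function reports nothing, which matches the diff's choice to skip `checkIteratorContextPrimitives` entirely in that case.
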

diff --git a/compiler/passes/normalize.cpp b/compiler/passes/normalize.cpp
@@ -130,6 +130,40 @@ static TypeSymbol* expandTypeAlias(SymExpr* se);
 *                                                                             *
 ************************************** | *************************************/
 
+static void handleSharedCArrays() {
+  forv_expanding_Vec(CallExpr, call, gCallExprs)
+   if (call->isPrimitive(PRIM_HOIST_TO_CONTEXT))
+
+    // The particular definition we expect is a default-init c_array, which is:
+    //
+    //    unknown myArray;
+    //    unknown call_tmp;
+    //    call_tmp = c_array(t, k);
+    //    __primitive("default init var", myArray, call_tmp);
+
+    if (DefExpr* hoistDefExpr = toSymExpr(call->get(2))->symbol()->defPoint)
+     if (DefExpr* typeDefExpr = toDefExpr(hoistDefExpr->next))
+      if (CallExpr* typeAssign = toCallExpr(typeDefExpr->next))
+       if (typeAssign->isPrimitive(PRIM_MOVE))
+        if (CallExpr* typeCall = toCallExpr(typeAssign->get(2)))
+         if (CallExpr* initCall = toCallExpr(typeAssign->next))
+          if (initCall->isPrimitive(PRIM_DEFAULT_INIT_VAR))
+           if (SymExpr* typeConstructor = toSymExpr(typeCall->baseExpr))
+            if (typeConstructor->symbol()->hasFlag(FLAG_C_ARRAY))
+   // if all the above conditions succeeded, add a shared variant
+   {
+    SET_LINENO(hoistDefExpr);
+    auto newBlock = new BlockStmt();
+    auto newArr = new VarSymbol(astr("shared_", hoistDefExpr->sym->name));
+    newArr->qual = Qualifier::QUAL_REF;
+    newBlock->insertAtTail(new DefExpr(newArr));
+    newBlock->insertAtTail(new CallExpr(PRIM_MOVE, newArr,
+                new CallExpr("createSharedCArray", typeDefExpr->sym)));
+    initCall->insertAfter(newBlock);
+   }
+}
+
+
 void normalize() {
 
   insertModuleInit();
@@ -264,6 +298,9 @@ void normalize() {
     }
   }
 
+  if (fIteratorContexts)
+    handleSharedCArrays();
+
   find_printModuleInit_stuff();
 }
 

diff --git a/compiler/resolution/CMakeLists.txt b/compiler/resolution/CMakeLists.txt
@@ -35,6 +35,7 @@ set(SRCS
     loopDetails.cpp
     lowerForalls.cpp
     lowerIterators.cpp
+    lowerLoopContexts.cpp
     nilChecking.cpp
     postFold.cpp
     preFold.cpp

diff --git a/compiler/resolution/lowerIterators.cpp b/compiler/resolution/lowerIterators.cpp
@@ -27,6 +27,7 @@
 #include "ForallStmt.h"
 #include "ForLoop.h"
 #include "iterator.h"
+#include "lowerLoopContexts.h"
 #include "optimizations.h"
 #include "passes.h"
 #include "resolution.h"
@@ -3198,6 +3199,8 @@ void lowerIterators() {
 
   reconstructIRautoCopyAutoDestroy();
 
+  lowerContexts();
+
   cleanupTemporaryVectors();
   cleanupIteratorBreakToken();
   cleanupPrimIRFieldValByFormal();