Iterator contexts prototype (chapel-lang#24488)
This PR is the initial step towards implementing features
primarily designed and discussed in:

- chapel-lang#16405
- Cray/chapel-private#4349
- Cray/chapel-private#5216
- and the aspirations of chapel-lang#9529 and chapel-lang#6334

It also enables nightly GPU testing of the newly added `test/users/engin/context` directory.

### ADDED FEATURES -- these are not user-facing at all

There are two umbrella features introduced here:

- querying information such as the number of tasks and the current
  task's ID from inside the body of a `forall` loop

- hoisting variable declarations from the loop body into upper contexts
  that typically represent `coforall`s and `foreach`es in the iterator
  implementation. This allows, for example, locale-private variables
  to be declared inside the loop body.

These features are enabled when compiling with `--iterator-contexts`,
which is off by default. The flag `--report-context-adjustments` enables
debugging printouts for compiler developers.

First, the iterators to be used in parallel loops are augmented with "handles."
These are instances of the `Context` type defined in
    test/users/engin/context/ChapelContextSupport.chpl

For example:
    var onCtx = new Context(rank=rank, taskId=locDomIdx, numTasks=locDoms.domain.shape);

The `taskId` indicates the position of the current task within
the rank-dimensional space of the shape `numTasks`.

The innermost foreach loop serves as a handle automatically.
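To make the `taskId`/`numTasks` relationship concrete, here is a small C++ sketch of a rank-N handle. `ContextModel`, `totalTasks()`, and `linearTaskId()` are hypothetical names introduced only for illustration, and the row-major linearization is an assumption; the actual `Context` type lives in `ChapelContextSupport.chpl` and may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical C++ model of a rank-N iterator-context handle.
// It mirrors the `taskId` and `numTasks` values passed to
// `new Context(...)`; it is not the Chapel type itself.
template <std::size_t Rank>
struct ContextModel {
  std::array<std::int64_t, Rank> taskId;    // this task's coordinates
  std::array<std::int64_t, Rank> numTasks;  // shape of the task space

  // Total number of tasks in the rank-dimensional task space.
  std::int64_t totalTasks() const {
    std::int64_t n = 1;
    for (std::int64_t extent : numTasks) n *= extent;
    return n;
  }

  // Row-major linearization of taskId (an assumption made for
  // illustration; the PR does not define such a helper).
  std::int64_t linearTaskId() const {
    std::int64_t id = 0;
    for (std::size_t d = 0; d < Rank; ++d) {
      assert(taskId[d] >= 0 && taskId[d] < numTasks[d]);
      id = id * numTasks[d] + taskId[d];
    }
    return id;
  }
};
```

For example, `ContextModel<2>{{1, 2}, {2, 3}}` describes the task at coordinates (1, 2) in a 2 x 3 task space: `totalTasks()` returns 6 and `linearTaskId()` returns 5.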

Second, to use the above in a forall loop, the loop body is augmented
with something like this:

    const context = new Context();
    const vectorContext    = __primitive("outer context", ctx1, context);
    const localTaskContext = __primitive("outer context", ctx1, vectorContext);
    const localeContext    = __primitive("outer context", ctx2, localTaskContext);
    // where
    type ctx1 = Context(1, int(64));  // 1-d space of tasks
    type ctx2 = Context(2, (int(64), int(64)));  // 2-d space of tasks

These variables will be mapped to the iterator(s)' handles, starting with
the (dynamically) innermost handle. This mapping enables querying
of the current task's position, e.g., `localeContext.taskId`.
Currently it is the user's responsibility to ensure that the type of each
variable matches the type of the corresponding handle.
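The mapping from context variables to handles can be pictured as indexing into a list of handles ordered from the innermost outward. The C++ sketch below is a hypothetical model (`Handle` and `mapContext` are illustrative names, not compiler APIs). Unlike the prototype, where matching types is left to the user, it rejects a rank mismatch explicitly to make that requirement visible.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model: one handle per enclosing loop level of the
// (inlined) iterator, listed from the innermost outward.
struct Handle {
  std::string level;  // e.g. "foreach", "task coforall", "locale coforall"
  int rank;           // dimensionality of that level's task space
};

// Maps the k-th variable of a context chain to a handle. k == 0 is the
// innermost variable; each later one is obtained by applying
// `outer context` to the previous one. Returns std::nullopt when the
// chain is longer than the list of handles or when the declared rank
// does not match the handle's rank. (In the prototype this rank/type
// check is the user's job, not something the compiler does like this.)
std::optional<Handle> mapContext(const std::vector<Handle>& innermostFirst,
                                 std::size_t k, int declaredRank) {
  assert(declaredRank > 0);
  if (k >= innermostFirst.size()) return std::nullopt;
  const Handle& h = innermostFirst[k];
  if (h.rank != declaredRank) return std::nullopt;
  return h;
}
```

A chain that walks past the outermost handle, or declares the wrong rank for a level, yields no mapping in this model.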

Finally, to hoist a variable in the loop body to the corresponding context,
declare it using split-init as follows:

    var localTile;
    {
      const ref locSubDom = Dom.localSubdomain();
      localTile = Input[{locSubDom.dim(1), locSubDom.dim(0)}];
    }
    __primitive("hoist to context", localeContext, localTile);

The contents of the block will be hoisted together with the variable
being declared, here `localTile`, to the context associated with the variable
in the primitive, here `localeContext`. Array, c_array, and barrier variables
are currently supported.
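A rough way to see what hoisting buys: the hypothetical C++ model below uses a three-level loop nest as a stand-in for a distributed iterator's locale-level `coforall`, task-level `coforall`, and inner `foreach`, and counts how often a tile buffer is created when its declaration stays in the loop body versus when it is hoisted to the locale level. It is purely sequential and does not model Chapel semantics; `AllocCounts` and `countTileAllocations` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// How many times a "localTile"-like buffer gets created.
struct AllocCounts {
  std::int64_t perIteration;  // declaration left inside the loop body
  std::int64_t hoisted;       // declaration hoisted to the locale level
};

// Sequential stand-in for locale / task / vector-lane loop levels.
AllocCounts countTileAllocations(int numLocales, int tasksPerLocale,
                                 int itersPerTask, std::size_t tileSize) {
  assert(tileSize > 0);
  AllocCounts c{0, 0};
  for (int loc = 0; loc < numLocales; ++loc) {
    // Hoisted: created once per locale, visible to every task below.
    std::vector<double> hoistedTile(tileSize);
    ++c.hoisted;
    for (int t = 0; t < tasksPerLocale; ++t) {
      for (int i = 0; i < itersPerTask; ++i) {
        // Not hoisted: created (and destroyed) on every iteration.
        std::vector<double> bodyTile(tileSize);
        ++c.perIteration;
        (void)bodyTile;
      }
    }
    (void)hoistedTile;
  }
  return c;
}
```

With 4 locales, 8 tasks per locale, and 100 iterations per task, the body-local declaration creates 3200 buffers, while the hoisted one creates 4, one per locale.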

See the code in `test/users/engin/context` for examples.

### IMPLEMENTATION OVERVIEW

The core of the implementation is in a new file, `lowerLoopContexts.cpp`, and
runs at the end of iterator lowering (`lowerContexts()`, called from
`lowerIterators()`). It is built around three new primitives,
PRIM_INNERMOST_CONTEXT, PRIM_OUTER_CONTEXT, and PRIM_HOIST_TO_CONTEXT, and a
special `Context` type. Users can use `Context` variables to "find" outer
contexts and then hoist variable declarations into those contexts. Currently
arrays, barriers, and c_arrays can be moved around in this manner.

The current design requires iterators to be mildly modified to make use
of these features. This PR uses its own iterators, copied nearly verbatim
from DefaultRectangular (DR) and the Block distribution except for the few
lines added in support of this PR.

`test/users/engin/context` contains those iterators, support code that will
eventually become an internal module, and some tests that demonstrate how
these features can be used.

The `.good` files in `test/users/engin/context` account for the tests behaving
differently under the flat and gpu locale models.
For these tests, our start_test framework chooses:

* `testname.comm-none.lm-gpu.good` for comm==none and lm==gpu (obviously)
* `testname.comm-none.good` for comm==none and lm!=gpu
* `testname.good` for comm!=none, whether lm==gpu or not

The tests behave differently for comm==none and lm==flat because only in this
configuration does the compiler remove `on`-statements early, and the
implementation has not yet been adjusted to handle this properly. This is a todo item.
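The `.good`-file selection listed above can be summarized as a small decision function. This C++ sketch assumes that all three files exist for a given test and only encodes the three rules stated here; `chooseGoodFile` is a hypothetical name, and this is not start_test's actual lookup code.

```cpp
#include <cassert>
#include <string>

// Encodes the three .good-file rules for test/users/engin/context.
std::string chooseGoodFile(const std::string& testName,
                           const std::string& comm,
                           const std::string& localeModel) {
  assert(!testName.empty());
  if (comm == "none") {
    if (localeModel == "gpu")
      return testName + ".comm-none.lm-gpu.good";  // comm==none, lm==gpu
    return testName + ".comm-none.good";           // comm==none, lm!=gpu
  }
  return testName + ".good";  // comm!=none, any locale model
}
```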

While there:
* Tidy up CHPL_NIGHTLY_TEST_DIRS in GPU-related scripts in util/cron.
* Remove the non-portable sed option `-i`
  from test/gpu/native/noGpu/basicMem.prediff.

### NEXT STEPS

We would like to implement the user-facing syntax proposed in
    Cray/chapel-private#5216
to facilitate writing more codes using these features
and help with the final user-facing design.

Implementation-wise, the immediate next steps are:
* Enable `test/users/engin/context/transpose.chpl` for GPUs.
* Implement detection of handles properly when comm==none and lm==flat, i.e.,
  when `on`-statements are removed from the AST early and so multiple handles
  can end up in a single block.
* Determine whether `_ddata_allocate_noinit_gpu_shared()`, newly added to ChapelBase.chpl, is needed.
* Resolve the compiler crash with --verify and lm==gpu observable in
  `test/gpu/native/distArray/blockUseInFunction.chpl`, a few tests under
  release/examples, etc.
* Improve the prototype syntax; perhaps switch to block-based syntax.
* Revisit how barriers should be hoisted w.r.t. automatic adjustment
  of the number of tasks.

Some steps for productization:
* Implement hoisting as part of lowerForallStmtsInline().
* Add the creation of `Context` handles into our standard iterators,
  including DefaultRectangular, BlockDist, etc.
* ... and ensure they are removed when unused.

The branch has been developed by @e-kayrakli, @DanilaFe and @vasslitvinov.
Earlier dev history: 98f9c70..eccaf0d and b09bbfc..5250416.
Reviewed by: @e-kayrakli. Merged by: @vasslitvinov.
vasslitvinov authored Feb 24, 2024
2 parents 2f4febd + 73171d1 commit 74482f5
Showing 80 changed files with 2,055 additions and 19 deletions.
9 changes: 9 additions & 0 deletions compiler/AST/primitive.cpp
@@ -175,6 +175,11 @@ returnInfoFirst(CallExpr* call) {
return call->get(1)->qualType();
}

static QualifiedType
returnInfoFirstAsValue(CallExpr* call) {
return QualifiedType(Qualifier::QUAL_CONST_VAL, call->get(1)->qualType().type());
}

static QualifiedType
returnInfoFirstDeref(CallExpr* call) {
QualifiedType tmp = call->get(1)->qualType();
@@ -711,6 +716,10 @@ initPrimitive() {
// use for any primitives not in this list
primitives[PRIM_UNKNOWN] = NULL;

prim_def(PRIM_INNERMOST_CONTEXT, "innermost context", returnInfoFirstAsValue);
prim_def(PRIM_OUTER_CONTEXT, "outer context", returnInfoFirst);
prim_def(PRIM_HOIST_TO_CONTEXT, "hoist to context", returnInfoVoid);

prim_def(PRIM_ACTUALS_LIST, "actuals list", returnInfoVoid);
prim_def(PRIM_NOOP, "noop", returnInfoVoid);
// dst, src. PRIM_MOVE can set a reference.
2 changes: 2 additions & 0 deletions compiler/include/driver.h
@@ -253,6 +253,8 @@ extern bool fReportOptimizedOn;
extern bool fReportPromotion;
extern bool fReportScalarReplace;
extern bool fReportGpu;
extern bool fIteratorContexts;
extern bool fReportContextAdj;
extern bool fReportDeadBlocks;
extern bool fReportDeadModules;
extern bool fReportGpuTransformTime;
26 changes: 26 additions & 0 deletions compiler/include/lowerLoopContexts.h
@@ -0,0 +1,26 @@
/*
* Copyright 2020-2024 Hewlett Packard Enterprise Development LP
* Copyright 2004-2019 Cray Inc.
* Other additional copyright holders may be indicated within.
*
* The entirety of this work is licensed under the Apache License,
* Version 2.0 (the "License"); you may not use this file except
* in compliance with the License.
*
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef _LOWER_LOOP_CONTEXTS_H_
#define _LOWER_LOOP_CONTEXTS_H_

void lowerContexts();

#endif
4 changes: 4 additions & 0 deletions compiler/main/driver.cpp
@@ -295,6 +295,8 @@ bool fReportOptimizeForallUnordered = false;
bool fReportPromotion = false;
bool fReportScalarReplace = false;
bool fReportGpu = false;
bool fIteratorContexts = false;
bool fReportContextAdj = false;
bool fReportDeadBlocks = false;
bool fReportDeadModules = false;
bool fReportGpuTransformTime = false;
@@ -1470,6 +1472,8 @@ static ArgumentDescription arg_desc[] = {
{"report-promotion", ' ', NULL, "Print information about scalar promotion", "F", &fReportPromotion, NULL, NULL},
{"report-scalar-replace", ' ', NULL, "Print scalar replacement stats", "F", &fReportScalarReplace, NULL, NULL},
{"report-gpu", ' ', NULL, "Print information about what loops are and are not GPU eligible", "F", &fReportGpu, NULL, NULL},
{"iterator-contexts", ' ', NULL, "Handle iterator contexts", "F", &fIteratorContexts, NULL, NULL},
{"report-context-adjustments", ' ', NULL, "Print debugging information while handling iterator contexts", "F", &fReportContextAdj, NULL, NULL},

{"", ' ', NULL, "Developer Flags -- Miscellaneous", NULL, NULL, NULL, NULL},
{"allow-noinit-array-not-pod", ' ', NULL, "Allow noinit for arrays of records", "N", &fAllowNoinitArrayNotPod, "CHPL_BREAK_ON_CODEGEN", NULL},
13 changes: 9 additions & 4 deletions compiler/optimizations/gpuTransforms.cpp
@@ -616,6 +616,7 @@ GpuizableLoop::GpuizableLoop(BlockStmt *blk) {
INT_ASSERT(blk->getFunction());

this->loop_ = toCForLoop(blk);

this->parentFn_ = toFnSymbol(blk->getFunction());
this->assertionReporter_.noteGpuizableAssertion(findCompileTimeGpuAssertions());
this->isEligible_ = evaluateLoop();
@@ -1024,6 +1025,7 @@ class GpuKernel {
static bool isCallToPrimitiveWeShouldNotCopyIntoKernel(CallExpr *call);
void populateBody(FnSymbol *outlinedFunction);
void normalizeOutlinedFunction();
void setLateGpuizationFailure(bool flag);
void finalize();

void generateIndexComputation();
@@ -1306,7 +1308,7 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
addKernelArgument(sym);
}
else {
INT_FATAL("Malformed PRIM_GET_MEMBER_*");
this->setLateGpuizationFailure(true);
}
}
else if (parent->isPrimitive()) {
@@ -1322,15 +1324,15 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
}
}
else {
INT_FATAL("Unexpected call expression");
this->setLateGpuizationFailure(true);
}
} else if (CondStmt* cond = toCondStmt(symExpr->parentExpr)) {
// Parent is a conditional statement.
if (symExpr == cond->condExpr) {
addKernelArgument(sym);
}
} else {
INT_FATAL("Unexpected symbol expression");
this->setLateGpuizationFailure(true);
}
}
}
@@ -1344,6 +1346,9 @@ void GpuKernel::populateBody(FnSymbol *outlinedFunction) {
update_symbols(outlinedFunction->body, &copyMap_);
}

void GpuKernel::setLateGpuizationFailure(bool flag) {
this->lateGpuizationFailure_ = flag;
}

void GpuKernel::normalizeOutlinedFunction() {
normalize(fn_);
@@ -1355,7 +1360,7 @@ void GpuKernel::normalizeOutlinedFunction() {
collectDefExprs(fn_, defExprsInBody);
for_vector (DefExpr, def, defExprsInBody) {
if(def->sym->type == dtUnknown) {
this->lateGpuizationFailure_ = true;
this->setLateGpuizationFailure(true);
}
}

13 changes: 12 additions & 1 deletion compiler/passes/checkResolved.cpp
@@ -585,6 +585,14 @@ checkReturnPaths(FnSymbol* fn) {
}
}

static void checkIteratorContextPrimitives(CallExpr* call) {
if (call->isPrimitive(PRIM_INNERMOST_CONTEXT) ||
call->isPrimitive(PRIM_OUTER_CONTEXT) ||
call->isPrimitive(PRIM_HOIST_TO_CONTEXT) )
USR_FATAL_CONT(call,
"use of this feature requires compiling with --iterator-contexts");
}

static void
checkBadAddrOf(CallExpr* call)
{
@@ -633,8 +641,11 @@ checkBadAddrOf(CallExpr* call)
static void
checkCalls()
{
forv_Vec(CallExpr, call, gCallExprs)
forv_Vec(CallExpr, call, gCallExprs) {
checkBadAddrOf(call);
if (! fIteratorContexts)
checkIteratorContextPrimitives(call);
}
}

// This function checks that the passed type is an acceptable
37 changes: 37 additions & 0 deletions compiler/passes/normalize.cpp
@@ -130,6 +130,40 @@ static TypeSymbol* expandTypeAlias(SymExpr* se);
* *
************************************** | *************************************/

static void handleSharedCArrays() {
forv_expanding_Vec(CallExpr, call, gCallExprs)
if (call->isPrimitive(PRIM_HOIST_TO_CONTEXT))

// The particular definition we expect is a default-init c_array, which is:
//
// unknown myArray;
// unknown call_tmp;
// call_tmp = c_array(t, k);
// __primitive("default init var", myArray, call_tmp);

if (DefExpr* hoistDefExpr = toSymExpr(call->get(2))->symbol()->defPoint)
if (DefExpr* typeDefExpr = toDefExpr(hoistDefExpr->next))
if (CallExpr* typeAssign = toCallExpr(typeDefExpr->next))
if (typeAssign->isPrimitive(PRIM_MOVE))
if (CallExpr* typeCall = toCallExpr(typeAssign->get(2)))
if (CallExpr* initCall = toCallExpr(typeAssign->next))
if (initCall->isPrimitive(PRIM_DEFAULT_INIT_VAR))
if (SymExpr* typeConstructor = toSymExpr(typeCall->baseExpr))
if (typeConstructor->symbol()->hasFlag(FLAG_C_ARRAY))
// if all the above conditions succeeded, add a shared variant
{
SET_LINENO(hoistDefExpr);
auto newBlock = new BlockStmt();
auto newArr = new VarSymbol(astr("shared_", hoistDefExpr->sym->name));
newArr->qual = Qualifier::QUAL_REF;
newBlock->insertAtTail(new DefExpr(newArr));
newBlock->insertAtTail(new CallExpr(PRIM_MOVE, newArr,
new CallExpr("createSharedCArray", typeDefExpr->sym)));
initCall->insertAfter(newBlock);
}
}


void normalize() {

insertModuleInit();
@@ -264,6 +298,9 @@ void normalize() {
}
}

if (fIteratorContexts)
handleSharedCArrays();

find_printModuleInit_stuff();
}

1 change: 1 addition & 0 deletions compiler/resolution/CMakeLists.txt
@@ -35,6 +35,7 @@ set(SRCS
loopDetails.cpp
lowerForalls.cpp
lowerIterators.cpp
lowerLoopContexts.cpp
nilChecking.cpp
postFold.cpp
preFold.cpp
3 changes: 3 additions & 0 deletions compiler/resolution/lowerIterators.cpp
@@ -27,6 +27,7 @@
#include "ForallStmt.h"
#include "ForLoop.h"
#include "iterator.h"
#include "lowerLoopContexts.h"
#include "optimizations.h"
#include "passes.h"
#include "resolution.h"
@@ -3198,6 +3199,8 @@ void lowerIterators() {

reconstructIRautoCopyAutoDestroy();

lowerContexts();

cleanupTemporaryVectors();
cleanupIteratorBreakToken();
cleanupPrimIRFieldValByFormal();