Commit
Implement array view elision (#24390)
The main optimization this PR introduces is to avoid creating ArrayViews in assignments like:

```chpl
Arr1[...] = Arr2[...];
```

See my #24390 (comment) and Jeremiah's #24390 (comment) for some quick performance results.

Resolves #16133 (comment)

---

There are also other smaller optimizations in this PR. See below for more details.

## 1. Array View Elision (copy/paste from a comment in the compiler)

Array View Elision (AVE) aims to optimize assignments involving array views. Currently, this is limited to:

```
slice = slice, and
rank-change = rank-change
```

This is mostly out of an abundance of caution and could be extended to assignments involving arrays, too.

The gist of the implementation is based on eliding array views from operations such as the ones above. Note that this implies that the following cannot be covered:

```chpl
ref slice = A[1..5];
slice = A[6..10];
```

This is because determining whether `slice` can be dropped is more complicated than I could take on at the moment. So, both sides of the assignment must be array-view generating expressions for this optimization to fire.

There are two parts to this optimization:

### a. Pre-normalize (`ArrayViewElisionTransformer` is the type doing this)

Given a statement like

```chpl
A[x] = B[y];
```

we generate

```chpl
param array_view_elision: bool; // will be replaced during resolution
if (array_view_elision) {
  var protoSlice1 = chpl__createProtoSlice(A, x);
  var protoSlice2 = chpl__createProtoSlice(B, y);

  __primitive(PRIM_PROTO_SLICE_ASSIGN, protoSlice1, protoSlice2);
}
else {
  A[x] = B[y];
}
```

Here the `protoSlice*` temporaries have type `chpl__protoSlice`. See `modules/internal/ChapelArrayViewElision.chpl` for the details of that type. Its main purpose is to represent the expression that would create an array view, without actually creating the view.

### b. During prefold (`ArrayViewElisionPrefolder` is the type doing this)

The operation revolves around `PRIM_PROTO_SLICE_ASSIGN`. The `ArrayViewElisionPrefolder` is in charge of finding the other relevant AST (the `CondStmt`, the protoSlice temps, etc.) and transforming the conditional.

Statically, `chpl__ave_exprCanBeProtoSlice` is called on both protoSlices to make sure that the module code is OK with creating protoSlices out of those expressions. We also check whether the two protoSlices can be assigned to one another. This is done by `chpl__ave_protoSlicesSupportAssignment`.

If `fLocal`, that's sufficient. Calls to that function are inserted and resolved, the result is collected, and finally the calls are removed. At that point, we drop the `array_view_elision` flag completely and replace it with `true` or `false`, after which the conditional statement is constant-folded.

If not `fLocal`, we also call `chpl_bothLocal` and replace the flag with the result of that call. Note that this is a dynamic check, meaning that the conditional will not be removed.

This optimization is on by default. It can be controlled with `--[no-]array-view-elision`. Additionally, there is a `--report-array-view-elision` flag that enables some output during compilation to help with understanding what is optimized and what is not.

This took some more effort. The rest of the optimizations in this PR are:

## 2. Short array transfer optimization

For array transfers below some size threshold, we don't want memcpy- or forall-based solutions. For such scenarios, this PR adds a code path that falls back directly to `for`-based solutions. Currently, this code path is exercised only for `chpl__protoSlice`s, out of an abundance of caution.

## 3. Avoid creating domains in slice/transfer code

`chpl__protoSlice` stores a range or a tuple of ranges. It is crucial to avoid creating domains for bulk transfer. So, the module code for array transfer has a lot of adjustments to work with ranges. In other words, we should avoid creating domains (as well as array views) for

```chpl
Arr1[1..5, 6..10] = Arr2[1..5, 6..10];
```

## 4. Improve `c_ptr` creation performance

`chpl__protoSlice` stores a `c_ptr` to the base array. So, we should be able to create C pointers fast. To that end, this PR:

- moves `locale` checking in `boundsChecking`, and
- makes type checking for DR arrays in `c_ptrTo` et al. completely static

## TODO:

- [x] add a flag to control the optimization
- [x] get buy-in for `c_ptr` optimizations
- [x] update/improve the PR message
- [x] SAT shouldn't fire for GPU to GPU transfers, probably. Maybe disable it completely?

## Test status

- [x] linux64
- [x] gasnet
- [x] gpu (nvidia)

## Future work

- Check out the status on #16133 (comment)
- Also on: #24343 (comment)

[Reviewed by @benharsh (lead) and @riftEmber (C pointer changes)]
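
As a rough illustration of the cases described in section 1, the following minimal sketch shows two assignments that match the supported patterns (`slice = slice` and `rank-change = rank-change`) and one that does not, because its left-hand side is a named `ref` rather than an array-view generating expression. The array names and sizes are made up for the example; whether AVE actually fires for a given statement also depends on the static and dynamic checks described above.

```chpl
// Hypothetical arrays, used only for illustration.
var A, B: [1..10] int;
var M, N: [1..3, 1..3] int;

// Fill the source arrays with some recognizable values.
forall i in A.domain do A[i] = i;
forall (i, j) in N.domain do N[i, j] = 10*i + j;

// slice = slice: both sides are array-view generating expressions,
// so this statement is a candidate for array view elision.
B[1..5] = A[6..10];

// rank-change = rank-change: also a candidate.
M[1, ..] = N[2, ..];

// Not covered: the view is bound to a name first, so the left-hand side
// of the assignment is not itself an array-view generating expression.
ref slice = B[6..10];
slice = A[1..5];

writeln(B);
writeln(M);
```

Compiling this with `--report-array-view-elision` should help show which statements were considered, and comparing against a build with `--no-array-view-elision` is one way to check that the results are unchanged.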