New and Improved MapFusion #1629

Open: wants to merge 120 commits into base: main

@philip-paul-mueller philip-paul-mueller commented Aug 22, 2024

A new and improved version of the map fusion transformation.
The transformation is implemented in a class named MapFusionSerial. In addition, the MapFusionParallel transformation is added, which allows parallel maps to be fused together.
The new transformation analyses the graph more carefully when it checks if and how it should perform the fusion.
Special consideration was given to correcting the memlets.
However, there are still some aspects that should be improved and cases that it should be able to handle.

The new transformation produces graphs that are slightly different from before, and certain other transformations cannot handle the resulting SDFG. For that reason a compatibility flag, strict_dataflow, was introduced. By default this flag is disabled; the only place where it is activated is inside the auto optimization function.

Furthermore, the SDFGState._read_and_write_sets() function has been rewritten to handle the new SDFGs, because it contained some bugs. However, one bug has been kept, because other transformations would fail otherwise.
It is nevertheless a bug, and tests were written to demonstrate it.

Collection of known issues in other transformations:

@philip-paul-mueller philip-paul-mueller changed the title Started with a first version of the map fusion stuff. New and Improved MapFusion Aug 22, 2024
@philip-paul-mueller philip-paul-mueller marked this pull request as draft August 22, 2024 13:55
Now using the 3.9 type hints.
When the function was fixing the interior of the second map, it did not remove the reading.
It passes almost all functions.
However, the ones that need renaming are not yet done.
…t in the input and output set.

However, it is very simple.
Previously, it looked for the memlet of the consumer or producer.
However, one should actually only look at the memlets that are adjacent to the scope node.
At least this is how the original worked.

I noticed this because of the `buffer_tiling_test.py::test_basic()` test.
I was not yet focused on maps that were nested and not multidimensional.
It seems that the transformation has some problems there.
When it now checks for covering (i.e. whether the information to exchange is enough), it no longer descends into the maps, but only inspects the first outgoing/incoming edges of the map entry and exit.
I noticed that the other way was too restrictive, especially for map tiling.
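
The covering check mentioned above can be sketched roughly as follows. This is only an illustration, assuming rectangular subsets given as inclusive `(start, stop)` pairs per dimension; the helper `covers` and the example ranges are hypothetical and are not part of DaCe, which uses its own subset classes.

```python
from typing import Sequence, Tuple

Range = Tuple[int, int]  # inclusive (start, stop) of one dimension


def covers(written: Sequence[Range], read: Sequence[Range]) -> bool:
    """Return True if the `written` box fully contains the `read` box."""
    if len(written) != len(read):
        return False
    return all(w_lo <= r_lo and r_hi <= w_hi
               for (w_lo, w_hi), (r_lo, r_hi) in zip(written, read))


# Subset on the edge leaving the first map exit (what is produced) ...
produced = [(0, 9), (0, 19)]
# ... and subset on the edge entering the second map entry (what is consumed).
consumed = [(2, 7), (0, 19)]

print(covers(produced, consumed))  # the producer writes everything that is read
print(covers(consumed, produced))  # the reverse direction does not hold
```

Only the two edges adjacent to the scope nodes are compared here, which mirrors the idea of not descending into the map bodies.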
Otherwise we can end up in recursion.
Previously, it replaced the eliminated variables by zero.
This actually worked pretty well, but I have now changed it such that `offset()` is used.
I am not sure why I used `replace` in the first place, but I think that there was an issue.
However, I am not sure.
It essentially checks whether the intermediate that should be turned into a shared intermediate, i.e. a sink node of the map exit, is used in the dataflow downstream.
This is needed because some DaCe transformations do not correctly check for its existence; it even seems that it gets killed.
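
A downstream-usage check of this kind could look like the sketch below. It is a toy model, assuming the dataflow graph is a plain adjacency list and that access nodes are described by a mapping from node name to data name; `is_used_downstream` and all node names are hypothetical and do not correspond to the DaCe API.

```python
from collections import deque
from typing import Dict, List


def is_used_downstream(edges: Dict[str, List[str]],
                       access_data: Dict[str, str],
                       start: str,
                       data: str) -> bool:
    """Breadth-first search from `start`; True if a reachable access node refers to `data`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, []):
            if succ in seen:
                continue
            if access_data.get(succ) == data:
                return True
            seen.add(succ)
            queue.append(succ)
    return False


# Toy dataflow: second map exit -> access node of B -> tasklet -> access node of T.
edges = {"map_exit_2": ["B_node"], "B_node": ["tasklet"], "tasklet": ["T_node"]}
access_data = {"B_node": "B", "T_node": "T"}

print(is_used_downstream(edges, access_data, "map_exit_2", "T"))
print(is_used_downstream(edges, access_data, "map_exit_2", "X"))
```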
This test failed due to numerical instabilities.
It passed once I changed the arguments, which to me does not make sense, as I think if `a` is close to `b` then `b` should also be close to `a`.
So I changed the test to an absolute check.
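
The asymmetry is expected if the test used a NumPy-style closeness check; the commit message does not say which function was used, so that is an assumption here. The formula documented for `numpy.isclose` is `|a - b| <= atol + rtol * |b|`, so the tolerance scales with the second argument only. The sketch below reimplements that formula in plain Python (the function name `numpy_style_isclose` and the numbers are made up for illustration) and contrasts it with the symmetric `math.isclose` and with a purely absolute check.

```python
import math


def numpy_style_isclose(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    # Same formula as documented for numpy.isclose: the tolerance scales with |b| only.
    return abs(a - b) <= atol + rtol * abs(b)


a, b = 100.0, 110.0

# Relative tolerance measured against b = 110: 0.095 * 110 is larger than |a - b| = 10.
forward = numpy_style_isclose(a, b, rtol=0.095, atol=0.0)
# Relative tolerance measured against a = 100: 0.095 * 100 is smaller than 10.
backward = numpy_style_isclose(b, a, rtol=0.095, atol=0.0)
print(forward, backward)

# math.isclose scales with max(|a|, |b|) and is therefore symmetric.
print(math.isclose(a, b, rel_tol=0.095), math.isclose(b, a, rel_tol=0.095))

# An absolute check does not depend on the argument order either.
print(abs(a - b) <= 11.0, abs(b - a) <= 11.0)
```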
The reason is that everything that goes through the intermediates should not be a problem.
philip-paul-mueller added a commit to GridTools/gt4py that referenced this pull request Dec 4, 2024
The [initial version](#1594) of
the optimization pipeline only contained a rough draft.
Currently this PR contains a copy of the map fusion transformations from
DaCe that are currently under
[review](spcl/dace#1629). As soon as that PR is
merged and DaCe has been updated in GT4Py, these files will be deleted.

This PR collects some general improvements:
- [x] More liberal `LoopBlocking` transformation (with tests).
- [x] Incorporate `MapFusionParallel` 
- [x] Use `can_be_applied_to()` as soon as DaCe is updated
(`TrivialGPUMapElimination`, `SerialMapPromoter`).
- [x] Looking at strides that the Lowering generates. (Partly done)

However, it still uses the MapFusion implementation that ships with GT4Py
and not the one in DaCe.


Note:
Because of commit 60e4226 this PR must be merged after
[PR1768](#1768).
@phschaad phschaad left a comment

Thank you for the work on this transformation. MapFusion is in need of some improvements. There are a few general concerns I need to have addressed or at least discussed first in addition to the specific review comments though.

Complete Change

You advertise this in the name and description of the PR (the latter of which seems to be outdated, if I am not mistaken?), but this is essentially a completely re-written MapFusion. I have nothing against that, but this naturally worries me for a transformation this central, often used, and consequently battle tested.

There are some things, such as checking for no WCR outputs of the first map that serve as inputs to the second map, that I can no longer detect in the new can_be_applied method. If these are all correctly handled by the new transformation now, that is great. But I believe this drastic change at least needs a very concise and clear discussion in the PR to explain conceptually exactly what changed that now allows for different situations for the transformation to apply. Currently this is missing to me, and I was unable to piece this together from the doc comments alone.

Difference to OTF Map Fusion and SubgraphFusion

This new version of the transformation seems to me like it contains components of both OTF map fusion and SubgraphFusion (or CompositeFusion). I am not categorically against bringing transformations together to reduce the size of the transformation repository if there is one transformation that can take the job of 2. However, currently I don't believe this transformation is meant to replace either of the two. Can you elaborate not just on how this new MapFusion is different from the old one (see previous point), but also on how this is different from the other existing transformations in this category (OTFMapFusion and SubgraphFusion)?

Complexity

By seemingly being a much more capable fusion transformation, this can_be_applied method appears to perform a lot more checks in general, which is great and improves robustness if all previously identified issues are still covered. However, can_be_applied is also run a lot, particularly on large graphs. Can you elaborate on the performance difference between this and the old MapFusion transformation, or is there no measurable difference?

    def annotates_memlets():
        return False
    # Pattern Nodes
    map_exit_1 = transformation.transformation.PatternNode(nodes.MapExit)
Collaborator

It appears that there is no need for changing anything about the pattern nodes, except explicitly matching the pattern to MapExit as opposed to ExitNode. What's the reason for this change? In my opinion the first_map_exit / second_map_entry naming for instance seemed clearer, and this would completely remove the need for changing anything about how one manually sets up matches with this transformation - which I believe is done quite frequently in user code (i.e., code not inside the DaCe repo).

Collaborator Author

I agree the naming is better.
I also agree that a transformation as central as this one should not change its public interface unless it is really needed.

I used search+replace and I think I got all of them, but if you spot a stray one, please point it out.

        if scope[map_entry_1] is not None:
            return False

        # We will now check if there exists a remapping that of the map parameter
Collaborator

I think this is incomplete, or am I missing something?

Collaborator Author

I do not fully understand your question.
Are you referring to the checks that ensure that the maps are in the same scope, or do you mean that a particular kind of check is missing altogether?

Collaborator

I meant that the code comment reads as if something is missing or wrong, or maybe I'm reading it wrong :-)

        # are not resolved yet.
        access_sets: List[Dict[str, nodes.AccessNode]] = []
        for scope_node in [map_entry_1, map_exit_1, map_entry_2, map_exit_2]:
            access_set: Set[nodes.AccessNode] = self.get_access_set(scope_node, state)
Collaborator

This only seems to collect access nodes which are connected as outputs to the map exits and inputs to the map entries, right? If so, are map-internal writes / reads taken care of?

@philip-paul-mueller philip-paul-mueller Dec 13, 2024

At first I had trouble understanding your question, because I thought you were referring to the data that is used to exchange data between the maps, i.e. what self.array matches (which is not examined in this function, but in the partition function).
However, then I realized that you mean writes that are fully within the map, i.e. where the dataflow does not pass through the map entry/exit.
You are right, this is not checked: until now I had not realized that this might be an issue, since I assumed that all data has to pass through the maps.
However, I now realize that this is wrong (for example, I know that CompositeFusion may put intermediates fully inside a map).

To handle this case I have made the following additions:

  • Fusion is refused if both map bodies contain an access node that refers to the same data.
  • Fusion is refused if a map body contains an access node that refers to global data (this restriction could be lifted).
  • As an implementation detail, Views are ignored, because they either refer to data from the outside (has_read_write_dependency() takes care of this), or they refer to data that is inside the map (in which case it is handled).
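
The three rules above can be summarised in a small sketch. It assumes that each map body has already been reduced to the set of data names referenced by access nodes located strictly inside it; the function `refuses_fusion` and its parameters are hypothetical and only model the described logic, not the actual implementation in the PR.

```python
from typing import AbstractSet


def refuses_fusion(body_1_data: AbstractSet[str],
                   body_2_data: AbstractSet[str],
                   global_data: AbstractSet[str],
                   views: AbstractSet[str] = frozenset()) -> bool:
    """Return True if fusion should be refused because of map-internal access nodes."""
    # Views are ignored, as described in the third rule.
    inner_1 = set(body_1_data) - set(views)
    inner_2 = set(body_2_data) - set(views)
    # Rule 1: both bodies contain an access node referring to the same data.
    if inner_1 & inner_2:
        return True
    # Rule 2: a body contains an access node referring to global (non-transient) data.
    if (inner_1 | inner_2) & set(global_data):
        return True
    return False


print(refuses_fusion({"tmp"}, {"tmp"}, global_data={"A"}))       # same data in both bodies
print(refuses_fusion({"A"}, {"tmp2"}, global_data={"A"}))        # global data inside a body
print(refuses_fusion({"tmp1"}, {"tmp2"}, global_data={"A"}))     # nothing objectionable
print(refuses_fusion({"v"}, {"v"}, global_data={"A"}, views={"v"}))  # views are skipped
```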

As a side note, as far as I can tell the original implementation did not handle this case.

…ured.

Previously, the output edges were set to dynamic.
However, this was not accurate, as the output was always set; thus the new map fusion did not fuse them.
My first attempt was to just disable the `dynamic` property, but now the SDFG is generated manually.
It is almost the same, but uses fewer symbols, as it was simpler to implement it this way, and we are now using float.
For such edges we are sure that the data exists, so it is just a conditional read, which is fine.
Using `nodes()` on an SDFG will only give us the control flow regions, but using `state` will also give us the nested states.
I looked through my code and this seems to be the only places where they appear.
This fixes the correlation test, but the heat test still fails.
The issue was similar to the one before.
When I computed the name of the intermediate transient, I used `sdfg.node_id(state)` to get the state ID.
However, if the state is part of these recursive control flow regions, this may not work, because the state is not a direct node of the SDFG.
If I use `self.state_id` instead, it works; this is what the old MapFusion was doing.
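
Why a top-level lookup fails for nested states can be shown with a toy model. The classes `State` and `Region` and the method `all_states_recursive` below are hypothetical stand-ins written for this illustration; they are not the DaCe classes and make no claim about the real API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class State:
    label: str


@dataclass
class Region:
    label: str
    children: List[Union["Region", State]] = field(default_factory=list)

    def nodes(self) -> List[Union["Region", State]]:
        # Only the direct children, comparable to a top-level node query.
        return list(self.children)

    def all_states_recursive(self) -> Iterator[State]:
        # Descends into nested regions and yields every state.
        for child in self.children:
            if isinstance(child, State):
                yield child
            else:
                yield from child.all_states_recursive()


loop_body = State("loop_body")
sdfg = Region("sdfg", [State("init"), Region("loop", [loop_body])])

# The nested state is not a direct node, so an index lookup on the top level fails.
print(loop_body in sdfg.nodes())
try:
    sdfg.nodes().index(loop_body)
except ValueError:
    print("state is not a direct node of the top-level region")

print([s.label for s in sdfg.all_states_recursive()])
```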
This tests dynamic Memlets inside producers; the original transformation fails on it.
@philip-paul-mueller

Thanks for reviewing and the wall of text.

To give you some context:
Initially I started with JaCe (a JAX frontend for DaCe), which I applied to stencils from ICON4Py.
However, DaCe's auto_optimizer was unable to handle some of them: either the fusion was not performed, the resulting SDFG was invalid, or the computation was wrong.
I traced the problems down to MapFusion. At first I tried to fix the original implementation, but I had trouble understanding the code at all, so I started to rewrite the transformation.

The main issues I found (not limited to ICON4Py) were:

  • The subsets (not the .subset member of the Memlet; I mean the concept of where we write to it and from where we read) of the new intermediate array were not computed correctly.
  • The transformation did not distinguish between .subset and .other_subset of a Memlet and in most cases just accessed .subset, which might be wrong. In fact, my general impression is that a lot of code simply accesses .subset (which happens to be the right choice in most but not all cases) and does not care about the intrinsic direction of the Memlet.
  • The check whether an intermediate can be removed or must be recreated afterwards was wrong. For this the whole SDFG has to be scanned; there is no way around it, but it was not done.
  • The .dynamic property of the Memlets was fully ignored.
  • As a side note, the check for WCR is on line 427.
  • The code that propagates the change (removed intermediate) into the scope was wrong; again .subset was not handled correctly. (Although I have to say that the current code should also be improved, but just a little.)
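
The point about .subset versus .other_subset can be illustrated with a toy memlet. The sketch assumes the convention that `subset` always describes the array named in `data`, while `other_subset` describes the opposite endpoint; `ToyMemlet`, `src_subset` and `dst_subset` are simplified stand-ins written for this example and not the DaCe classes (the toy also ignores copies within a single array).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMemlet:
    data: str                           # array that `subset` refers to
    subset: str                         # range inside `data`
    other_subset: Optional[str] = None  # range inside the array at the other endpoint


def src_subset(memlet: ToyMemlet, src_data: str) -> Optional[str]:
    """Range that is read at the source, independent of how the memlet is written down."""
    return memlet.subset if memlet.data == src_data else memlet.other_subset


def dst_subset(memlet: ToyMemlet, dst_data: str) -> Optional[str]:
    """Range that is written at the destination."""
    return memlet.subset if memlet.data == dst_data else memlet.other_subset


# The same copy A[0:10] -> B[5:15], expressed from either endpoint.
from_source = ToyMemlet(data="A", subset="0:10", other_subset="5:15")
from_destination = ToyMemlet(data="B", subset="5:15", other_subset="0:10")

for memlet in (from_source, from_destination):
    # Blindly reading `.subset` gives different answers for the two spellings ...
    print(memlet.subset)
    # ... while the direction-aware helpers agree.
    print(src_subset(memlet, "A"), dst_subset(memlet, "B"))
```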

I want to point out that this PR adds a lot of tests for MapFusion (approximately 40% of the edits) and the previous version is not able to pass them; roughly 1/3 of them fail.

Regarding the description: I agree that the doc string of the class is not that good; however, the code itself is in my view better documented than before. I have also updated the description of the transformation to give a better high-level overview, which points to the functions that perform the tests.

I do not know OTFMapFusion and SubgraphFusion very well. However, I have seen that SubgraphFusion is much more general; for example, instead of reducing the intermediate it will move the intermediate data access inside the Map.
The only capability I know SubgraphFusion has is that it is able to handle Maps that are parallel.
This is a capability that my MapFusion currently lacks (it was originally included in the PR, but removed afterwards).

I think the best way to see my MapFusion is not as something new but as a new iteration of what was already there; it just performs more analysis, which allows it to handle more cases than before. However, there are still some open TODOs.

I have to admit that I have not performed any runtime measurements, but I do not have the impression that it takes much more time than before. The reason is that MapFusion is, besides two exceptions, a very local operation.
The first exception is the check whether an intermediate can be removed or not. However, this information is more or less static, so the transformation computes this set at the beginning and then caches it. The downside is that it is hard to tell when the cache should be renewed. However, the cache remains valid as long as no AccessNodes are added. I checked that the use in auto_optimizer is fine. Further, to avoid the scan altogether I added the assume_always_shared flag. This tells the transformation that every intermediate is shared. Thus no scan is ever needed; however, it will lead to dead dataflow.
The second exception is where we have to ensure that no cycles are created; however, this only explores the dataflow graph locally (everything downstream).
Furthermore, when I wrote the transformation I tried to order the checks in such a way that the ones that are either cheap or very likely to fail come first.
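
The caching scheme for the first exception might be sketched as follows. This is a simplified model: an SDFG is represented as a list of states, each being a list of data names (one entry per access node), and data counts as shared when more than one access node refers to it. The class `SharedDataCache`, the `sdfg_key` argument and the `scans` counter are hypothetical; only the flag name `assume_always_shared` is taken from the description above.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set


class SharedDataCache:
    def __init__(self, assume_always_shared: bool = False) -> None:
        self.assume_always_shared = assume_always_shared
        self.scans = 0  # number of full scans performed, for illustration only
        self._cache: Dict[Hashable, Set[str]] = {}

    def is_shared(self, data: str, sdfg_key: Hashable, states: List[List[str]]) -> bool:
        if self.assume_always_shared:
            # No scan at all; every intermediate is kept, which may leave dead dataflow.
            return True
        if sdfg_key not in self._cache:
            # One full scan per SDFG; the result stays valid as long as
            # no access nodes are added afterwards.
            self.scans += 1
            counts = Counter(name for state in states for name in state)
            self._cache[sdfg_key] = {name for name, n in counts.items() if n > 1}
        return data in self._cache[sdfg_key]


states = [["A", "T"], ["T", "B"]]

cache = SharedDataCache()
print(cache.is_shared("T", "sdfg0", states))  # T appears in two access nodes
print(cache.is_shared("A", "sdfg0", states))  # answered from the cache
print(cache.scans)

always = SharedDataCache(assume_always_shared=True)
print(always.is_shared("A", "sdfg0", states), always.scans)
```

The trade-off is the one described above: the cached set avoids repeated whole-SDFG scans but can become stale, while the flag avoids the scan entirely at the cost of keeping intermediates that may not be needed.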
