Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement MastForest merging #1534

Merged
merged 45 commits into from
Oct 28, 2024
Merged

Implement MastForest merging #1534

merged 45 commits into from
Oct 28, 2024

Conversation

PhilippGackstatter
Copy link

@PhilippGackstatter PhilippGackstatter commented Oct 15, 2024

Describe your changes

Closes #1529

Implements MastForest merging.

The implementation is explained in comments and it's hopefully sufficient to understand the PR, which is why I'm not adding the details here again. If any explanation is insufficient I'm happy to expand on it!

Couple of notes and ideas:

  • The MastForestDfsIter could be 1) made public and 2) could be made more general to allow for in-order, pre-order as well as post-order iteration of MastForests. Let me know if any of these are desirable. For now it seems best to keep them private.
  • The MastForestMerger is also kept private because the useful functionality can be exposed through the methods added on MastForest directly and there is no benefit in terms of functionality by making it public.
  • Moved EqHash into miden-core.
    • One annoyance is that this type needs to be public in miden-core so that it can be used in miden-assembly by MastForestBuilder where it originally lived. Moving MastForestBuilder into core doesn't seem fitting since it is used only for assembly and specially made for that. Overall it's not bad to have EqHash public in core.
    • Because that type needs to become public, I would propose to rename it to something a more easily understandable, like MastNodeFingerprint and perhaps introduce pub type DecoratorFingerprint = Blake3Digest<32> for decorators. I think the "fingerprint" nicely describes that this type can be used to compare mast nodes although I'm of course open to other suggestions.
    • If this type becomes public we should document its public functions (added TODO).
    • This topic could overall be handled in a quick follow-up PR if desired.
  • I would like to test the implementation with some more complex "real-world" input, but this is difficult in miden-core without an Assembler. I don't want to move those test into miden-assembly though as that separates the implementation and the tests crate-wise. Even with an assembler, it would still be somewhat challenging to come up with the expected merged forest to assert against without basically handwriting it. Although I suppose asserting a couple of properties might be sufficient instead of the entire forest, so it would still be an improvement. One possibility I see to add test in miden-core would be to generate snapshots in miden-assembly, for example, and add them as plaintext serialized MastForests to miden-core. Snapshot testing could be done with something like insta or without, I suppose. This would require adding #[cfg_attr(test, derive(Serialize, Deserialize))] to MastForest and all its descendants which is fine I guess. I'm happy to receive some general input on the testing story for this task though.

Checklist before requesting a review

  • Repo forked and branch created from next according to naming convention.
  • Commit messages and codestyle follow conventions.
  • Relevant issues are linked in the PR description.
  • Tests added for new functionality.
  • Documentation/comments updated according to changes.

@PhilippGackstatter PhilippGackstatter linked an issue Oct 16, 2024 that may be closed by this pull request
@PhilippGackstatter PhilippGackstatter marked this pull request as ready for review October 16, 2024 15:48
Copy link
Contributor

@plafer plafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - really love the docstrings!

Not a full review yet.

core/src/mast/merger.rs Outdated Show resolved Hide resolved
core/src/mast/dfs_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/dfs_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/merger.rs Outdated Show resolved Hide resolved
/// and all children have no decorator roots (meaning that there are no decorators in all the
/// descendants).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct EqHash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you suggested, I think MastNodeFingerprint is a better name!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a follow-up PR once this PR is somewhat finalized 👍

core/src/mast/merger.rs Outdated Show resolved Hide resolved
core/src/mast/merger.rs Outdated Show resolved Hide resolved
core/src/mast/merger.rs Outdated Show resolved Hide resolved
@bobbinth
Copy link
Contributor

I would like to test the implementation with some more complex "real-world" input, but this is difficult in miden-core without an Assembler. I don't want to move those test into miden-assembly though as that separates the implementation and the tests crate-wise.

I actually think it may be fine to have some tests for merging forests in the miden-assembly crate. It would simplify things a bit and there is not too much downside.

@PhilippGackstatter
Copy link
Author

I updated the implementation to use the desired merging behavior for external nodes with decorators present. The refactor to support this turned out to be fairly complex but I think it's alright now. I still need to add documentation in many places, handle some errors and probably some other small things, but conceptually I think it is in a reviewable state now again.

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good! I left some comments inline. The main one is related to handling of external nodes.

Additionally, I think we still need to return the maps with ID rempappints from the merge() method. Maybe the interface could look something like this:

pub fn merge(
    forests: impl IntoIterator<Item = &MastForest>
) -> Result<(MastForest, Vec<MastRootMap>), MastForestError> {
    ...
}

Where MastRootMap (other name suggestions welcome) would provide a way to map a root node ID from the input forests to a node ID in the output forest.

core/src/mast/mod.rs Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/dfs_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@plafer plafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more nits

core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
@PhilippGackstatter
Copy link
Author

PhilippGackstatter commented Oct 23, 2024

I implemented the refactoring now, but there are still some things left to clean up for sure. I added an assertion to most tests now that assert that the "child id < parent id" property holds, so that's nice 🎉.
If you have feedback on the approach now, I'd be happy to hear it and the other nits I will address tomorrow.

Also, sorry for the huge code changes, hopefully that was the last one.

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! Not a deep review yet, but I did go though the high level logic and left a few minor comments.

Also, I think some comments from the previous review still need to be addressed (e.g., #1534 (comment)).

core/src/mast/multi_forest_node_iterator.rs Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/multi_forest_node_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/multi_forest_node_iterator.rs Show resolved Hide resolved
core/src/mast/multi_forest_node_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Show resolved Hide resolved
@PhilippGackstatter
Copy link
Author

After adding a test that used the MastForest from the Miden StdLib I encountered a small bug in the iterator logic, but that is fixed now. The PR is now ready for a full review again.

///
/// Merging two forests means combining all their constituent parts, i.e. [`MastNode`]s,
/// [`Decorator`]s and roots. During this process, any duplicate or
/// unreachable nodes are removed. Additionally, [`MastNodeId`]s of nodes as well as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought this up before when @plafer looked into implementing a type of dead code elimination when building forests, and want to make sure we haven't unintentionally reintroduced the same issue:

How do you define whether a node is unreachable? With the presence of DYN, the only way to say for sure that a node is unreachable is one of two options:

  • The node is not a procedure root (regardless of visibility, i.e. a procedure does not have to be exported to be dynamically invoked), and it is not reachable by any other node. A very unlikely scenario, but nevertheless it is an edge case.
  • The node is a procedure root with non-public visibility (i.e. not exported), and it's address (hash) is never captured via procref anywhere in its translation unit.

Procedure roots with public visibility are of course part of the public API of a library, so cannot ever be considered unreachable.

So as long as we are only removing nodes classified as unreachable by the above rules, that should be fine. However, AFAIK we haven't yet implemented the capturing of procref uses, so that leaves a very narrow set of nodes we can safely prune.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up!

I think it's easier to describe what is visited than what is not, due to how it is implemented:
The iterator that implements the DFS traversal iterates through all MastForest::procedure_roots(), and from each of those discovers the tree of that node (this is the code). Hence all nodes reachable from procedure roots are visited. Other nodes are not visited. So since all nodes that are procedure roots and reachable from those nodes are visited and the others aren't I think your first point is covered, except maybe for the visibility part. One question I have is: Are the roots returned by MastForest::procedure_roots() only the public ones, or also the private ones? I'm guessing they're the public ones and there may be nodes in the forest that are procedures but aren't specified in MastForest::roots.

For the second point, what does procref translate to in the forest? An external node, but possibly one that is not in MastForest::roots? If so, then the node discovery would have to be changed.

To address everything at once, would the safest option perhaps be to iterate not over the MastForest::procedure_roots but over all nodes instead? We might copy dead nodes but for sure we wouldn't be removing nodes that are actually needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So since all nodes that are procedure roots and reachable from those nodes are visited and the others aren't I think your first point is covered

Yep! That sounds exactly right (of course, as I noted, it would be exceedingly rare for any nodes of a forest to be unreachable by this rule, since we construct the MastForest the same way, so such code would have been pruned when constructing the forest). That said, I think it is a good idea to mention this as you have in the docs.

One question I have is: Are the roots returned by MastForest::procedure_roots() only the public ones, or also the private ones? I'm guessing they're the public ones and there may be nodes in the forest that are procedures but aren't specified in MastForest::roots.

/cc @plafer

I believe that originally produced all of them, but I think that may have changed, and they are now only exported procedures, which is why I bring this up. Internally, we're supposed to be able to identify whether a given node is a non-exported procedure root though, so we should be able to recognize them when visiting the forest(s).

For the second point, what does procref translate to in the forest? An external node, but possibly one that is not in MastForest::roots? If so, then the node discovery would have to be changed.

It gets rewritten to push.HASH, where HASH is the procedure root of the procedure it references. Later, the procedure would be called via DYN, with the procedure root on the stack (though maybe this is changing to be a memory address, rather than the procedure hash, but what I've described is how it has historically worked).

To address everything at once, would the safest option perhaps be to iterate not over the MastForest::procedure_roots but over all nodes instead? We might copy dead nodes but for sure we wouldn't be removing nodes that are actually needed.

I think the key is to ensure that you visit all roots regardless of whether they are part of the set of exported roots. IIRC that information is inside the MastForest (it must be, in order to implement DYN). As an aside, we should definitely ensure that a MastForest can be queried via its public API for all procedure roots, not just exports, though we don't necessarily need to provide names for non-exported procedures like we do exports.

Dead code elimination will come later, when we introduce the analysis that we can use to implement that transformation, so I wouldn't worry about trying to do that here (it is better if we don't actually).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, we should definitely ensure that a MastForest can be queried via its public API for all procedure roots, not just exports, though we don't necessarily need to provide names for non-exported procedures like we do exports.

Thinking on this further, I take back my point about not providing names, I think we also want to preserve names for non-exported procedures as well, so that we can pretty print stack traces in a debugger. Without that information, at best we'll only have source locations (file/line), but no names in a stack trace (unless it can be recovered by other means, but that shouldn't be relied upon IMO).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the roots returned by MastForest::procedure_roots() only the public ones, or also the private ones?

The procedure_roots contains public and private procedures (see here) - we should definitely document this. So then the strategy is valid.

I brought this up before when @plafer looked into implementing a type of dead code elimination when building forests, and want to make sure we haven't unintentionally reintroduced the same issue:

You're right, and I keep forgetting about this issue, as a symptom that it's still not clear in my mind exactly what our strategy for DYN and dead code elimination is. For context, DYN is a bit tricky because technically we can call any value (even if not constructed with procref), and so if we're following the semantics to the letter, we should never be able to prune any node from a MastForest, because technically, it could be called with DYN (which the assembler has no way of knowing).

With the presence of DYN, the only way to say for sure that a node is unreachable is one of two options:

This is probably the most reasonable contract: if you never procref a procedure, we're allowed to remove it from the MastForest. Is this what we want to go for? If so, we should clearly document it - if not only for my future self :).

So to recap, the current merging code doesn't perform any "unwanted dead code elimination", since it visits all nodes that are reachable by at least one procedure (whether private or public), and therefore doesn't require any change related to this concern. A future dead code elimination pass (AFAIU) would look at all local procedures that were not procref'd and remove them (as per our not-yet-documented contract that non-procref'd local procedures are considered dead).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The procedure_roots contains public and private procedures (see here) - we should definitely document this. So then the strategy is valid.

Nice, so this PR is good on that front then, i.e. we aren't stripping things out we shouldn't be. Thanks for confirming that!

For context, DYN is a bit tricky because technically we can call any value (even if not constructed with procref), and so if we're following the semantics to the letter, we should never be able to prune any node from a MastForest, because technically, it could be called with DYN (which the assembler has no way of knowing).

Agreed, technically this is true, however in prior discussions with @bobbinth, he has mentioned that he would like to restrict DYN to procedure roots, even though we haven't implemented such a restriction anywhere yet (and maybe we don't need to, really, you just end up in undefined behavior territory I suppose).

This is probably the most reasonable contract: if you never procref a procedure, we're allowed to remove it from the MastForest. Is this what we want to go for? If so, we should clearly document it - if not only for my future self :).

Agreed, this is my operating assumption for whether or not we can consider a procedure dead. Personally, I don't think we should be doing dead code elimination of any kind at this level - leave it to the assembler or the compiler to do so, as they have more information anyway. That said, I don't have an objection to doing so, as long as it is done conservatively (i.e. we can prove that a node is dead).

A future dead code elimination pass (AFAIU) would look at all local procedures that were not procref'd and remove them

Essentially, yes. More formally: a procedure is proven dead if there are no static references to the procedure via some procedure invocation instruction (or procref/external), or such references only originate from procedures which are themselves marked dead.

Most likely, we'd want to use the call graph we construct in the assembler, ensuring it includes edges for procref, and then topologically ordering the nodes in that graph, and visiting from the leaves to the roots, removing dead leaves, and edges from dead non-leaf functions to their children (and revisiting the children when that happens to see if removing the edge made them dead as well). This could be done before constructing the MastForest, though we would need to incorporate information from linked libraries, as well as determining what exported functions actually need to be preserved (i.e. exported procedures which are not part of the public interface of a library could be DCE'd if they are unused within the library).

Doing that work in the assembler makes the most sense to me, since it knows the full set of objects it is working with, and has most of the necessary metadata (if not all that is needed for this).

Anyway, I'm digressing, this PR is fine in terms of its dead code elimination effects, that's all that matters.

core/src/mast/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left some comments inline - but they are mostly nits.

stdlib/tests/mast_forest_merge.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/merger/mod.rs Outdated Show resolved Hide resolved
core/src/mast/multi_forest_node_iterator.rs Outdated Show resolved Hide resolved
core/src/mast/multi_forest_node_iterator.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you!

@bobbinth bobbinth merged commit 9f9cc63 into next Oct 28, 2024
9 checks passed
@bobbinth bobbinth deleted the pgackst-mast-forest-merge branch October 28, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merging two MAST forests together
4 participants