-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement MastForest
merging
#1534
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - really love the docstrings!
Not a full review yet.
/// and all children have no decorator roots (meaning that there are no decorators in all the | ||
/// descendants). | ||
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] | ||
pub struct EqHash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you suggested, I think MastNodeFingerprint
is a better name!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a follow-up PR once this PR is somewhat finalized 👍
I actually think it may be fine to have some tests for merging forests in the |
e96b6ed
to
728107f
Compare
I updated the implementation to use the desired merging behavior for external nodes with decorators present. The refactor to support this turned out to be fairly complex but I think it's alright now. I still need to add documentation in many places, handle some errors and probably some other small things, but conceptually I think it is in a reviewable state now again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Looks good! I left some comments inline. The main one is related to handling of external nodes.
Additionally, I think we still need to return the maps with ID rempappints from the merge()
method. Maybe the interface could look something like this:
pub fn merge(
forests: impl IntoIterator<Item = &MastForest>
) -> Result<(MastForest, Vec<MastRootMap>), MastForestError> {
...
}
Where MastRootMap
(other name suggestions welcome) would provide a way to map a root node ID from the input forests to a node ID in the output forest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more nits
4525cd6
to
9394a3a
Compare
I implemented the refactoring now, but there are still some things left to clean up for sure. I added an assertion to most tests now that assert that the "child id < parent id" property holds, so that's nice 🎉. Also, sorry for the huge code changes, hopefully that was the last one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you! Not a deep review yet, but I did go though the high level logic and left a few minor comments.
Also, I think some comments from the previous review still need to be addressed (e.g., #1534 (comment)).
After adding a test that used the |
/// | ||
/// Merging two forests means combining all their constituent parts, i.e. [`MastNode`]s, | ||
/// [`Decorator`]s and roots. During this process, any duplicate or | ||
/// unreachable nodes are removed. Additionally, [`MastNodeId`]s of nodes as well as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I brought this up before when @plafer looked into implementing a type of dead code elimination when building forests, and want to make sure we haven't unintentionally reintroduced the same issue:
How do you define whether a node is unreachable? With the presence of DYN
, the only way to say for sure that a node is unreachable is one of two options:
- The node is not a procedure root (regardless of visibility, i.e. a procedure does not have to be exported to be dynamically invoked), and it is not reachable by any other node. A very unlikely scenario, but nevertheless it is an edge case.
- The node is a procedure root with non-public visibility (i.e. not exported), and it's address (hash) is never captured via
procref
anywhere in its translation unit.
Procedure roots with public visibility are of course part of the public API of a library, so cannot ever be considered unreachable.
So as long as we are only removing nodes classified as unreachable by the above rules, that should be fine. However, AFAIK we haven't yet implemented the capturing of procref
uses, so that leaves a very narrow set of nodes we can safely prune.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up!
I think it's easier to describe what is visited than what is not, due to how it is implemented:
The iterator that implements the DFS traversal iterates through all MastForest::procedure_roots()
, and from each of those discovers the tree of that node (this is the code). Hence all nodes reachable from procedure roots are visited. Other nodes are not visited. So since all nodes that are procedure roots and reachable from those nodes are visited and the others aren't I think your first point is covered, except maybe for the visibility part. One question I have is: Are the roots returned by MastForest::procedure_roots()
only the public ones, or also the private ones? I'm guessing they're the public ones and there may be nodes in the forest that are procedures but aren't specified in MastForest::roots
.
For the second point, what does procref
translate to in the forest? An external node, but possibly one that is not in MastForest::roots
? If so, then the node discovery would have to be changed.
To address everything at once, would the safest option perhaps be to iterate not over the MastForest::procedure_roots
but over all nodes instead? We might copy dead nodes but for sure we wouldn't be removing nodes that are actually needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So since all nodes that are procedure roots and reachable from those nodes are visited and the others aren't I think your first point is covered
Yep! That sounds exactly right (of course, as I noted, it would be exceedingly rare for any nodes of a forest to be unreachable by this rule, since we construct the MastForest the same way, so such code would have been pruned when constructing the forest). That said, I think it is a good idea to mention this as you have in the docs.
One question I have is: Are the roots returned by MastForest::procedure_roots() only the public ones, or also the private ones? I'm guessing they're the public ones and there may be nodes in the forest that are procedures but aren't specified in MastForest::roots.
/cc @plafer
I believe that originally produced all of them, but I think that may have changed, and they are now only exported procedures, which is why I bring this up. Internally, we're supposed to be able to identify whether a given node is a non-exported procedure root though, so we should be able to recognize them when visiting the forest(s).
For the second point, what does procref translate to in the forest? An external node, but possibly one that is not in MastForest::roots? If so, then the node discovery would have to be changed.
It gets rewritten to push.HASH
, where HASH
is the procedure root of the procedure it references. Later, the procedure would be called via DYN
, with the procedure root on the stack (though maybe this is changing to be a memory address, rather than the procedure hash, but what I've described is how it has historically worked).
To address everything at once, would the safest option perhaps be to iterate not over the MastForest::procedure_roots but over all nodes instead? We might copy dead nodes but for sure we wouldn't be removing nodes that are actually needed.
I think the key is to ensure that you visit all roots regardless of whether they are part of the set of exported roots. IIRC that information is inside the MastForest (it must be, in order to implement DYN
). As an aside, we should definitely ensure that a MastForest
can be queried via its public API for all procedure roots, not just exports, though we don't necessarily need to provide names for non-exported procedures like we do exports.
Dead code elimination will come later, when we introduce the analysis that we can use to implement that transformation, so I wouldn't worry about trying to do that here (it is better if we don't actually).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, we should definitely ensure that a MastForest can be queried via its public API for all procedure roots, not just exports, though we don't necessarily need to provide names for non-exported procedures like we do exports.
Thinking on this further, I take back my point about not providing names, I think we also want to preserve names for non-exported procedures as well, so that we can pretty print stack traces in a debugger. Without that information, at best we'll only have source locations (file/line), but no names in a stack trace (unless it can be recovered by other means, but that shouldn't be relied upon IMO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the roots returned by MastForest::procedure_roots() only the public ones, or also the private ones?
The procedure_roots
contains public and private procedures (see here) - we should definitely document this. So then the strategy is valid.
I brought this up before when @plafer looked into implementing a type of dead code elimination when building forests, and want to make sure we haven't unintentionally reintroduced the same issue:
You're right, and I keep forgetting about this issue, as a symptom that it's still not clear in my mind exactly what our strategy for DYN
and dead code elimination is. For context, DYN
is a bit tricky because technically we can call any value (even if not constructed with procref
), and so if we're following the semantics to the letter, we should never be able to prune any node from a MastForest
, because technically, it could be called with DYN
(which the assembler has no way of knowing).
With the presence of DYN, the only way to say for sure that a node is unreachable is one of two options:
This is probably the most reasonable contract: if you never procref
a procedure, we're allowed to remove it from the MastForest
. Is this what we want to go for? If so, we should clearly document it - if not only for my future self :).
So to recap, the current merging code doesn't perform any "unwanted dead code elimination", since it visits all nodes that are reachable by at least one procedure (whether private or public), and therefore doesn't require any change related to this concern. A future dead code elimination pass (AFAIU) would look at all local procedures that were not procref'd and remove them (as per our not-yet-documented contract that non-procref'd local procedures are considered dead).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The procedure_roots contains public and private procedures (see here) - we should definitely document this. So then the strategy is valid.
Nice, so this PR is good on that front then, i.e. we aren't stripping things out we shouldn't be. Thanks for confirming that!
For context, DYN is a bit tricky because technically we can call any value (even if not constructed with procref), and so if we're following the semantics to the letter, we should never be able to prune any node from a MastForest, because technically, it could be called with DYN (which the assembler has no way of knowing).
Agreed, technically this is true, however in prior discussions with @bobbinth, he has mentioned that he would like to restrict DYN
to procedure roots, even though we haven't implemented such a restriction anywhere yet (and maybe we don't need to, really, you just end up in undefined behavior territory I suppose).
This is probably the most reasonable contract: if you never procref a procedure, we're allowed to remove it from the MastForest. Is this what we want to go for? If so, we should clearly document it - if not only for my future self :).
Agreed, this is my operating assumption for whether or not we can consider a procedure dead. Personally, I don't think we should be doing dead code elimination of any kind at this level - leave it to the assembler or the compiler to do so, as they have more information anyway. That said, I don't have an objection to doing so, as long as it is done conservatively (i.e. we can prove that a node is dead).
A future dead code elimination pass (AFAIU) would look at all local procedures that were not procref'd and remove them
Essentially, yes. More formally: a procedure is proven dead if there are no static references to the procedure via some procedure invocation instruction (or procref
/external
), or such references only originate from procedures which are themselves marked dead.
Most likely, we'd want to use the call graph we construct in the assembler, ensuring it includes edges for procref
, and then topologically ordering the nodes in that graph, and visiting from the leaves to the roots, removing dead leaves, and edges from dead non-leaf functions to their children (and revisiting the children when that happens to see if removing the edge made them dead as well). This could be done before constructing the MastForest, though we would need to incorporate information from linked libraries, as well as determining what exported functions actually need to be preserved (i.e. exported procedures which are not part of the public interface of a library could be DCE'd if they are unused within the library).
Doing that work in the assembler makes the most sense to me, since it knows the full set of objects it is working with, and has most of the necessary metadata (if not all that is needed for this).
Anyway, I'm digressing, this PR is fine in terms of its dead code elimination effects, that's all that matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you! I left some comments inline - but they are mostly nits.
This reverts commit 9394a3a.
032bce5
to
9d5400f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you!
Describe your changes
Closes #1529
Implements
MastForest
merging.The implementation is explained in comments and it's hopefully sufficient to understand the PR, which is why I'm not adding the details here again. If any explanation is insufficient I'm happy to expand on it!
Couple of notes and ideas:
MastForestDfsIter
could be 1) made public and 2) could be made more general to allow for in-order, pre-order as well as post-order iteration ofMastForest
s. Let me know if any of these are desirable. For now it seems best to keep them private.MastForestMerger
is also kept private because the useful functionality can be exposed through the methods added onMastForest
directly and there is no benefit in terms of functionality by making it public.EqHash
intomiden-core
.miden-core
so that it can be used inmiden-assembly
byMastForestBuilder
where it originally lived. MovingMastForestBuilder
into core doesn't seem fitting since it is used only for assembly and specially made for that. Overall it's not bad to haveEqHash
public in core.MastNodeFingerprint
and perhaps introducepub type DecoratorFingerprint = Blake3Digest<32>
for decorators. I think the "fingerprint" nicely describes that this type can be used to compare mast nodes although I'm of course open to other suggestions.miden-core
without anAssembler
. I don't want to move those test intomiden-assembly
though as that separates the implementation and the tests crate-wise. Even with an assembler, it would still be somewhat challenging to come up with the expected merged forest to assert against without basically handwriting it. Although I suppose asserting a couple of properties might be sufficient instead of the entire forest, so it would still be an improvement. One possibility I see to add test in miden-core would be to generate snapshots inmiden-assembly
, for example, and add them as plaintext serializedMastForest
s to miden-core. Snapshot testing could be done with something likeinsta
or without, I suppose. This would require adding#[cfg_attr(test, derive(Serialize, Deserialize))]
toMastForest
and all its descendants which is fine I guess. I'm happy to receive some general input on the testing story for this task though.Checklist before requesting a review
next
according to naming convention.