-
Notifications
You must be signed in to change notification settings - Fork 634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DYN-6666] Improvements to workspace checksum that is needed for Dynamo ML data pipeline. #15010
Conversation
UI Smoke TestsTest: success. 2 passed, 0 failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with comments, do we need a test to calculate the checksum based on connections?
Still 6 API errors: |
Thanks @QilongTang, added them to PublicAPI.Unshipped.txt. |
{ | ||
using (var reader = XmlReader.Create(dynamoMLDataPath)) | ||
{ | ||
checksums = (List<GraphChecksumPair>)serializer.Deserialize(reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be wrapped in a try catch, what if the data is corrupt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added exception handling. Addressed here: #15116
/// Return a dictionary of GraphChecksumItems. | ||
/// Key will be the workspace guid and its value will be a list of saved checksums(sha256 hash) for that workspace. | ||
/// </summary> | ||
internal Dictionary<string, List<string>> GraphChecksumDictionary { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain why we need both of these data structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, seems like if you used JSON you could just serialize this directly and skip the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes using JSON serialization now. #15116
JProperty endProperty = connectorProperties.FirstOrDefault(x => x.Name == "End"); | ||
string inputId = (String)endProperty.Value; | ||
// node info connections has a unique id in the format: startnodeid[outputindex]endnodeid[outputindex]. | ||
nodeInfoConnections.Add(startingPort.Owner.AstIdentifierGuid + "[" + startingPort.Index.ToString() + "]" + endingPort.Owner.AstIdentifierGuid + "[" + endingPort.Index.ToString() + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still bizarre to me - I guess you want to base your hash data only on the connections of the graph and no other properties, and in this case sha256 is a good has to choose as any small change in the input will create a big change in the output.
What I was imagining originally was that you could just use a hash that generates collisions - like SHA1 or MD5 - then just use the current JSON of the entire graph, but that is much less controllable than the solution you have here - depends on what you need exactly.
} | ||
} | ||
else | ||
{ | ||
PreferenceSettings.GraphChecksumItemsList.Add(new GraphChecksumItem() { GraphId = graphId, Checksum = currentWorkspaceViewModel.Checksum }); | ||
substantialChecksum = true; | ||
Model.GraphChecksumDictionary.Add(graphId, new List<string>() { currentWorkspaceViewModel.CurrentCheckSum }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there ever more than one item in this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, each workspace can have a list of unique checksums.
Model.GraphChecksumDictionary.TryGetValue(graphId, out List<string> checksums); | ||
|
||
// compare the current checksum with previous hash values. | ||
if (checksums != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large can this list get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only serializing unique checksums of saved workspaces(on closing the workspace). I think the idea is turn this feature off once we have enough data to train the ML model. If the feature is off, the checksums are not calculated or serialized.
/// </summary> | ||
[JsonIgnore] | ||
public string Checksum | ||
{ | ||
get | ||
{ | ||
List<string> nodeInfoConnections = new List<string>(); | ||
JObject jsonWorkspace = JsonRepresentation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you get rid of JsonRepresentation
? - I think it may have been added just for this method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used here:
JsonRepresentation = JObject.Parse(saveContent); |
Purpose
Improvements are done to the checksum computation. that is used to determine if the current workspace should be sent to ML data service.
Declarations
Check these if you believe they are true
*.resx
filesRelease Notes
Reviewers
@QilongTang @mjkkirschner