[DYN-6666] Improvements to workspace checksum that is needed for Dynamo ML data pipeline. #15010

Merged
merged 3 commits into from
Mar 26, 2024

Contributor

@reddyashish reddyashish commented Mar 12, 2024

Purpose

This PR improves the checksum computation that is used to determine whether the current workspace should be sent to the ML data service.

  1. The checksum is now built from the node connections, which are converted into a SHA-256 hash value used to determine whether the workspace is unique.
  2. This info is serialized to a separate file under the Dynamo user directory: AppData\Dynamo\Dynamo Core\3.1\DynamoMLDataPipeline.xml.
  3. The serialized data format is changed from a list to a dictionary so that all unique checksum hashes for a particular workspace are stored. This avoids resending workspaces that have already been sent.
  4. The node connections are no longer read from the JSON again; the in-memory data is used to calculate the checksum string.
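
The four points above can be illustrated with a minimal sketch of a connection-based checksum. It is not the code from this PR: `ConnectionChecksumSketch` and its methods are hypothetical names, and sorting the identifiers before hashing is an assumption made here so that connector order does not affect the result. Only the identifier format (node id plus port index for each end of a connector) and the use of SHA-256 come from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of Dynamo.
public static class ConnectionChecksumSketch
{
    // Builds one identifier per connector: startNodeId[outputIndex]endNodeId[inputIndex].
    public static string ConnectionId(string startNodeId, int outputIndex, string endNodeId, int inputIndex)
    {
        return $"{startNodeId}[{outputIndex}]{endNodeId}[{inputIndex}]";
    }

    // Hashes all connection identifiers into a lowercase hex SHA-256 string.
    // Assumption: identifiers are sorted first so the hash is order-independent.
    public static string ComputeChecksum(IEnumerable<string> connectionIds)
    {
        string joined = string.Join(",", connectionIds.OrderBy(id => id, StringComparer.Ordinal));
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(joined));
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

With this sketch, moving a node on the canvas or renaming it leaves the checksum unchanged, while adding or rewiring a connector produces a different hash.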

Declarations

Check these if you believe they are true

  • The codebase is in a better state after this PR
  • Is documented according to the standards
  • The level of testing this PR includes is appropriate
  • User facing strings, if any, are extracted into *.resx files
  • All tests pass using the self-service CI.
  • Snapshot of UI changes, if any.
  • Changes to the API follow Semantic Versioning and are documented in the API Changes document.
  • This PR modifies some build requirements and the readme is updated
  • This PR contains no files larger than 50 MB

Release Notes

Reviewers

@QilongTang @mjkkirschner

@reddyashish reddyashish changed the title [DYN-6666] Improvements to workspace Checksum that is needed for Dynamo ML data pipeline. [DYN-6666] Improvements to workspace checksum that is needed for Dynamo ML data pipeline. Mar 12, 2024

github-actions bot commented Mar 12, 2024

UI Smoke Tests

Test: success. 2 passed, 0 failed.
TestComplete Test Result
Workflow Run: UI Smoke Tests
Check: UI Smoke Tests - net8.0

@QilongTang QilongTang added this to the 3.1 milestone Mar 13, 2024
Contributor

@QilongTang QilongTang left a comment

LGTM with comments. Do we need a test that calculates the checksum based on connections?

@QilongTang
Contributor

Still 6 API errors:
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(20,18): error RS0016: Symbol 'GraphChecksumPair' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(20,18): error RS0016: Symbol 'implicit constructor for 'GraphChecksumPair'' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(22,33): error RS0016: Symbol 'GraphId.get' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(22,38): error RS0016: Symbol 'GraphId.set' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(24,40): error RS0016: Symbol 'Checksum.get' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]
D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\Configuration\GraphChecksumItem.cs(24,45): error RS0016: Symbol 'Checksum.set' is not part of the declared public API (https://github.com/dotnet/roslyn-analyzers/blob/main/src/PublicApiAnalyzers/PublicApiAnalyzers.Help.md) [D:\a\Dynamo\Dynamo\Dynamo\src\DynamoCore\DynamoCore.csproj]

@reddyashish
Contributor Author

Thanks @QilongTang, added them to PublicAPI.Unshipped.txt.

@reddyashish reddyashish merged commit cb9fe46 into DynamoDS:master Mar 26, 2024
22 checks passed
{
    using (var reader = XmlReader.Create(dynamoMLDataPath))
    {
        checksums = (List<GraphChecksumPair>)serializer.Deserialize(reader);
Member

Maybe this should be wrapped in a try/catch. What if the data is corrupt?

Contributor Author

Added exception handling. Addressed here: #15116
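
The actual fix lives in #15116 and is not shown in this thread. As a rough sketch of the kind of guard the reviewer is asking for, the reader below falls back to an empty list when the file is missing or corrupt. `ChecksumPairSketch` and `ChecksumFileReaderSketch` are hypothetical stand-ins, not the PR's `GraphChecksumPair` or any Dynamo type, and returning an empty list on failure is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for a serialized graph id / checksum pair.
public class ChecksumPairSketch
{
    public string GraphId { get; set; }
    public string Checksum { get; set; }
}

public static class ChecksumFileReaderSketch
{
    // Reads the checksum list, returning an empty list if the file is
    // missing, unreadable, or not valid XML for this type.
    public static List<ChecksumPairSketch> ReadOrEmpty(string path)
    {
        if (!File.Exists(path))
        {
            return new List<ChecksumPairSketch>();
        }

        try
        {
            var serializer = new XmlSerializer(typeof(List<ChecksumPairSketch>));
            using (var reader = XmlReader.Create(path))
            {
                return (List<ChecksumPairSketch>)serializer.Deserialize(reader)
                       ?? new List<ChecksumPairSketch>();
            }
        }
        // XmlSerializer reports malformed content as InvalidOperationException.
        catch (Exception ex) when (ex is InvalidOperationException || ex is XmlException || ex is IOException)
        {
            return new List<ChecksumPairSketch>();
        }
    }
}
```

Treating a corrupt file as "no checksums recorded yet" means the worst case is resending a workspace once, rather than failing while the workspace is being closed.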

/// Return a dictionary of GraphChecksumItems.
/// Key will be the workspace guid and its value will be a list of saved checksums(sha256 hash) for that workspace.
/// </summary>
internal Dictionary<string, List<string>> GraphChecksumDictionary { get; set; }
Member

can you explain why we need both of these data structures?

Member

ah, seems like if you used JSON you could just serialize this directly and skip the list.

Contributor Author

Yes, this uses JSON serialization now. See #15116.
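
For context on why the XML version needed both structures: `XmlSerializer` does not serialize `Dictionary<TKey, TValue>` directly, whereas a JSON serializer can round-trip the dictionary as is. The sketch below shows that round trip. It uses `System.Text.Json` so that it is self-contained, which is an assumption of this example; the code in this PR uses Newtonsoft types such as `JObject`, and the change in #15116 is not shown here. `ChecksumStoreJsonSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of Dynamo.
public static class ChecksumStoreJsonSketch
{
    // Serializes the workspace-guid -> checksum-list dictionary directly,
    // with no intermediate list of pairs.
    public static string ToJson(Dictionary<string, List<string>> store)
    {
        return JsonSerializer.Serialize(store);
    }

    // Restores the dictionary; empty or "null" input yields an empty dictionary.
    public static Dictionary<string, List<string>> FromJson(string json)
    {
        if (string.IsNullOrWhiteSpace(json))
        {
            return new Dictionary<string, List<string>>();
        }
        return JsonSerializer.Deserialize<Dictionary<string, List<string>>>(json)
               ?? new Dictionary<string, List<string>>();
    }
}
```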

JProperty endProperty = connectorProperties.FirstOrDefault(x => x.Name == "End");
string inputId = (String)endProperty.Value;
// node info connections has a unique id in the format: startnodeid[outputindex]endnodeid[inputindex].
nodeInfoConnections.Add(startingPort.Owner.AstIdentifierGuid + "[" + startingPort.Index.ToString() + "]" + endingPort.Owner.AstIdentifierGuid + "[" + endingPort.Index.ToString() + "]");
Member

@mjkkirschner mjkkirschner Apr 1, 2024

This is still bizarre to me. I guess you want to base your hash data only on the connections of the graph and no other properties, and in this case SHA-256 is a good hash to choose, as any small change in the input will create a big change in the output.

What I was imagining originally was that you could just use a hash that generates collisions - like SHA1 or MD5 - then just use the current JSON of the entire graph, but that is much less controllable than the solution you have here - depends on what you need exactly.

}
}
else
{
PreferenceSettings.GraphChecksumItemsList.Add(new GraphChecksumItem() { GraphId = graphId, Checksum = currentWorkspaceViewModel.Checksum });
substantialChecksum = true;
Model.GraphChecksumDictionary.Add(graphId, new List<string>() { currentWorkspaceViewModel.CurrentCheckSum });
Member

is there ever more than one item in this list?

Contributor Author

Yes, each workspace can have a list of unique checksums.
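
A small sketch of how such a per-workspace list can be used to decide whether a workspace is new. `ChecksumRegistrySketch.TryRegister` is a hypothetical helper written for this example, not a method from the PR; it only mirrors the `TryGetValue` and `Add` calls visible in the diff snippets in this conversation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Dynamo.
public static class ChecksumRegistrySketch
{
    // Returns true when the checksum has not been seen for this workspace
    // (and records it); returns false for a duplicate.
    public static bool TryRegister(Dictionary<string, List<string>> store, string graphId, string checksum)
    {
        if (store.TryGetValue(graphId, out List<string> checksums))
        {
            if (checksums.Contains(checksum))
            {
                return false; // this exact connection layout was already recorded
            }
            checksums.Add(checksum);
            return true;
        }

        // First time this workspace is seen: start its list.
        store.Add(graphId, new List<string> { checksum });
        return true;
    }
}
```

The list grows by one entry each time a workspace is saved with a connection layout that has not been recorded before, which is how a single workspace guid ends up with several checksums.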

Model.GraphChecksumDictionary.TryGetValue(graphId, out List<string> checksums);

// compare the current checksum with previous hash values.
if (checksums != null)
Member

How large can this list get?

Contributor Author

We are only serializing unique checksums of saved workspaces (on closing the workspace). I think the idea is to turn this feature off once we have enough data to train the ML model. If the feature is off, the checksums are not calculated or serialized.

/// </summary>
[JsonIgnore]
public string Checksum
{
    get
    {
        List<string> nodeInfoConnections = new List<string>();
        JObject jsonWorkspace = JsonRepresentation;
Member

can you get rid of JsonRepresentation? - I think it may have been added just for this method...

Contributor Author

It is used here:

JsonRepresentation = JObject.Parse(saveContent);
