Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Components that consume external data as a network must generate UUIDs for nodes #415

Closed
jthrilly opened this issue Mar 6, 2018 · 11 comments · Fixed by #615
Closed

Components that consume external data as a network must generate UUIDs for nodes #415

jthrilly opened this issue Mar 6, 2018 · 11 comments · Fixed by #615
Assignees
Milestone

Comments

@jthrilly
Copy link
Member

jthrilly commented Mar 6, 2018

Only makes sense to implement this once #413 has happened.

Concerning UUID issues (#397):

  • Any component using external data (JSON or CSV) interpreted as a network, and requiring a UUID for each node, is responsible for creating this UUID by hashing the the node properties. We should develop a standard helper function for this purpose.
  • The specific hash method is TBD, but look at: https://github.com/puleos/object-hash
  • When a node is brought into the interview, this UUID will be transferred as with any other property, allowing direct and consistent comparison to the origin node in the external data in subsequent page loads.
  • This approach will work for JSON data or CSV data, since ultimately both must be converted to an array of objects to work with the node lists, or the autocomplete roster. Parsing CSV etc, is out of scope for this issue.
  • If hashing each time the data is loaded creates a performance issue, we can cache the hashes in redux, and use a hash of the entire dataset to check for changes on protocol load (or something similar). Where exactly this will go in redux is not certain.
  • Alternatively, we can provide a way for the user to mark an externalData source as containing a property that should be used as a UID. Suggestions welcome on an API in protocol.js for this.
@jthrilly jthrilly added this to the Alpha.4 milestone Mar 6, 2018
@wwqrd
Copy link
Contributor

wwqrd commented Mar 7, 2018

We could probably use the nodejs built-ins for hashing: https://nodejs.org/api/crypto.html#crypto_class_hash

The rest of this sounds good to me!

@bryfox
Copy link
Contributor

bryfox commented Mar 7, 2018

To be clear, it sounds like @jthrilly is talking about hashing some stringified version of a JavaScript object, rather than the source data directly (correct?). So there does need to be a well-defined serialization step before using crypto/hash. I see pros & cons to that; just want to make sure we're on the same page.

@bryfox
Copy link
Contributor

bryfox commented Mar 7, 2018

Can you comment more on "Any component using external data"? Would every interface, for example, be responsible for generating this? Or could the loader (#413) just take care of this centrally?

@bryfox
Copy link
Contributor

bryfox commented Mar 7, 2018

Alternatively, we can provide a way for the user to mark an externalData source as containing a property that should be used as a UID. Suggestions welcome on an API in protocol.js for this.

Can it be as simple as defining a primary key?

"externalData": {
    "clinics": {
      "primary_key": "clinic_id"
...

@jthrilly
Copy link
Member Author

jthrilly commented Mar 7, 2018

To be clear, it sounds like @jthrilly is talking about hashing some stringified version of a JavaScript object, rather than the source data directly (correct?). So there does need to be a well-defined serialization step before using crypto/hash. I see pros & cons to that; just want to make sure we're on the same page.

Yes, I am talking about hashing the serialized object, as the object-hash library I linked to does. Do you/ @wwqrd think there is a better way?

Can you comment more on "Any component using external data"? Would every interface, for example, be responsible for generating this? Or could the loader (#413) just take care of this centrally?

Currently the only instances where we need a UID to be added to external data are within the name generator interfaces. Other consumers of external data don't need this step, though conceptually you could generalize the UID injection to be effectively a map operation that happens after the data is loaded but before it is passed to the interface. If we implemented it that way, it could be part of the loading method. What do you think?

Can it be as simple as defining a primary key?

"externalData": {
"clinics": {
"primary_key": "clinic_id"
...

I think this would certainly fit the bill. It can be implemented separately, potentially in a later release.

@wwqrd
Copy link
Contributor

wwqrd commented Mar 7, 2018

I was thinking we'd use JSON.stringify(object) for that, of course that depends on JSON.stringify() being deterministic -- for which we might end up needing another library anyway...

@wwqrd
Copy link
Contributor

wwqrd commented Mar 7, 2018

Additionally object-hash uses crypto internally, so it's not like there's a performance reason for rolling our own.

@bryfox
Copy link
Contributor

bryfox commented Apr 20, 2018

Just for completeness, the pros & cons I saw with hashing objects vs source data...

The primary downside with using an external lib is related to determinism. We'd have to be careful about any version update to that lib, since a minor change to serialization could render persisted local data unrecognized as 'equal'. (Technically, there can also be information loss once the original data is parsed — e.g., 1.0 vs 1 in JSON — but I have a hard time imagining that mattering in practice.)

Hashing the source (string) data, rather than a reconstructed version of it, avoids that.

One upside to custom serialization (via object-hash or any other mechanism) is that data is resilient to non-deterministic server output (say, with object key ordering). But that puts the burden of correctness on this app.

@jthrilly
Copy link
Member Author

Good overview Bryan!

The primary downside with using an external lib is related to determinism. We'd have to be careful about any version update to that lib, since a minor change to serialization could render persisted local data unrecognized as 'equal'. (Technically, there can also be information loss once the original data is parsed — e.g., 1.0 vs 1 in JSON — but I have a hard time imagining that mattering in practice.)

I think realistically we are looking for within-version consistency. We won't auto-update with a new version of NC that changes this external lib. When we offer an update, it will be on the assumption that in-progress interview data will be lost. This will be up to researchers to manage.

@bryfox
Copy link
Contributor

bryfox commented Apr 20, 2018

And following up from #443 — getNodeLabelFunction should be updated to use whatever primary key (uid?) is chosen here, replacing id.

@bryfox
Copy link
Contributor

bryfox commented Jul 23, 2018

If hashing each time the data is loaded creates a performance issue

Using the above library, caching 1000 distinct objects, each ~450 bytes (based on the clinic example in the dev protocol), default settings:

  • electron, recent macbook pro: ~0.3s
  • iPad Air: ~4s
  • Android emulator: ~1.2s
  • Android Nexus 5x: ~0.8s

As above, but with md5:

  • electron, recent macbook pro: ~0.3s
  • iPad Air: ~2s
  • Android emulator: ~0.6s
  • Android Nexus 5x: ~0.8s

It probably makes sense to cache the values; mechanism will depend on some details in #413.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants