Components that consume external data as a network must generate UUIDs for nodes #415

jthrilly · 2018-03-06T21:31:55Z

Only makes sense to implement this once #413 has happened.

Concerning UUID issues (#397):

Any component using external data (JSON or CSV) interpreted as a network, and requiring a UUID for each node, is responsible for creating this UUID by hashing the the node properties. We should develop a standard helper function for this purpose.
The specific hash method is TBD, but look at: https://github.com/puleos/object-hash
When a node is brought into the interview, this UUID will be transferred as with any other property, allowing direct and consistent comparison to the origin node in the external data in subsequent page loads.
This approach will work for JSON data or CSV data, since ultimately both must be converted to an array of objects to work with the node lists, or the autocomplete roster. Parsing CSV etc, is out of scope for this issue.
If hashing each time the data is loaded creates a performance issue, we can cache the hashes in redux, and use a hash of the entire dataset to check for changes on protocol load (or something similar). Where exactly this will go in redux is not certain.
Alternatively, we can provide a way for the user to mark an externalData source as containing a property that should be used as a UID. Suggestions welcome on an API in protocol.js for this.

wwqrd · 2018-03-07T12:32:07Z

We could probably use the nodejs built-ins for hashing: https://nodejs.org/api/crypto.html#crypto_class_hash

The rest of this sounds good to me!

bryfox · 2018-03-07T14:09:58Z

To be clear, it sounds like @jthrilly is talking about hashing some stringified version of a JavaScript object, rather than the source data directly (correct?). So there does need to be a well-defined serialization step before using crypto/hash. I see pros & cons to that; just want to make sure we're on the same page.

bryfox · 2018-03-07T14:17:18Z

Can you comment more on "Any component using external data"? Would every interface, for example, be responsible for generating this? Or could the loader (#413) just take care of this centrally?

bryfox · 2018-03-07T14:20:37Z

Alternatively, we can provide a way for the user to mark an externalData source as containing a property that should be used as a UID. Suggestions welcome on an API in protocol.js for this.

Can it be as simple as defining a primary key?

"externalData": {
    "clinics": {
      "primary_key": "clinic_id"
...

jthrilly · 2018-03-07T14:27:37Z

To be clear, it sounds like @jthrilly is talking about hashing some stringified version of a JavaScript object, rather than the source data directly (correct?). So there does need to be a well-defined serialization step before using crypto/hash. I see pros & cons to that; just want to make sure we're on the same page.

Yes, I am talking about hashing the serialized object, as the object-hash library I linked to does. Do you/ @wwqrd think there is a better way?

Can you comment more on "Any component using external data"? Would every interface, for example, be responsible for generating this? Or could the loader (#413) just take care of this centrally?

Currently the only instances where we need a UID to be added to external data are within the name generator interfaces. Other consumers of external data don't need this step, though conceptually you could generalize the UID injection to be effectively a map operation that happens after the data is loaded but before it is passed to the interface. If we implemented it that way, it could be part of the loading method. What do you think?

Can it be as simple as defining a primary key?

"externalData": {
"clinics": {
"primary_key": "clinic_id"
...

I think this would certainly fit the bill. It can be implemented separately, potentially in a later release.

wwqrd · 2018-03-07T14:48:56Z

I was thinking we'd use JSON.stringify(object) for that, of course that depends on JSON.stringify() being deterministic -- for which we might end up needing another library anyway...

wwqrd · 2018-03-07T14:51:08Z

Additionally object-hash uses crypto internally, so it's not like there's a performance reason for rolling our own.

bryfox · 2018-04-20T14:16:11Z

Just for completeness, the pros & cons I saw with hashing objects vs source data...

The primary downside with using an external lib is related to determinism. We'd have to be careful about any version update to that lib, since a minor change to serialization could render persisted local data unrecognized as 'equal'. (Technically, there can also be information loss once the original data is parsed — e.g., 1.0 vs 1 in JSON — but I have a hard time imagining that mattering in practice.)

Hashing the source (string) data, rather than a reconstructed version of it, avoids that.

One upside to custom serialization (via object-hash or any other mechanism) is that data is resilient to non-deterministic server output (say, with object key ordering). But that puts the burden of correctness on this app.

jthrilly · 2018-04-20T14:24:24Z

Good overview Bryan!

The primary downside with using an external lib is related to determinism. We'd have to be careful about any version update to that lib, since a minor change to serialization could render persisted local data unrecognized as 'equal'. (Technically, there can also be information loss once the original data is parsed — e.g., 1.0 vs 1 in JSON — but I have a hard time imagining that mattering in practice.)

I think realistically we are looking for within-version consistency. We won't auto-update with a new version of NC that changes this external lib. When we offer an update, it will be on the assumption that in-progress interview data will be lost. This will be up to researchers to manage.

bryfox · 2018-04-20T14:31:08Z

And following up from #443 — getNodeLabelFunction should be updated to use whatever primary key (uid?) is chosen here, replacing id.

bryfox · 2018-07-23T19:21:17Z

If hashing each time the data is loaded creates a performance issue

Using the above library, caching 1000 distinct objects, each ~450 bytes (based on the clinic example in the dev protocol), default settings:

electron, recent macbook pro: ~0.3s
iPad Air: ~4s
Android emulator: ~1.2s
Android Nexus 5x: ~0.8s

As above, but with md5:

electron, recent macbook pro: ~0.3s
iPad Air: ~2s
Android emulator: ~0.6s
Android Nexus 5x: ~0.8s

It probably makes sense to cache the values; mechanism will depend on some details in #413.

jthrilly added enhancement refactor labels Mar 6, 2018

jthrilly added this to the Alpha.4 milestone Mar 6, 2018

jthrilly mentioned this issue Mar 7, 2018

Refactor externalData in protocol #413

Closed

rebeccamadsen mentioned this issue Mar 12, 2018

add missing uids #420

Merged

This was referenced Apr 12, 2018

Feature/display label defaults #443

Merged

node field/prop collisions #446

Closed

jthrilly modified the milestones: Alpha.4 - Gresley, Alpha.6 May 28, 2018

bryfox mentioned this issue Jun 25, 2018

Edge still exists when a node is deleted #437

Closed

bryfox mentioned this issue Jul 20, 2018

Arbitrary node labeling #613

Merged

bryfox self-assigned this Jul 23, 2018

bryfox mentioned this issue Jul 25, 2018

Ensure uids exist for all nodes in app #615

Merged

bryfox closed this as completed in #615 Aug 1, 2018

bryfox mentioned this issue Dec 19, 2018

Scope node IDs to session complexdatacollective/Server#217

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Components that consume external data as a network must generate UUIDs for nodes #415

Components that consume external data as a network must generate UUIDs for nodes #415

jthrilly commented Mar 6, 2018 •

edited

Loading

wwqrd commented Mar 7, 2018

bryfox commented Mar 7, 2018

bryfox commented Mar 7, 2018

bryfox commented Mar 7, 2018

jthrilly commented Mar 7, 2018 •

edited

Loading

wwqrd commented Mar 7, 2018

wwqrd commented Mar 7, 2018 •

edited

Loading

bryfox commented Apr 20, 2018

jthrilly commented Apr 20, 2018

bryfox commented Apr 20, 2018

bryfox commented Jul 23, 2018

Components that consume external data as a network must generate UUIDs for nodes #415

Components that consume external data as a network must generate UUIDs for nodes #415

Comments

jthrilly commented Mar 6, 2018 • edited Loading

wwqrd commented Mar 7, 2018

bryfox commented Mar 7, 2018

bryfox commented Mar 7, 2018

bryfox commented Mar 7, 2018

jthrilly commented Mar 7, 2018 • edited Loading

wwqrd commented Mar 7, 2018

wwqrd commented Mar 7, 2018 • edited Loading

bryfox commented Apr 20, 2018

jthrilly commented Apr 20, 2018

bryfox commented Apr 20, 2018

bryfox commented Jul 23, 2018

jthrilly commented Mar 6, 2018 •

edited

Loading

jthrilly commented Mar 7, 2018 •

edited

Loading

wwqrd commented Mar 7, 2018 •

edited

Loading