-
-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BACK-37] Jellyfish data migration #660
Conversation
migrations/back-37/back-37.go
Outdated
logger.WithError(err).Error("Unable to find jellyfish results") | ||
} | ||
for jellyfishDocCursor.Next(context.Background()) { | ||
jellyfishDocCursor.Decode(&jellyfishResult) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to check for error here
migrations/back-37/back-37.go
Outdated
errorCount++ | ||
continue | ||
} | ||
if !dupCursor.Next(context.Background()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be reading this incorrectly but you would want to archive if there did exist a record right? So this should not be negated.
"_id": jellyfishResult["_id"], | ||
"_deduplicator": bson.M{"$exists": true}, | ||
} | ||
dupCursor, err := m.dataRepository.Find(context.Background(), dupQuery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this ever return a document since the _id is unique and the original query on line 101 already checks for no dedeupe hash? Or can the doc be modified in b/t this query and the one on line 105?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the _id
is yet another hash of the datum's "id" fields e.g. for the smbg var idFields = ['type', 'deviceId', 'time', 'value'];
see https://github.com/tidepool-org/jellyfish/blob/35165c8dce520390213fd2fbd7b0228931d976e3/lib/schema/smbg.js#L22 for example
So if you had a duplicate value they would have the same _id
from what I understand?
@darinkrauss ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two fields, the internal mongodb _id
and our own generated id
I believe. I think you want the id
here then.
bsonData["_userId"].(string), | ||
bsonData["deviceId"].(string), | ||
bsonData["time"].(string), | ||
bsonData["type"].(string), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnt this now a Date ? Might want to make sure it's an actual string here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meant time field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crap - yeah it is. Thanks for that
m.Logger().Debug("Creating data repository") | ||
|
||
m.dataRepository = dataStore.GetRepository("deviceData") | ||
hashUpdatedCount, archivedCount, errorCount := m.migrateJellyfishDocuments() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Might want a debug log message before this line so the user can know the migration has started.
@@ -0,0 +1,54 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Generally, Go filenames are lowercase.
@@ -0,0 +1,207 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Historically, we've used "_" as a separator in directory and file names.
// represents data from base id fields | ||
var baseData = func(bsonData bson.M) []string { | ||
return []string{ | ||
bsonData["_userId"].(string), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] I wouldn't be surprised if there is old data in the production database that is missing one or more of these fields. If there is, then I believe the forced typecast will panic and crash the migration process. Do we know with 100% certainty that these four fields are on every record and are of the expected type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, even if the fields are there, you may want to do the same checks as in Base.IdentifyFields()
to ensure there aren't any missing strings or zero times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same down below in the switch
statement.
// used for BACK-37 to set hash for jellyfish datum for migrating to platform API | ||
func CreateHash(bsonData bson.M) string { | ||
// represents data from base id fields | ||
var baseData = func(bsonData bson.M) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this is a function? Why not simply calculate and use the result below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus, if you have a value of baseData
here, then you can simply append to it in the switch and then pull out the makeHash
below the switch and DRY up the code.
"_userId": userID, | ||
"_active": true, | ||
// uploads aren't de-duped. | ||
"type": bson.M{"$ne": "upload"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are upload
records in the data collection now? I thought they were moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are they still around, but just not used in the data collection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they have moved
logger.WithError(err).Error("error while fetching data. Please re-run to complete the migration.") | ||
errorCount++ | ||
} | ||
jellyfishDocCursor.Close(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving this as a defer
just below its creation.
if !m.DryRun() { | ||
if err := m.archiveDocument(jellyfishResult["_id"]); err != nil { | ||
logger.WithError(err).Error("Unable to archive jellyfish document") | ||
errorCount++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it errored, then should it count add to the archivedCount
? Same as below with hashUpdatedCount
.
archiveUpdate := bson.M{ | ||
"$set": bson.M{ | ||
"_active": false, | ||
"_archivedTime": time.Now().UnixNano() / int64(time.Millisecond), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is "archivedTime", not "_archivedTime".
deduplicatorUpdate = bson.M{ | ||
"$set": bson.M{ | ||
"_deduplicator": bson.M{ | ||
"hash": CreateHash(jfDatum), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to set the deduplicator name and version? Same below in default
clause.
superseded by #665 |
goal is to allow dedup while jellyfish is deprecated and uploader is migrated to only use platform API