Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BACK-37] Jellyfish data migration #660

Closed
wants to merge 14 commits into from
Closed

Conversation

jh-bate
Copy link
Contributor

@jh-bate jh-bate commented Sep 11, 2023

goal is to allow dedup while jellyfish is deprecated and uploader is migrated to only use platform API

@jh-bate jh-bate changed the title Jellyfish data migration [BACK-37] Jellyfish data migration Sep 11, 2023
@jh-bate jh-bate marked this pull request as ready for review September 25, 2023 23:00
@jh-bate jh-bate requested a review from lostlevels September 25, 2023 23:00
logger.WithError(err).Error("Unable to find jellyfish results")
}
for jellyfishDocCursor.Next(context.Background()) {
jellyfishDocCursor.Decode(&jellyfishResult)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to check for error here

errorCount++
continue
}
if !dupCursor.Next(context.Background()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be reading this incorrectly but you would want to archive if there did exist a record right? So this should not be negated.

"_id": jellyfishResult["_id"],
"_deduplicator": bson.M{"$exists": true},
}
dupCursor, err := m.dataRepository.Find(context.Background(), dupQuery)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this ever return a document since the _id is unique and the original query on line 101 already checks for no dedeupe hash? Or can the doc be modified in b/t this query and the one on line 105?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the _id is yet another hash of the datum's "id" fields e.g. for the smbg var idFields = ['type', 'deviceId', 'time', 'value'];

see https://github.com/tidepool-org/jellyfish/blob/35165c8dce520390213fd2fbd7b0228931d976e3/lib/schema/smbg.js#L22 for example

So if you had a duplicate value they would have the same _id from what I understand?

@darinkrauss ??

Copy link
Contributor

@lostlevels lostlevels Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two fields, the internal mongodb _id and our own generated id I believe. I think you want the id here then.

bsonData["_userId"].(string),
bsonData["deviceId"].(string),
bsonData["time"].(string),
bsonData["type"].(string),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt this now a Date ? Might want to make sure it's an actual string here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meant time field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crap - yeah it is. Thanks for that

@jh-bate jh-bate requested a review from darinkrauss September 27, 2023 00:43
m.Logger().Debug("Creating data repository")

m.dataRepository = dataStore.GetRepository("deviceData")
hashUpdatedCount, archivedCount, errorCount := m.migrateJellyfishDocuments()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Might want a debug log message before this line so the user can know the migration has started.

@@ -0,0 +1,54 @@
package main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Generally, Go filenames are lowercase.

@@ -0,0 +1,207 @@
package main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Historically, we've used "_" as a separator in directory and file names.

// represents data from base id fields
var baseData = func(bsonData bson.M) []string {
return []string{
bsonData["_userId"].(string),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I wouldn't be surprised if there is old data in the production database that is missing one or more of these fields. If there is, then I believe the forced typecast will panic and crash the migration process. Do we know with 100% certainty that these four fields are on every record and are of the expected type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, even if the fields are there, you may want to do the same checks as in Base.IdentifyFields() to ensure there aren't any missing strings or zero times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same down below in the switch statement.

// used for BACK-37 to set hash for jellyfish datum for migrating to platform API
func CreateHash(bsonData bson.M) string {
// represents data from base id fields
var baseData = func(bsonData bson.M) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is a function? Why not simply calculate and use the result below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus, if you have a value of baseData here, then you can simply append to it in the switch and then pull out the makeHash below the switch and DRY up the code.

"_userId": userID,
"_active": true,
// uploads aren't de-duped.
"type": bson.M{"$ne": "upload"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are upload records in the data collection now? I thought they were moved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or are they still around, but just not used in the data collection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they have moved

logger.WithError(err).Error("error while fetching data. Please re-run to complete the migration.")
errorCount++
}
jellyfishDocCursor.Close(context.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving this as a defer just below its creation.

if !m.DryRun() {
if err := m.archiveDocument(jellyfishResult["_id"]); err != nil {
logger.WithError(err).Error("Unable to archive jellyfish document")
errorCount++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it errored, then should it count add to the archivedCount? Same as below with hashUpdatedCount.

archiveUpdate := bson.M{
"$set": bson.M{
"_active": false,
"_archivedTime": time.Now().UnixNano() / int64(time.Millisecond),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is "archivedTime", not "_archivedTime".

deduplicatorUpdate = bson.M{
"$set": bson.M{
"_deduplicator": bson.M{
"hash": CreateHash(jfDatum),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to set the deduplicator name and version? Same below in default clause.

@jh-bate
Copy link
Contributor Author

jh-bate commented Oct 4, 2023

superseded by #665

@jh-bate jh-bate closed this Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants