Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Google Drive] Add Incremental Sync #2679

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

moxarth-elastic
Copy link
Collaborator

@moxarth-elastic moxarth-elastic commented Jul 6, 2024

Part Of #2629

Adding a support of incremental sync using list api by passing a query for last updated time.

we've noticed few points while developing this feature, mentioning them below:

  1. When a file is uploaded from a user's local machine to Google Drive, the file's modified date on Google Drive will be same as the modified date from the local machine. To index these files, the user needs to run a full sync.

  2. After the initial full sync retrieves all files and folders from Google Drive, if some files are deleted from Google Drive, running an incremental sync will remove those files from Elasticsearch. If the deleted files are then restored on Google Drive and another incremental sync is run, those files won't be indexed back into Elasticsearch. This is because the Google Drive doesn't update the modifiedTime of the files; it only changes the trashed flag. To restore these files in Elasticsearch, a full sync is required.

  3. For the incremental sync, the trashed files (with trashed=true metadata) count always shows in the logs as deleted files since we used trashed=true or modifiedTime > '{last_sync_time}' to fetch files incrementally. (This can be resolved by updating the framework by checking if the deleted file IDs present in the Elasticsearch docs)

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference
  • if you added or changed Rich Configurable Fields for a Native Connector, you made a corresponding PR in Kibana

@moxarth-elastic
Copy link
Collaborator Author

buildkite test this

@jedrazb
Copy link
Member

jedrazb commented Jul 8, 2024

Hey! Thank you for PR, taking a look! :)

Copy link
Member

@jedrazb jedrazb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over the PR and the approach looks good to me! :) The major question is about the query q used for incremental syncs.

When a file is uploaded from a user's local machine to Google Drive, the file's modified date on Google Drive will be same as the modified date from the local machine. To index these files, the user needs to run a full sync.

That's problematic and I think we should fix that if possible! Could we test any other file properties listed in the docs. See file schema, createdTime sounds very promising, I tested this property just in UI and the created time reflects the upload time (this is old file uploaded now, that is 2024-07-08):
Screenshot 2024-07-08 at 10 16 31

After the initial full sync retrieves all files and folders from Google Drive, if some files are deleted from Google Drive, running an incremental sync will remove those files from Elasticsearch. If the deleted files are then restored on Google Drive and another incremental sync is run, those files won't be indexed back into Elasticsearch. This is because the Google Drive doesn't update the modifiedTime of the files; it only changes the trashed flag. To restore these files in Elasticsearch, a full sync is required.

I see your point and looking at the file schema it can be tricky to correctly detect this. I think it's acceptable "edge case" for incremental sync. We can describe this issue in docs.

For the incremental sync, the trashed files (with trashed=true metadata) count always shows in the logs as deleted files since we used trashed=true or modifiedTime > '{last_sync_time}' to fetch files incrementally.

I think this can be fixed by removing trashed=true or from the query q and the check the trashed property while iterating over response.

if last_sync_time is None:
list_query = "trashed=false"
else:
list_query = f"trashed=true or modifiedTime > '{last_sync_time}'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we specify explicitly trashed=true or ...

This would fetch all trashed files in each incremental sync (potentially large number of trashed files).

Could we instead omit trashed from query q and just use modifiedTime > '{last_sync_time}' in query.

And then check if a file was recently trashed (and if so, mark it as deleted) while iterating over response from that query?

if fetch_permissions:
if fetch_permissions and last_sync_time:
files_fields = DRIVE_ITEMS_FIELDS_WITH_PERMISSIONS
list_query = f"(trashed=true or modifiedTime > '{last_sync_time}') and 'me' in writers"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above about trashed=true here

files_fields = DRIVE_ITEMS_FIELDS_WITH_PERMISSIONS
# Google Drive API required write access to fetch file's permissions
list_query = "trashed=false and 'me' in writers"
elif not fetch_permissions and last_sync_time:
files_fields = DRIVE_ITEMS_FIELDS
list_query = f"trashed=true or modifiedTime > '{last_sync_time}'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above about trashed=true here

@moxarth-elastic
Copy link
Collaborator Author

That's problematic and I think we should fix that if possible! Could we test any other file properties listed in the docs. See file schema, createdTime sounds very promising, I tested this property just in UI and the created time reflects the upload time (this is old file uploaded now, that is 2024-07-08)

Yes, i think this should solve our issue. Thanks!

@moxarth-elastic
Copy link
Collaborator Author

I think this can be fixed by removing trashed=true or from the query q and the check the trashed property while iterating over response.

Google drive does not update the modifiedTime if we delete/restore the file so this does not change anything. Also, for the shared files, we get trashedTime as a metadata so in that case we can check if the trashedTime is newer than the cursor time and overcome this issue.

@jedrazb
Copy link
Member

jedrazb commented Jul 9, 2024

Google drive does not update the modifiedTime if we delete/restore the file so this does not change anything.

Ahh fair, that's problematic.... I confirmed in api explorer (make sure to add files to fields parameter to see everything returned from the API)

Also, for the shared files, we get trashedTime as a metadata so in that case we can check if the trashedTime is newer than the cursor time and overcome this issue.

👍 Let's use trashedTime for shared drives. For personal drives, I agree that there is no other way to "detect" deleted files so it should be fine to keep trashed=true part in the query. The trashed files should be auto-deleted anyway after 30 days so this won't be growing forever.

Edge cases to be added in docs:

  • special case with deleted filed in personal drives (we detect delete correctly, just same ids can be deleted in subsequent syncs)
  • special case with delete/restore

@moxarth-elastic
Copy link
Collaborator Author

@jedrazb the PR is updated with addressing your comments, can you please re-review it?

@jedrazb
Copy link
Member

jedrazb commented Jul 15, 2024

@moxarth-elastic yes, it's on my TODOs for today! Sorry for delay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants