-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Google Drive] Add Incremental Sync #2679
base: main
Are you sure you want to change the base?
Conversation
buildkite test this |
Hey! Thank you for PR, taking a look! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over the PR and the approach looks good to me! :) The major question is about the query q
used for incremental syncs.
When a file is uploaded from a user's local machine to Google Drive, the file's modified date on Google Drive will be same as the modified date from the local machine. To index these files, the user needs to run a full sync.
That's problematic and I think we should fix that if possible! Could we test any other file properties listed in the docs. See file schema, createdTime
sounds very promising, I tested this property just in UI and the created time reflects the upload time (this is old file uploaded now, that is 2024-07-08):
After the initial full sync retrieves all files and folders from Google Drive, if some files are deleted from Google Drive, running an incremental sync will remove those files from Elasticsearch. If the deleted files are then restored on Google Drive and another incremental sync is run, those files won't be indexed back into Elasticsearch. This is because the Google Drive doesn't update the modifiedTime of the files; it only changes the trashed flag. To restore these files in Elasticsearch, a full sync is required.
I see your point and looking at the file
schema it can be tricky to correctly detect this. I think it's acceptable "edge case" for incremental sync. We can describe this issue in docs.
For the incremental sync, the trashed files (with trashed=true metadata) count always shows in the logs as deleted files since we used trashed=true or modifiedTime > '{last_sync_time}' to fetch files incrementally.
I think this can be fixed by removing trashed=true or
from the query q
and the check the trashed
property while iterating over response.
connectors/sources/google_drive.py
Outdated
if last_sync_time is None: | ||
list_query = "trashed=false" | ||
else: | ||
list_query = f"trashed=true or modifiedTime > '{last_sync_time}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we specify explicitly trashed=true or ...
This would fetch all trashed files in each incremental sync (potentially large number of trashed files).
Could we instead omit trashed
from query q
and just use modifiedTime > '{last_sync_time}'
in query.
And then check if a file was recently trashed (and if so, mark it as deleted) while iterating over response from that query?
connectors/sources/google_drive.py
Outdated
if fetch_permissions: | ||
if fetch_permissions and last_sync_time: | ||
files_fields = DRIVE_ITEMS_FIELDS_WITH_PERMISSIONS | ||
list_query = f"(trashed=true or modifiedTime > '{last_sync_time}') and 'me' in writers" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above about trashed=true
here
connectors/sources/google_drive.py
Outdated
files_fields = DRIVE_ITEMS_FIELDS_WITH_PERMISSIONS | ||
# Google Drive API required write access to fetch file's permissions | ||
list_query = "trashed=false and 'me' in writers" | ||
elif not fetch_permissions and last_sync_time: | ||
files_fields = DRIVE_ITEMS_FIELDS | ||
list_query = f"trashed=true or modifiedTime > '{last_sync_time}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above about trashed=true
here
Yes, i think this should solve our issue. Thanks! |
Google drive does not update the |
Ahh fair, that's problematic.... I confirmed in api explorer (make sure to add
👍 Let's use Edge cases to be added in docs:
|
@jedrazb the PR is updated with addressing your comments, can you please re-review it? |
@moxarth-elastic yes, it's on my TODOs for today! Sorry for delay |
Part Of #2629
Adding a support of incremental sync using list api by passing a query for last updated time.
we've noticed few points while developing this feature, mentioning them below:
When a file is uploaded from a user's local machine to Google Drive, the file's modified date on Google Drive will be same as the modified date from the local machine. To index these files, the user needs to run a full sync.
After the initial full sync retrieves all files and folders from Google Drive, if some files are deleted from Google Drive, running an incremental sync will remove those files from Elasticsearch. If the deleted files are then restored on Google Drive and another incremental sync is run, those files won't be indexed back into Elasticsearch. This is because the Google Drive doesn't update the modifiedTime of the files; it only changes the
trashed
flag. To restore these files in Elasticsearch, a full sync is required.For the incremental sync, the trashed files (with
trashed=true
metadata) count always shows in the logs as deleted files since we usedtrashed=true or modifiedTime > '{last_sync_time}'
to fetch files incrementally. (This can be resolved by updating the framework by checking if the deleted file IDs present in the Elasticsearch docs)Checklists
Pre-Review Checklist
config.yml.example
)v7.13.2
,v7.14.0
,v8.0.0
)