[Google Drive] Proper support of incremental syncs #2629

Open
4 tasks
jedrazb opened this issue Jun 11, 2024 · 0 comments

jedrazb commented Jun 11, 2024

Problem Description

Right now, the "incremental sync" of Google Drive falls back to the default naive incremental sync implementation, which still has to fetch the metadata of every document and only allows skipping downloads of files that already exist.

Google Drive incremental syncs do not use a "delta API" that would allow fetching only the documents that changed since the last sync. For example, filtering documents at the source with q=modifiedTime > syncCursor would result in much less file metadata to fetch and process during incremental syncs, and would likely make those syncs much shorter.

Proposed Solution

  • implement a get_docs_incrementally function
    • it will largely stay the same as get_docs
    • the main difference is passing a query q to the list_files and list_files_from_my_drive functions to filter docs that were modified since the last sync (the last sync timestamp will be stored in sync_cursor) - see the list API docs
    • keep the last sync timestamp in a sync_cursor
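
The query-building part of the steps above could look roughly like the following. This is a minimal sketch, not the connector's actual code: the cursor key `last_sync_timestamp`, the helper names `build_incremental_query` and `next_sync_cursor`, and the `trashed=false` base filter are all assumptions made for illustration. It relies only on the Drive `files.list` query syntax, where `modifiedTime` is compared against an RFC 3339 timestamp (UTC when no offset is given).

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical key under which the last sync timestamp is kept in sync_cursor.
CURSOR_KEY = "last_sync_timestamp"


def build_incremental_query(
    sync_cursor: Optional[dict], base_query: str = "trashed=false"
) -> str:
    """Return a Drive `q` string limited to files modified after the cursor.

    Without a cursor (e.g. the first sync) only the base filter is returned,
    so the caller effectively performs a full listing.
    """
    if not sync_cursor or CURSOR_KEY not in sync_cursor:
        return base_query
    return f"{base_query} and modifiedTime > '{sync_cursor[CURSOR_KEY]}'"


def next_sync_cursor(now: Optional[datetime] = None) -> dict:
    """Build the cursor to persist once the current sync has finished."""
    now = now or datetime.now(timezone.utc)
    # Normalize to UTC and drop the offset; Drive treats such values as UTC.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return {CURSOR_KEY: stamp}
```

The resulting string would then be passed as `q` to list_files and list_files_from_my_drive. Taking the timestamp at the start of the sync rather than at the end is probably the safer choice, since files modified while the sync is running would otherwise be missed.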

Once we have a "smart" implementation of incremental syncs, we can expect a big speedup of incremental syncs for massive datasets.

Open questions

  • How to track deletes?
    • Probably we can use trashed=true together with "some" time property in the query to detect recently deleted docs - more investigation needed
      • According to the docs, trashedTime is populated only for files in a shared drive
      • For personal drives, perhaps modifiedTime is sufficient (check whether modifiedTime can also be used for shared drives)
    • Make sure we can mark a doc's operation as deleted in the get_docs_incrementally function to signal to ES that the doc should be deleted - see the get_docs_incrementally docs
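
One way the delete-tracking idea above might be expressed is sketched below. It is purely illustrative and leaves the open questions open: it assumes that trashing a file changes its modifiedTime, which is exactly what still needs to be verified, and it only covers files that are in the trash. The operation constants, the helper names and the `_id`-only shape of a delete document are hypothetical stand-ins for whatever the connector framework actually expects.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical operation markers; the real framework may use other names/values.
OP_INDEX = "index"
OP_DELETE = "delete"


def build_deleted_query(cursor_timestamp: str) -> str:
    """Drive `q` string for files trashed (assumed: modified) after the cursor."""
    return f"trashed=true and modifiedTime > '{cursor_timestamp}'"


def to_operations(files: Iterable[dict]) -> Iterator[Tuple[dict, str]]:
    """Map Drive file metadata to (doc, operation) pairs.

    Trashed files become delete operations carrying only the document id;
    everything else is (re)indexed.
    """
    for file in files:
        if file.get("trashed"):
            yield {"_id": file["id"]}, OP_DELETE
        else:
            doc = {
                "_id": file["id"],
                "name": file.get("name"),
                "_timestamp": file.get("modifiedTime"),
            }
            yield doc, OP_INDEX
```

For shared drives, the time property in `build_deleted_query` could be swapped for trashedTime if the investigation shows that it is the more reliable signal there.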

Additional Context

  • It would be great to benchmark the new implementation to get a rough estimate of the speedup
    • You can ping @jedrazb once the feature branch is ready; I have a dataset of 80k files we can run the benchmark against

Acceptance criteria

  • Verify that it works with a personal drive
    • Detect doc updates, inserts and deletes
  • Verify that it works with a shared drive
    • Detect doc updates, inserts and deletes
@jedrazb jedrazb added the enhancement New feature or request label Jun 11, 2024
@jedrazb jedrazb changed the title from "[Google Drive] Support smart incremental syncs" to "[Google Drive] Proper support of incremental syncs" Jun 11, 2024