Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Google Drive Source #31458

Merged
merged 39 commits into from
Nov 2, 2023
Merged

✨ Google Drive Source #31458

merged 39 commits into from
Nov 2, 2023

Conversation

flash1293
Copy link
Contributor

@flash1293 flash1293 commented Oct 16, 2023

What

Fixes #31615

This PR adds a new source connector for google drive. It is based on the File CDK and allows for all features provided this way.

Features:

  • Authentication via OAuth or Service Account JSON
  • Sync files in specified folder (including subfolders)
  • Google Docs are read as Docx files
  • Other Google office formats (slides, sheets, drawing) are exported as PDF

⚠️ This is not using the smart_open library, instead the whole file is downloaded via the MediaIoBaseDownload class provided by the Google SDK. It should be possible to leverage smart_open (PoC is on a separate PR here: #31866), but I would like to split it out of the initial version ⚠️

⚠️ This is currently only doing file type detection via the file name, I opened https://github.com/airbytehq/airbyte-internal-issues/issues/2219 to track to improve this ⚠️

File paths always start at the defined folder.

How to test

OAuth

OAuth is not super simple to to test. The general steps are defined on this notion page: https://www.notion.so/Testing-OAuth-Locally-fea17aeb14c74cacb5f3ed856daae753

const defaultOssFeatures = [
	// ...Leave all the existing features and add...
	FeatureItem.AllowOAuthConnector
];
  • Re-build platform images ./gradlew assemble
  • Start platform via BASIC_AUTH_USERNAME="" BASIC_AUTH_PASSWORD="" VERSION=dev docker-compose --file ./docker-compose.yaml up
  • Build google drive connector airbyte-ci connectors --name=source-google-drive build
  • Add as custom connector (airbyte/source-google-drive.dev)
  • Run docker exec -ti airbyte-db psql -U docker -d airbyte -c "select id from actor_definition where name like '%Drive%';" to get the source deifnition id
  • Get OAuth client id and client secret from password manager (search for "Google Drive OAuth Test App")
  • Add oauth credentials to local Airbyte instance:
API_URL=localhost:8001/api/v1;

curl -X POST \
   -H 'Content-type: application/json' \
   $API_URL/source_oauths/oauth_params/create \
   -d '{"sourceDefinitionId": "<the id from above>", "params": {"credentials": {"client_id": "___", "client_secret": "___" }}}'
  • Navigate to localhost:8000
  • Create a source

Service account json

  • Get OAuth service account jsonfrom password manager (search for "Google Drive OAuth Test App")
  • Note that this service account will not have access to your Google Drive files (to test with it the folder needs to be public)

@vercel
Copy link

vercel bot commented Oct 16, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 2, 2023 0:24am

@github-actions
Copy link
Contributor

github-actions bot commented Oct 16, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Oct 16, 2023
@flash1293 flash1293 marked this pull request as ready for review October 23, 2023 16:28
@octavia-squidington-iv octavia-squidington-iv requested a review from a team October 23, 2023 16:30
@airbyte-oss-build-runner
Copy link
Collaborator

source-google-drive test report (commit c54003ad48) - ❌

⏲️ Total pipeline duration: 01mn00s

Step Result
Build source-google-drive docker image for platform(s) linux/amd64
Unit tests
Code format checks
Validate metadata for source-google-drive
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-google-drive test

@airbyte-oss-build-runner
Copy link
Collaborator

source-google-drive test report (commit 54157ef2e9) - ❌

⏲️ Total pipeline duration: 28.81s

Step Result
Build source-google-drive docker image for platform(s) linux/amd64
Unit tests
Code format checks
Validate metadata for source-google-drive
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-google-drive test

Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@flash1293
Copy link
Contributor Author

@aaronsteers Worked through the comments, unstructured files are also supported now. One open question is whether we should allow multiple streams or simplify the configuration object to just point to a folder and specify a glob pattern to fetch. What do you think?

@flash1293
Copy link
Contributor Author

@aaronsteers As discussed, I added a check to avoid fetching prefixes in most cases. It's not perfect, (e.g. for subfolder/*.pdf it will still fetch folders within subfolder even though they have no chance of matching), but that seemed like an edge case that's hard to work around, so I will not cover it for now and open a separate issue. Could you take another look please?

Copy link
Collaborator

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a number of questions, most around UX, paths, and globs.

@flash1293 - We have a 1:1 tomorrow so we can also talk through this if helpful and more efficient to process that way.

docs/integrations/sources/google-drive.md Outdated Show resolved Hide resolved

One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.

Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a nit, but this seemed it could be misinterpreted as us passing a PDF document downstream. A bit tricky to explain concisely, but a possible alternate wording might be something like:

Suggested change
Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.

Or maybe this is simpler:

Suggested change
Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
Google documents are processed as Docx files while spreadsheets, presentations and drawings are processed as PDFs.

Totally up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it much better, thanks

Comment on lines 163 to 165
## User Schema

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases but there may be situations you want to enforce a schema instead, e.g.:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a callout, here or in the unstructured doc section, which marks this custom schema section as not applying to unstructured text docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added a callout

Comment on lines 186 to 191
def _get_export_mime_type(self, file: GoogleDriveRemoteFile):
"""
Returns the mime type to export Google App documents as.

Google Docs are exported as Docx to preserve as much formatting as possible, everything else goes through PDF.
"""
Copy link
Collaborator

@aaronsteers aaronsteers Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt of adding this inline for easy reference?

Suggested change
def _get_export_mime_type(self, file: GoogleDriveRemoteFile):
"""
Returns the mime type to export Google App documents as.
Google Docs are exported as Docx to preserve as much formatting as possible, everything else goes through PDF.
"""
def _get_export_mime_type(self, file: GoogleDriveRemoteFile):
"""
Returns the mime type to export Google App documents as.
Google Docs are exported as Docx to preserve as much formatting as possible,
everything else goes through PDF. Supported export formats are documented here:
- https://developers.google.com/drive/api/guides/ref-export-formats
"""

Comment on lines 138 to 141
Let's look at a specific example, matching the following bucket layout:

```text
myBucket
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Let's look at a specific example, matching the following bucket layout:
```text
myBucket
Let's look at a specific example, matching the following drive layout:
```text
MyDrive

Comment on lines 156 to 161
if file["mimeType"] == GOOGLE_DOC_MIME_TYPE:
return f"{file['name']}.docx"
elif self._is_exportable_document(file["mimeType"]):
return f"{file['name']}.pdf"
else:
return file["name"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for us to add the extension? I ask because it is possible for these to create path collisions. Users could have "My File" and "My File.docx" or "My File.pdf" already in the same directory. When dealing with sharing files across Google and non-Google audiences, it does happen to sometimes have the same file twice in a directory, once as Google native format and ones as the more portable non-google format.

Copy link
Collaborator

@aaronsteers aaronsteers Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, we may also want to customize handling for presentations. Similar to the DOCX format, a PPTX format is probably going to have a lower cost parsing for slide decks, but I don't think we need to prioritize it as of yet and there might be some areas where slides are better parsed as PDFs than as slides. 🤷 (Testing would tell.)

A larger issue, I just realized, is Google Sheets. They don't really fit the 'text documents' mold, and it might (?) be preferably to perhaps ignore them when this parser is selected. Presumably later on, this connector could add a "Google Sheets" parser that took the same parse options as Excel and internally converted to Excel before parsing. 🤷

I'm thinking now also about other document types like CSVs - whether they would+should get picked up by the text document parser, or if we should ignore them based on their MIME type when the Document Parser is selected.

I wonder if an efficient way to handle this would be to add a MIME type glob pattern. Traditionally, we'd use file extensions for this, but google docs don't have them, and a more robust include/exclude logic would perhaps necessarily use MIME type or some other means of filtering. (Caveat: this is an area I want to understand well and plan for, but I don't know if it should block merging.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are raising a very good point about robustness of the file detection - what about the following:

There are three "leads" we can leverage:

  • The mime type
  • The file extension
  • The actual file (inspecting the first few bytes) - this works well for docx, pptx and pdf so we should be pretty safe here

I created a PR to add this to the existing unstructured parser in the CDK here: #31997

Once that is merged I can remove the artificial file extensions from this PR and pass along the mime type (S3 and Azure Blob Storage also have a notion of mime type so it should be possible to extend them in a similar way, but as we have the file inspection it shouldn't be strictly necessary - plus it's always good to have another layer to fall back to as not all objects are guaranteed to have this information).

A larger issue, I just realized, is Google Sheets. They don't really fit the 'text documents' mold, and it might (?) be preferably to perhaps ignore them when this parser is selected. Presumably later on, this connector could add a "Google Sheets" parser that took the same parse options as Excel and internally converted to Excel before parsing. 🤷

Yeah, I wasn't sure about that either - we already have a dedicated Sheets connector, maybe that's good enough for now. I'm happy to exclude it for the time being and get back to it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to merge #31997 before this PR is merged? (That seems preferable to me.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would like to take that route.

Comment on lines 165 to 166
if mode == FileReadMode.READ:
raise ValueError("Cannot read Google Docs/Sheets/Presentations and so on as text. Please set the format to PDF")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an instruction for the developer or the user?

If to the developer, I could see the guidance of "please set the format to PDF" being potentially confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for the user, I adjusted the error message to be more clear/actionable

Comment on lines 24 to 29
EXPORTABLE_DOCUMENTS_MIME_TYPES = [
GOOGLE_DOC_MIME_TYPE,
"application/vnd.google-apps.spreadsheet",
"application/vnd.google-apps.presentation",
"application/vnd.google-apps.drawing",
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt of leaving off Google Sheets (application/vnd.google-apps.spreadsheet) for now?

It may be preferable to pass these through the Excel format parser, rather than parsing as text documents.

cc @clnoll

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @aaronsteers, we most certainly will want a separate parser for Excel eventually.

I can definitely see a situation where a user doesn't really care what files are in their folder they just want everything that can be parsed. But, if we wanted to offer that I think maybe it would be best done as an umbrella parser that can call out to other parsers rather than including Excel in with this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the points made here, let's just not process them for now (we have the Sheets parser already anyway)

Comment on lines 284 to 289
"folder_url": {
"title": "Folder Url",
"description": "URL for the folder you want to sync",
"order": 0,
"type": "string"
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand exactly how this global folder URL setting interacts with stream-level globs, and it might be helpful to add specific guidance on this front. In the readme, there's reference to the "bucket" (or "drive") name not needing to be included in the glob pattern, I'm not clear if that would extend to receiving a subfolder here as the base folder URL.

For instance, if folder URL points to the MyDrive/Engineering/Tech Specs folder, would the user provide a glob of ** or MyDrive/Engineering/Tech Specs/** to get all files in that directory?

I think providing just the subfolder glob is probably preferable but either way could work in theory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The glob is just for the subfolder (the specified global folder URL is like the "bucket"). I will clarify this in the documentation

Comment on lines 119 to 123
remote_file = GoogleDriveRemoteFile(
uri=file_name, last_modified=last_modified, id=new_file["id"], mimeType=new_file["mimeType"]
)
if self.file_matches_globs(remote_file, globs):
yield remote_file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following also my comments in _get_file_name(), it doesn't feel right to use our appended suffixes (.docx, .pdf) for documents that don't have those suffixes. Could we remove that suffix without breaking any key functionality?

Also, just to confirm, do we process all files in the included directory, except those excluded by a users' glob pattern? Or do we have some kind of internal filter when using the document parser? I'm a bit worried this could break if a non-readable non-text file (.zip, .ico, .parquet) exists in the directory.

Copy link
Contributor Author

@flash1293 flash1293 Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the other comment, once #31997 is merged we shouldn't need to append suffixes anymore.

Also, just to confirm, do we process all files in the included directory, except those excluded by a users' glob pattern? Or do we have some kind of internal filter when using the document parser? I'm a bit worried this could break if a non-readable non-text file (.zip, .ico, .parquet) exists in the directory.

As the fact whether a file is readable or not is a property of the parser, not the stream reader, this can't be controlled here. See our previous discussion of this over here: #31209 (comment)

Originally I wanted to go with the more lenient behavior of just skipping the file if it can't be read exactly for the reason you stated - if you just want to point to a folder and have it pick up everything that works somewhat, then skipping is probably better because otherwise it's relatively simple to fail the sync / connection check which could lead to an annoying user experience.

However, skipping is also not perfect - we have no good way to indicate what happened in the UI in this kind of situation, so another problematic behavior could be the user pointing to a folder full of files we can't process, the sync running "successfully" with minimal data actually arriving in the destination.

It comes down to what I highlighted already - it's hard to tell what's right as long as people didn't try to actually use it. In this kind of situation, going with the more strict approach is actually not a bad choice because we can relax requirements later on without disturbing existing users while going the opposite route would be a breaking change. I'm OK with going both ways here though - a third way would be to make it a configuration option of the document file type parser ("ignore unparsable files")

A takeaway from this I would like to follow a bit is to extend the protocol with something between a failed and a successful sync (something like "succeeded with warnings" or whatever) that could be used in this situation (the warning could be "encountered 1234 files that couldn't be processed")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of a failure with warnings. That seems like the best of both worlds to me.

@flash1293
Copy link
Contributor Author

@clnoll @aaronsteers Thanks for the review. I made the following changes:

  • Improved docs
  • Improved file type detection logic in CDK in separate PR
  • Removed handling of Google Sheets files

The biggest open question I'm seeing is whether unprocessable files should be ignored or fail the sync or whether this should be configurable by the user.

@octavia-squidington-iv octavia-squidington-iv requested a review from a team November 1, 2023 09:33
@flash1293
Copy link
Contributor Author

@aaronsteers My preferred option would be to publish as is (after the improved file format detection is added) and adding a user option to configure fail vs. skip in the document file format parser in a follow-up PR after the initial release.

Copy link
Collaborator

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all of the changes!

I'm excited to see this ship, and I don't think we need to block for the handling of unhandled file types.

Ideally, I think we'd have:

  1. The parser selection dictates which extensions and/or MIME types are eligible to be processed.
  2. Qualified files which cannot be processed would trigger an error. (E.g. a corrupted PDF would still fail the Text Document parser, since its eligible for parsing and can't be parsed.)
  3. Files which are ineligible for the selected file parser would not be attempted to be processed. For visibility, we can optionally log "skipped" files in connector logs.
  4. Configurability:
    • Config option, disabled by default, to try to parse everything, even ineligible files. (Basically an "opt-in" to fail on non-eligible files existing, but we try anyway.)
    • MIME Type Filers. Either a glob, set of globs, or list of MIME types to include/exclude.
    • Glob filtering. (No change. This already exists, and can be used to filter on file name extensions.)

Above is non-blocking. The above, or some variation or alteration of these, could be implemented in one or more subsequent iterations.

@aaronsteers
Copy link
Collaborator

@aaronsteers My preferred option would be to publish as is (after the improved file format detection is added) and adding a user option to configure fail vs. skip in the document file format parser in a follow-up PR after the initial release.

Sounds great! Added comment above with some ideas on potential configurability.

@flash1293 flash1293 merged commit f487e15 into master Nov 2, 2023
17 checks passed
@flash1293 flash1293 deleted the flash1293/source-google-drive branch November 2, 2023 13:28
@LuongNV1
Copy link

LuongNV1 commented Dec 4, 2023

I want authenticate via google with only button
image

@flash1293
Copy link
Contributor Author

@LuongNV1 This is currently only possible via Airbyte Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/google-drive
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Google Drive source connector
6 participants