✨ Google Drive Source #31458
Conversation
Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:

If the checklist is complete, but the CI check is failing, …
source-google-drive test report (commit …)
| Step | Result |
|---|---|
| Build source-google-drive docker image for platform(s) linux/amd64 | ✅ |
| Unit tests | ❌ |
| Code format checks | ❌ |
| Validate metadata for source-google-drive | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool, using the following command:

```bash
airbyte-ci connectors --name=source-google-drive test
```
LGTM!
@aaronsteers Worked through the comments, unstructured files are also supported now. One open question is whether we should allow multiple streams or simplify the configuration object to just point to a folder and specify a glob pattern to fetch. What do you think?
@aaronsteers As discussed, I added a check to avoid fetching prefixes in most cases. It's not perfect (e.g. for …).
I've added a number of questions, most around UX, paths, and globs.
@flash1293 - We have a 1:1 tomorrow so we can also talk through this if helpful and more efficient to process that way.
> One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination, as each destination has different limitations for string fields.

> Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
Probably a nit, but this seemed like it could be misinterpreted as us passing a PDF document downstream. It's a bit tricky to explain concisely, but a possible alternate wording might be something like:
```diff
-Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
+Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
```
Or maybe this is simpler:
```diff
-Google documents are exported as Docx files while spreadsheets, presentations and drawings are exported as PDF files.
+Google documents are processed as Docx files while spreadsheets, presentations and drawings are processed as PDFs.
```
Totally up to you.
I like it much better, thanks
> ## User Schema
>
> Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
Do we need a callout, here or in the unstructured doc section, which marks this custom schema section as not applying to unstructured text docs?
Good point, added a callout
```python
def _get_export_mime_type(self, file: GoogleDriveRemoteFile):
    """
    Returns the mime type to export Google App documents as.

    Google Docs are exported as Docx to preserve as much formatting as possible, everything else goes through PDF.
    """
```
Wdyt of adding this inline for easy reference?
```diff
 def _get_export_mime_type(self, file: GoogleDriveRemoteFile):
     """
     Returns the mime type to export Google App documents as.
-    Google Docs are exported as Docx to preserve as much formatting as possible, everything else goes through PDF.
+    Google Docs are exported as Docx to preserve as much formatting as possible,
+    everything else goes through PDF. Supported export formats are documented here:
+    - https://developers.google.com/drive/api/guides/ref-export-formats
     """
```
> Let's look at a specific example, matching the following bucket layout:
>
> ```text
> myBucket
> ```
````diff
-Let's look at a specific example, matching the following bucket layout:
+Let's look at a specific example, matching the following drive layout:
 ```text
-myBucket
+MyDrive
````
```python
if file["mimeType"] == GOOGLE_DOC_MIME_TYPE:
    return f"{file['name']}.docx"
elif self._is_exportable_document(file["mimeType"]):
    return f"{file['name']}.pdf"
else:
    return file["name"]
```
Is it necessary for us to add the extension? I ask because it is possible for these to create path collisions. Users could have "My File" and "My File.docx" or "My File.pdf" already in the same directory. When sharing files across Google and non-Google audiences, it does sometimes happen that the same file exists twice in a directory, once in the Google-native format and once in the more portable non-Google format.
In the future, we may also want to customize handling for presentations. Similar to the DOCX format, a PPTX format would probably give lower-cost parsing for slide decks, but I don't think we need to prioritize it yet, and there might be some areas where slides are better parsed as PDFs than as slides. 🤷 (Testing would tell.)

A larger issue, I just realized, is Google Sheets. They don't really fit the 'text documents' mold, and it might (?) be preferable to ignore them when this parser is selected. Presumably later on, this connector could add a "Google Sheets" parser that takes the same parse options as Excel and internally converts to Excel before parsing. 🤷

I'm also now thinking about other document types like CSVs: whether they would and should get picked up by the text document parser, or whether we should ignore them based on their MIME type when the Document Parser is selected.

I wonder if an efficient way to handle this would be to add a MIME type glob pattern. Traditionally, we'd use file extensions for this, but Google docs don't have them, and a more robust include/exclude logic would perhaps necessarily use the MIME type or some other means of filtering. (Caveat: this is an area I want to understand well and plan for, but I don't know if it should block merging.)
You are raising a very good point about the robustness of the file detection. What about the following?

There are three "leads" we can leverage:
- The MIME type
- The file extension
- The actual file (inspecting the first few bytes) - this works well for docx, pptx and pdf, so we should be pretty safe here

I created a PR to add this to the existing unstructured parser in the CDK here: #31997

Once that is merged, I can remove the artificial file extensions from this PR and pass along the MIME type. (S3 and Azure Blob Storage also have a notion of MIME type, so it should be possible to extend them in a similar way, but since we have the file inspection it shouldn't be strictly necessary; plus it's always good to have another layer to fall back on, as not all objects are guaranteed to have this information.)
> A larger issue, I just realized, is Google Sheets. They don't really fit the 'text documents' mold, and it might (?) be preferable to ignore them when this parser is selected. Presumably later on, this connector could add a "Google Sheets" parser that takes the same parse options as Excel and internally converts to Excel before parsing. 🤷

Yeah, I wasn't sure about that either - we already have a dedicated Sheets connector, maybe that's good enough for now. I'm happy to exclude it for the time being and get back to it later.
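The "inspecting the first few bytes" lead is essentially magic-number sniffing. A minimal illustrative sketch (not the actual CDK change from #31997):

```python
def sniff_format(first_bytes: bytes) -> str | None:
    """Guess a file's format from its leading magic bytes (illustrative only)."""
    if first_bytes.startswith(b"%PDF"):
        return "pdf"
    if first_bytes.startswith(b"PK\x03\x04"):
        # docx and pptx are both ZIP containers; telling them apart requires
        # inspecting the archive contents (e.g. word/document.xml for docx,
        # ppt/presentation.xml for pptx).
        return "office-open-xml"
    return None
```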
Are you planning to merge #31997 before this PR is merged? (That seems preferable to me.)
Yes, I would like to take that route.
```python
if mode == FileReadMode.READ:
    raise ValueError("Cannot read Google Docs/Sheets/Presentations and so on as text. Please set the format to PDF")
```
Is this an instruction for the developer or the user?
If it's for the developer, I could see the guidance of "please set the format to PDF" being potentially confusing.
It's for the user; I adjusted the error message to be clearer and more actionable.
```python
EXPORTABLE_DOCUMENTS_MIME_TYPES = [
    GOOGLE_DOC_MIME_TYPE,
    "application/vnd.google-apps.spreadsheet",
    "application/vnd.google-apps.presentation",
    "application/vnd.google-apps.drawing",
]
```
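For reference, `_is_exportable_document` (used in the file-name snippet earlier) is presumably just a membership check against this list; a sketch, not the connector's verbatim code:

```python
def _is_exportable_document(self, mime_type: str) -> bool:
    # Sketch: treat a file as an exportable Google App document if its MIME
    # type is in the allow-list above.
    return mime_type in EXPORTABLE_DOCUMENTS_MIME_TYPES
```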
Wdyt of leaving off Google Sheets (`application/vnd.google-apps.spreadsheet`) for now?

It may be preferable to pass these through the Excel format parser, rather than parsing as text documents.

cc @clnoll
Good point @aaronsteers, we most certainly will want a separate parser for Excel eventually.

I can definitely see a situation where a user doesn't really care what files are in their folder; they just want everything that can be parsed. But if we wanted to offer that, I think it would be best done as an umbrella parser that can call out to other parsers, rather than including Excel in with this one.
Agree with the points made here, let's just not process them for now (we have the Sheets parser already anyway)
"folder_url": { | ||
"title": "Folder Url", | ||
"description": "URL for the folder you want to sync", | ||
"order": 0, | ||
"type": "string" | ||
}, |
I am not sure I understand exactly how this global folder URL setting interacts with stream-level globs, and it might be helpful to add specific guidance on this front. In the readme, there's a reference to the "bucket" (or "drive") name not needing to be included in the glob pattern; I'm not clear whether that extends to receiving a subfolder here as the base folder URL.

For instance, if the folder URL points to the `MyDrive/Engineering/Tech Specs` folder, would the user provide a glob of `**` or `MyDrive/Engineering/Tech Specs/**` to get all files in that directory?

I think providing just the subfolder glob is probably preferable, but either way could work in theory.
The glob is just for the subfolder (the specified global folder URL is like the "bucket"). I will clarify this in the documentation.
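To illustrate the intended behavior, here is a hypothetical config sketch; the stream-level field names follow the file-based CDK's usual shape and are assumptions, not the connector's verbatim spec:

```python
# The folder URL acts like the "bucket"; globs are resolved relative to it.
config = {
    # Hypothetical folder URL pointing at MyDrive/Engineering/Tech Specs:
    "folder_url": "https://drive.google.com/drive/folders/<folder-id>",
    "streams": [
        {
            "name": "tech_specs",
            # Matches all files under the configured folder; no
            # "MyDrive/Engineering/Tech Specs/" prefix is needed:
            "globs": ["**"],
        }
    ],
}
```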
```python
remote_file = GoogleDriveRemoteFile(
    uri=file_name, last_modified=last_modified, id=new_file["id"], mimeType=new_file["mimeType"]
)
if self.file_matches_globs(remote_file, globs):
    yield remote_file
```
Following on from my comments in `_get_file_name()`, it doesn't feel right to use our appended suffixes (.docx, .pdf) for documents that don't have those suffixes. Could we remove that suffix without breaking any key functionality?

Also, just to confirm: do we process all files in the included directory, except those excluded by a user's glob pattern? Or do we have some kind of internal filter when using the document parser? I'm a bit worried this could break if a non-readable, non-text file (.zip, .ico, .parquet) exists in the directory.
As mentioned in the other comment, once #31997 is merged we shouldn't need to append suffixes anymore.

> Also, just to confirm: do we process all files in the included directory, except those excluded by a user's glob pattern? Or do we have some kind of internal filter when using the document parser? I'm a bit worried this could break if a non-readable, non-text file (.zip, .ico, .parquet) exists in the directory.

Since whether a file is readable is a property of the parser, not the stream reader, this can't be controlled here. See our previous discussion of this over here: #31209 (comment)

Originally I wanted to go with the more lenient behavior of just skipping a file that can't be read, exactly for the reason you stated: if you just want to point at a folder and have it pick up everything that works, skipping is probably better, because otherwise it's relatively easy to fail the sync / connection check, which could lead to an annoying user experience.

However, skipping is also not perfect. We have no good way to indicate what happened in the UI in this kind of situation, so another problematic behavior could be the user pointing at a folder full of files we can't process and the sync running "successfully" while minimal data actually arrives in the destination.

It comes down to what I highlighted already: it's hard to tell what's right until people have actually tried to use it. In this kind of situation, going with the stricter approach is not a bad choice, because we can relax requirements later on without disturbing existing users, while going the opposite route would be a breaking change. I'm OK with going either way here, though; a third option would be to make it a configuration option of the document file type parser ("ignore unparsable files").

A takeaway I would like to follow up on is to extend the protocol with something between a failed and a successful sync (something like "succeeded with warnings") that could be used in this situation (the warning could be "encountered 1234 files that couldn't be processed").
I like the idea of a failure with warnings. That seems like the best of both worlds to me.
@clnoll @aaronsteers Thanks for the review. I made the following changes:

The biggest open question I'm seeing is whether unprocessable files should be ignored or fail the sync, or whether this should be configurable by the user.
@aaronsteers My preferred option would be to publish as is (after the improved file format detection is added) and add a user option to configure fail vs. skip in the document file format parser in a follow-up PR after the initial release.
Thanks for all of the changes!

I'm excited to see this ship, and I don't think we need to block on the handling of unhandled file types.

Ideally, I think we'd have:
- The parser selection dictates which extensions and/or MIME types are eligible to be processed.
- Qualified files which cannot be processed would trigger an error. (E.g. a corrupted PDF would still fail the Text Document parser, since it's eligible for parsing and can't be parsed.)
- Files which are ineligible for the selected file parser would not be processed at all. For visibility, we could optionally log "skipped" files in connector logs.
- Configurability:
  - A config option, disabled by default, to try to parse everything, even ineligible files. (Basically an "opt-in" to fail on non-eligible files existing, but we try anyway.)
  - MIME type filters: either a glob, a set of globs, or a list of MIME types to include/exclude. (See the sketch after this list.)
  - Glob filtering. (No change; this already exists and can be used to filter on file name extensions.)

The above is non-blocking. These points, or some variation or alteration of them, could be implemented in one or more subsequent iterations.
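As a purely hypothetical sketch of the MIME-type include/exclude idea (no such option exists in the connector yet; the function name and config shape are assumptions):

```python
from fnmatch import fnmatch

def is_eligible(mime_type: str, include: list[str], exclude: list[str]) -> bool:
    # A file is eligible when its MIME type matches at least one include
    # pattern and no exclude pattern.
    included = any(fnmatch(mime_type, pattern) for pattern in include)
    excluded = any(fnmatch(mime_type, pattern) for pattern in exclude)
    return included and not excluded

# Example: process Google App documents and PDFs, but skip spreadsheets.
is_eligible(
    "application/vnd.google-apps.spreadsheet",
    include=["application/vnd.google-apps.*", "application/pdf"],
    exclude=["application/vnd.google-apps.spreadsheet"],
)  # -> False
```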
Sounds great! Added a comment above with some ideas on potential configurability.
@LuongNV1 This is currently only possible via Airbyte Cloud
What
Fixes #31615
This PR adds a new source connector for Google Drive. It is based on the File CDK and supports all features provided that way.

Features:
- Files are downloaded via the `MediaIoBaseDownload` class provided by the Google SDK. It should be possible to leverage smart_open (a PoC is in a separate PR here: #31866), but I would like to split that out of the initial version.
- File paths always start at the defined folder.
How to test
OAuth
OAuth is not super simple to test. The general steps are defined on this notion page: https://www.notion.so/Testing-OAuth-Locally-fea17aeb14c74cacb5f3ed856daae753

1. `cd oss`
2. In the `airbyte-webapp` project, enable OAuth under `oss/airbyte-webapp/src/core/services/features/constants.ts`
3. `./gradlew assemble`
4. `BASIC_AUTH_USERNAME="" BASIC_AUTH_PASSWORD="" VERSION=dev docker-compose --file ./docker-compose.yaml up`
5. `airbyte-ci connectors --name=source-google-drive build` (this builds the `airbyte/source-google-drive.dev` image)
6. `docker exec -ti airbyte-db psql -U docker -d airbyte -c "select id from actor_definition where name like '%Drive%';"` to get the source definition id
7. Go to `localhost:8000`
Service account json