-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move over unstructured parser #31390
Conversation
…yte into flash1293/markdown-parser
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Before Merging a Connector Pull RequestWow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
|
…airbytehq/airbyte into flash1293/move-over-unstructured-parser
def _read_file(self, file_handle: IOBase, file_name: str) -> str: | ||
from unstructured.file_utils.filetype import FileType | ||
from unstructured.partition.auto import partition | ||
from unstructured.partition.md import optional_decode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding these inline imports, are these not on the top of the file because they are extras in the CDK, and because they might not be available at runtime if the extra is not included?
I think the general best practice for these scenarios would be something like this at the top of the file...
try:
import unstructured
from unstructured.file_utils.filetype import FileType
from unstructured.partition.auto import partition
from unstructured.partition.md import optional_decode
except ImportError as ex:
logger.warning(f"Unstructured libraries could not be imported. {ex}")
FileType = None
partition = None
optional_decode = None
Then within the individual functions, we can either assume that the code path will not be traversed without the dependencies, or in certain points we can just check the needed import at top of the function.
def _read_file(self, file_handle: IOBase, file_name: str) -> str:
if unstructured is none:
raise ImportError("Unstructured library is not available. Please install any missing libraries (...) and try again.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are inline because they are quite slow to import - if they are on the top of the file it would slow down the connector even if the parser is not used because python has to load so much code. I can add a comment to explain this (or maybe there is a more elegant way to handle this I’m not aware of)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, what do you think about putting into a helper function called by the class constructor?
Pseudocode, may need adjustment:
...
unstructured_lib = None
def _import_unstructured():
"""Dynamically imported as needed, due to slow import speed."""
global unstructured_lib
import unstructured
unstructured_lib = unstructured
class UnstructuredParser(FileTypeParser):
def __init__():
_import_unstructured()
airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py
Outdated
Show resolved
Hide resolved
Thanks for the review @aaronsteers , I changed the way the dynamic imports are handled, could you take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @flash1293!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Thanks for those changes. 👍
What
Closes #31614
This PR moves the unstructured parser currently in the S3 connector over to the CDK.
This will enable unstructured file extraction for all connectors based on the file based CDK.
hi_res
parsing strategy can't be applied - it will default tofast
and fall back toocr_only
if necessary.Open questions: