feat: allow vectorizer to read documents via smart-open parse via pymupdf #431
Conversation
Chunking needs special logic for document processing: we're no longer expecting a column name that exists in the source table; instead, what gets chunked is the content of a file stored in cloud object storage. Co-authored-by: Sergio Moya <[email protected]>
fixes some tests while working on the new chunking strategy. Co-authored-by: Adol Rodriguez <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
Co-authored-by: Sergio Moya <[email protected]>
pgai v0.8.0 was released so it's time to rebase this branch Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
Disable ollama tests and include new content in expected files. Remove unneeded asserts from the vectorizer test and add a statement timeout to speed up feedback on transient local failures.
…orage and parse it
Force-pushed 577957e to 220aafa, 9af8756 to f0b2e2b, and f0b2e2b to f364ca6.
if self.vectorizer.config.loader is not None:
    file_content = self.vectorizer.config.loader.load(item)
    markdown_content = self.vectorizer.config.parser.parse_file_content(
        file_content
    )
    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
else:
    chunks = self.vectorizer.config.chunking.into_chunks(item)
for chunk_id, chunk in enumerate(chunks, 0):
    formatted = self.vectorizer.config.formatting.format(chunk, item)
I think it may be best to conceive of this as a processing pipeline where most processing nodes alter the row in place.
Suggested change — from:

if self.vectorizer.config.loader is not None:
    file_content = self.vectorizer.config.loader.load(item)
    markdown_content = self.vectorizer.config.parser.parse_file_content(
        file_content
    )
    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
else:
    chunks = self.vectorizer.config.chunking.into_chunks(item)
for chunk_id, chunk in enumerate(chunks, 0):
    formatted = self.vectorizer.config.formatting.format(chunk, item)

to:

if self.vectorizer.config.loader is not None:
    # loader reads a "url" column and loads binary data into item['file_contents']
    self.vectorizer.config.loader.load(item)
if self.vectorizer.config.parser is not None:
    # reads some column (default: file_contents) and puts result in item['parsed_file_content']
    self.vectorizer.config.parser.parse_file_content(item)
chunks = self.vectorizer.config.chunking.into_chunks(item)
for chunk_id, chunk in enumerate(chunks, 0):
    formatted = self.vectorizer.config.formatting.format(chunk, item)
    records_without_embeddings.append(pk + [chunk_id, formatted])
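The row-mutating pipeline the suggestion describes can be sketched with stub stages. All class names and dict keys below are hypothetical stand-ins, not pgai's actual API; each stage reads and writes keys on the same `item` dict, so adding or removing a stage never changes the shape of the calls around it.

```python
# Hypothetical sketch of the suggested pipeline: every node mutates `item`.

class Loader:
    def load(self, item):
        # would use smart-open against item["url"]; stubbed with fixed bytes here
        item["file_contents"] = b"# Title\n\nBody text."

class Parser:
    def parse_file_content(self, item):
        # would use pymupdf for PDFs; here we just decode bytes to text
        item["parsed_file_content"] = item["file_contents"].decode("utf-8")

class Chunker:
    def into_chunks(self, item):
        # prefer parsed file content, fall back to a plain row column
        text = item.get("parsed_file_content") or item["content"]
        return [c for c in text.split("\n\n") if c]

def process(item, loader=None, parser=None, chunker=Chunker()):
    if loader is not None:
        loader.load(item)                 # item["url"] -> item["file_contents"]
    if parser is not None:
        parser.parse_file_content(item)   # -> item["parsed_file_content"]
    return list(enumerate(chunker.into_chunks(item)))

print(process({"url": "s3://bucket/doc.md"}, loader=Loader(), parser=Parser()))
# [(0, '# Title'), (1, 'Body text.')]
print(process({"content": "a\n\nb"}))
# [(0, 'a'), (1, 'b')]
```

Note how the no-loader path needs no `else` branch anymore: the chunker always receives the same `item`, which is the point of the suggestion.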
-- loader_file_loader
create or replace function ai.loader_from_document
Naming considerations:
- Our other configs for vectorizer end in `ing`: `embedding`, `formatting`, `chunking`. Should we follow the same convention?
- Would `object_storage` be more appropriate than `document`? We should describe "where" instead of "what". We may load documents from other places in the future.
-- parser_auto
create or replace function ai.parser_auto() returns pg_catalog.jsonb
`parsing` vs `parser` to follow the existing convention?
@@ -2,7 +2,7 @@
 -------------------------------------------------------------------------------
 -- chunking_character_text_splitter
 create or replace function ai.chunking_character_text_splitter
-( chunk_column pg_catalog.name
+( chunk_column pg_catalog.name default ''
I think this column will go away completely once we have a loader_row configuration, but in case it doesn't, the default should be null rather than ''.
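The null-vs-empty-string distinction matters wherever the worker resolves what to chunk: with a NULL default, "no chunk column configured" is a real absent value rather than a magic string. A hypothetical Python-side sketch (function and key names are illustrative, not pgai's API):

```python
def resolve_chunk_source(item: dict, chunk_column):
    # NULL maps to Python None, so "no column configured" is unambiguous
    # and we can fall through to the parser's output.
    if chunk_column is None:
        return item["parsed_file_content"]
    # With a default of '' we would instead need a special-case string
    # comparison here, conflating "unset" with a (bogus) column name.
    return item[chunk_column]

print(resolve_chunk_source({"parsed_file_content": "doc text"}, None))  # doc text
print(resolve_chunk_source({"body": "row text"}, "body"))               # row text
```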
Closing in favor of the WIP work in #442 |
Uses our agreed-upon API spec number 4. Works end to end for local files and s3 (credential loading also works; see tests).
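For a sense of what "works end to end for local files and s3" involves on the loading side: smart-open dispatches on the URL scheme and handles both transparently. The stdlib-only stand-in below (hypothetical names, local files only) shows the shape of that dispatch without depending on smart-open or AWS credentials:

```python
from urllib.parse import urlparse
import pathlib
import tempfile

def load_document(url: str) -> bytes:
    """Return raw bytes for a local path or file:// URL.

    In the PR, smart-open also covers s3:// (including credential loading);
    this stand-in deliberately handles only the local case.
    """
    parsed = urlparse(url)
    if parsed.scheme in ("", "file"):
        path = parsed.path if parsed.scheme == "file" else url
        return pathlib.Path(path).read_bytes()
    raise NotImplementedError(f"scheme {parsed.scheme!r} needs smart-open (e.g. s3)")

# usage: write a temp file and load it back through the loader
with tempfile.NamedTemporaryFile(suffix=".md", delete=False) as f:
    f.write(b"hello")
print(load_document(f.name))  # b'hello'
```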
What's missing?