
feat: allow vectorizer to read documents via smart-open, parse via pymupdf #431

Closed

Conversation

@Askir Askir commented Feb 5, 2025

Uses our agreed-upon API spec, option 4.
Works end to end for local files and S3 (credential loading also works; see tests).

What's missing?

  • Validation in create_vectorizer (currently there is no check that the chunking column is empty when a file parser is configured, so configuring one crashes the vectorizer; see the sketch after this list)
  • Docs
  • Fixing the tests in the extension (I did the minimal thing to get the vectorizer working)
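
A rough sketch of what that missing check could look like (the function and attribute names here are assumptions, not code from this PR):

def validate_config(config) -> None:
    # A document loader implies the text comes from the parsed file, not from
    # a source-table column, so a configured chunk_column is a misconfiguration.
    if config.loader is not None and getattr(config.chunking, "chunk_column", None):
        raise ValueError(
            "chunking must not reference a source-table column when a file "
            "loader/parser is configured; the parsed file content is chunked instead"
        )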

adolsalamanca and others added 9 commits February 5, 2025 13:37
chunking needs special logic for document processing: we are no longer
expecting a column name that exists in the source table; instead, what gets
chunked is the content of the file stored in cloud object storage.

Co-authored-by: Sergio Moya <[email protected]>
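
As a rough illustration of that special logic (class and key names are hypothetical, not this PR's code), the chunker can fall back to the parsed document content when no source-table column is configured:

class CharacterTextSplitter:
    # chunk_column is None/empty for document vectorizers: the text to split
    # then comes from the parsed file content placed on the row upstream.
    def __init__(self, chunk_column: str | None, chunk_size: int = 800):
        self.chunk_column = chunk_column
        self.chunk_size = chunk_size

    def into_chunks(self, item: dict) -> list[str]:
        text = item[self.chunk_column] if self.chunk_column else item["parsed_file_content"]
        return [text[i : i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]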
fixes some tests while working on the new chunking strategy.

Co-authored-by: Adol Rodriguez <[email protected]>
pgai v0.8.0 was released, so it's time to rebase this branch

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
disable ollama tests and include new content in expected files.
remove unneeded asserts from the vectorizer test and add a statement timeout
to speed up feedback on transient local failures.
@Askir Askir changed the base branch from main to s3-integration-feature-branch February 5, 2025 17:54
@Askir Askir marked this pull request as ready for review February 5, 2025 18:11
@Askir Askir requested a review from a team as a code owner February 5, 2025 18:11
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from 577957e to 220aafa on February 5, 2025 18:13
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from 9af8756 to f0b2e2b on February 5, 2025 19:00
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from f0b2e2b to f364ca6 on February 5, 2025 19:41
Comment on lines +771 to 780
if self.vectorizer.config.loader is not None:
    file_content = self.vectorizer.config.loader.load(item)
    markdown_content = self.vectorizer.config.parser.parse_file_content(
        file_content
    )
    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
else:
    chunks = self.vectorizer.config.chunking.into_chunks(item)
for chunk_id, chunk in enumerate(chunks, 0):
    formatted = self.vectorizer.config.formatting.format(chunk, item)
Collaborator

I think it may be best to conceive of this as a processing pipeline where most processing nodes alter the row

Suggested change
-if self.vectorizer.config.loader is not None:
-    file_content = self.vectorizer.config.loader.load(item)
-    markdown_content = self.vectorizer.config.parser.parse_file_content(
-        file_content
-    )
-    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
-else:
-    chunks = self.vectorizer.config.chunking.into_chunks(item)
-for chunk_id, chunk in enumerate(chunks, 0):
-    formatted = self.vectorizer.config.formatting.format(chunk, item)
+if self.vectorizer.config.loader is not None:
+    # loader reads a "url" column and loads binary data into item['file_contents']
+    self.vectorizer.config.loader.load(item)
+if self.vectorizer.config.parser is not None:
+    # reads some column (default: file_contents) and puts result in item['parsed_file_content']
+    self.vectorizer.config.parser.parse_file_content(item)
+chunks = self.vectorizer.config.chunking.into_chunks(item)
+for chunk_id, chunk in enumerate(chunks, 0):
+    formatted = self.vectorizer.config.formatting.format(chunk, item)
+    records_without_embeddings.append(pk + [chunk_id, formatted])
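
To make that pipeline model concrete, here is a minimal sketch of row-mutating nodes (the interfaces and class names are hypothetical, not this PR's actual code); smart_open and pymupdf are the libraries named in the PR title:

from typing import Protocol

import pymupdf
import smart_open

class PipelineNode(Protocol):
    # Each node reads keys from the row dict and writes its result back onto it.
    def process(self, item: dict) -> None: ...

class SmartOpenLoader:
    def process(self, item: dict) -> None:
        # Reads item["url"] (a local path or s3:// URI) and stores the raw bytes.
        with smart_open.open(item["url"], "rb") as f:
            item["file_contents"] = f.read()

class PyMuPDFParser:
    def process(self, item: dict) -> None:
        # Reads the loaded bytes and stores the extracted text on the row.
        with pymupdf.open(stream=item["file_contents"], filetype="pdf") as doc:
            item["parsed_file_content"] = "".join(page.get_text() for page in doc)

Each node leaves its output on the row, so downstream steps like chunking and formatting stay uniform regardless of where the text came from.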

Comment on lines +2 to +3
-- loader_file_loader
create or replace function ai.loader_from_document
Collaborator

naming considerations:

  1. Our other vectorizer configs end in -ing: embedding, formatting, chunking. Should we follow the same convention?
  2. Would object_storage be more appropriate than document? We should describe "where" rather than "what"; we may load documents from other places in the future.

Comment on lines +2 to +3
-- parser_auto
create or replace function ai.parser_auto() returns pg_catalog.jsonb
Collaborator

parsing vs parser, to follow the existing convention?

@@ -2,7 +2,7 @@
 -------------------------------------------------------------------------------
 -- chunking_character_text_splitter
 create or replace function ai.chunking_character_text_splitter
-( chunk_column pg_catalog.name
+( chunk_column pg_catalog.name default ''
Collaborator

I think this column will go away completely once we have a loader_row configuration, but in case it doesn't, the default should be null rather than ''.

smoya commented Feb 10, 2025

Closing in favor of the WIP work in #442

@smoya smoya closed this Feb 10, 2025
@smoya smoya deleted the jascha/vectorizer-s3-document branch February 10, 2025 11:21