
feat: allow vectorizer to read documents via smart-open, parse via pymupdf #431

Closed

Conversation

@Askir Askir commented Feb 5, 2025

Uses our agreed-upon API spec, option 4.
Works end to end for local files and S3 (credential loading also works; see tests).

What's missing?

  • Validation in create_vectorizer (currently there is no check that the chunking column is empty when a file parser is configured, so configuring one crashes the vectorizer; see the sketch after this list)
  • Docs
  • Fixing the tests in the extension (I did the minimal thing to get the vectorizer working)
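
A rough sketch of what that missing check could look like (the function and attribute names here are assumptions, not code from this PR):

def validate_config(config) -> None:
    # A document loader implies the text comes from the parsed file, not from
    # a source-table column, so a configured chunk_column is a misconfiguration.
    if config.loader is not None and getattr(config.chunking, "chunk_column", None):
        raise ValueError(
            "chunking must not reference a source-table column when a file "
            "loader/parser is configured; the parsed file content is chunked instead"
        )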

adolsalamanca and others added 9 commits February 5, 2025 13:37
chunking needs special logic for document processing: we are no longer
expecting a column name that exists in the source table; instead, what gets
chunked is the content of the file stored in cloud object storage.

Co-authored-by: Sergio Moya <[email protected]>
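
As a rough illustration of that special logic (class and key names are hypothetical, not this PR's code), the chunker can fall back to the parsed document content when no source-table column is configured:

class CharacterTextSplitter:
    # chunk_column is None/empty for document vectorizers: the text to split
    # then comes from the parsed file content placed on the row upstream.
    def __init__(self, chunk_column: str | None, chunk_size: int = 800):
        self.chunk_column = chunk_column
        self.chunk_size = chunk_size

    def into_chunks(self, item: dict) -> list[str]:
        text = item[self.chunk_column] if self.chunk_column else item["parsed_file_content"]
        return [text[i : i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]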
fixes some tests while working on the new chunking strategy.

Co-authored-by: Adol Rodriguez <[email protected]>
pgai v0.8.0 was released, so it's time to rebase this branch

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
disable ollama tests and include new content in expected files.
remove unneeded asserts from the vectorizer test and add a statement timeout
to speed up feedback on transient local failures.
@Askir Askir changed the base branch from main to s3-integration-feature-branch February 5, 2025 17:54
@Askir Askir marked this pull request as ready for review February 5, 2025 18:11
@Askir Askir requested a review from a team as a code owner February 5, 2025 18:11
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from 577957e to 220aafa on February 5, 2025 18:13
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from 9af8756 to f0b2e2b on February 5, 2025 19:00
@Askir Askir force-pushed the jascha/vectorizer-s3-document branch from f0b2e2b to f364ca6 on February 5, 2025 19:41
Comment on lines +771 to 780
if self.vectorizer.config.loader is not None:
    file_content = self.vectorizer.config.loader.load(item)
    markdown_content = self.vectorizer.config.parser.parse_file_content(
        file_content
    )
    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
else:
    chunks = self.vectorizer.config.chunking.into_chunks(item)
for chunk_id, chunk in enumerate(chunks, 0):
    formatted = self.vectorizer.config.formatting.format(chunk, item)
Collaborator

I think it may be best to conceive of this as a processing pipeline where most processing nodes alter the row

Suggested change
-if self.vectorizer.config.loader is not None:
-    file_content = self.vectorizer.config.loader.load(item)
-    markdown_content = self.vectorizer.config.parser.parse_file_content(
-        file_content
-    )
-    chunks = self.vectorizer.config.chunking.into_chunks(markdown_content)
-else:
-    chunks = self.vectorizer.config.chunking.into_chunks(item)
-for chunk_id, chunk in enumerate(chunks, 0):
-    formatted = self.vectorizer.config.formatting.format(chunk, item)
+if self.vectorizer.config.loader is not None:
+    # loader reads a "url" column and loads binary data into item['file_contents']
+    self.vectorizer.config.loader.load(item)
+if self.vectorizer.config.parser is not None:
+    # reads some column (default: file_contents) and puts result in item['parsed_file_content']
+    self.vectorizer.config.parser.parse_file_content(item)
+chunks = self.vectorizer.config.chunking.into_chunks(item)
+for chunk_id, chunk in enumerate(chunks, 0):
+    formatted = self.vectorizer.config.formatting.format(chunk, item)
+    records_without_embeddings.append(pk + [chunk_id, formatted])
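
To make that pipeline model concrete, here is a minimal sketch of row-mutating nodes (the interfaces and class names are hypothetical, not this PR's actual code); smart_open and pymupdf are the libraries named in the PR title:

from typing import Protocol

import pymupdf
import smart_open

class PipelineNode(Protocol):
    # Each node reads keys from the row dict and writes its result back onto it.
    def process(self, item: dict) -> None: ...

class SmartOpenLoader:
    def process(self, item: dict) -> None:
        # Reads item["url"] (a local path or s3:// URI) and stores the raw bytes.
        with smart_open.open(item["url"], "rb") as f:
            item["file_contents"] = f.read()

class PyMuPDFParser:
    def process(self, item: dict) -> None:
        # Reads the loaded bytes and stores the extracted text on the row.
        with pymupdf.open(stream=item["file_contents"], filetype="pdf") as doc:
            item["parsed_file_content"] = "".join(page.get_text() for page in doc)

Each node leaves its output on the row, so downstream steps like chunking and formatting stay uniform regardless of where the text came from.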

Comment on lines +2 to +3
-- loader_file_loader
create or replace function ai.loader_from_document
Collaborator

naming considerations:

  1. Our other vectorizer configs end in -ing: embedding, formatting, chunking. Should we follow the same convention?
  2. Would object_storage be more appropriate than document? We should describe "where" rather than "what"; we may load documents from other places in the future.

Comment on lines +2 to +3
-- parser_auto
create or replace function ai.parser_auto() returns pg_catalog.jsonb
Collaborator

parsing vs parser, to follow the existing convention?

@@ -2,7 +2,7 @@
 -------------------------------------------------------------------------------
 -- chunking_character_text_splitter
 create or replace function ai.chunking_character_text_splitter
-( chunk_column pg_catalog.name
+( chunk_column pg_catalog.name default ''
Collaborator

I think this column will go away completely once we have a loader_row configuration, but in case it doesn't, the default should be null rather than ''.

smoya commented Feb 10, 2025

Closing in favor of the WIP work in #442

@smoya smoya closed this Feb 10, 2025
@smoya smoya deleted the jascha/vectorizer-s3-document branch February 10, 2025 11:21