feat: add support for generating embeddings for external documents #442

Open

smoya wants to merge 53 commits into main from s3-integration-feature-branch

Conversation

@smoya (Contributor) commented Feb 6, 2025

Summary

  • Added document loading and parsing capabilities to vectorizers, supporting various file formats (PDF, XLSX, HTML, EPUB, etc.)
  • Implemented loading_uri to fetch documents from local files or remote storage (S3)
  • Added automated parsing via parsing_auto to intelligently process different document types
  • Integrated a document retry mechanism for handling transient failures
  • Created an example of generating embeddings directly from documents (see the sketch below)
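
As an illustration, a hedged sketch of how these pieces might fit together. The table, column, and embedding settings are made up; `ai.create_vectorizer` and `ai.embedding_openai` are assumed from the existing pgai API:

```sql
-- Hypothetical usage sketch: table/column names and embedding settings
-- are examples, not taken from this PR's diff.
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    -- fetch each row's document from a local path or s3:// URI
    loading => ai.loading_uri(column_name => 'file_uri'),
    -- pick a parser per document type (PDF, XLSX, HTML, EPUB, ...)
    parsing => ai.parsing_auto(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```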

Test plan

  • Added comprehensive test suite for document loading and parsing
  • Added example with various document types to demonstrate functionality
  • Tested all document formats: PDF, XLSX, HTML, EPUB, and Markdown
  • Verified retry mechanisms work correctly with proper error handling

@smoya smoya force-pushed the s3-integration-feature-branch branch from 554c4dc to b5cf803 on February 10, 2025 10:18
* feat: tweak chunking func to support doc chunking in the future (#407)


---------

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
smoya and others added 3 commits February 10, 2025 16:00
…457)

continue the work to support s3 document loading

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
another iteration on implementation of documents processing.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch 2 times, most recently from b783aac to d7fdcaa on February 11, 2025 13:40
@smoya smoya force-pushed the s3-integration-feature-branch branch 2 times, most recently from 330e86c to b28f186 on February 18, 2025 10:37
@smoya smoya force-pushed the s3-integration-feature-branch branch from c96b0e8 to b2f01b4 on February 18, 2025 11:08

@jgpruitt (Collaborator) left a comment:

First pass. Only looked at the extension.

implementation of the document failure retries.
not only retry but also include info about those in the errors table.
make sure we don't retry unless retry_after is in the past.

tweak the queueing query so it ignores future records.
rename doc_retries and upgrade pyright.
Modified it to make it work with composite primary keys.
Also skip flaky test on CI.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 8eb851f to efded76 on March 5, 2025 11:20
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from efded76 to 1819549 on March 5, 2025 11:25
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 1819549 to f19d0a6 on March 5, 2025 11:32
both in the extension and in the vectorizer.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from f19d0a6 to c5ec403 on March 5, 2025 11:40
@smoya smoya requested review from jgpruitt and JamesGuthrie March 5, 2025 12:09
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 0deac42 to f831208 on March 6, 2025 10:44
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from f831208 to 75b164d on March 6, 2025 10:59
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 75b164d to 3c50aed on March 6, 2025 11:03
);

-- Update the vectorizer with new config
UPDATE ai.vectorizer

Collaborator commented:

We should update the version in the config to be the version of this release.
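
A hedged sketch of what that might look like; the `config` jsonb column and the `version` key are assumptions, and the release number is a placeholder:

```sql
-- Assumed shape of the fix: bump the stored version to this release.
UPDATE ai.vectorizer
SET config = jsonb_set(config, '{version}', to_jsonb('<this release>'::text));
```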

@cevian (Collaborator) left a comment:

Overall, good work.

My biggest concern is the worker retry handling; see comments below.

The parsing functions are:

- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.

Collaborator commented:

Suggested change
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_pymupdf](#aiparsing_pymupdf): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.

- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_docling](#aiparsing_docling): More powerful alternative to PyMuPDF. See [Docling](https://ds4sd.github.io/docling/supported_formats/) for supported formats.
- [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.

Collaborator commented:

Suggested change
- [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.
- [ai.parsing_none](#aiparsing_none): For cases where no parsing is needed.
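
For illustration, a minimal hedged sketch of selecting a specific parser rather than the `ai.parsing_auto()` default; the table, column, and embedding settings are hypothetical:

```sql
-- Hypothetical: force the heavier docling parser instead of letting
-- ai.parsing_auto() choose one per document.
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    loading => ai.loading_uri(column_name => 'file_uri'),
    parsing => ai.parsing_docling(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```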


### ai.parsing_none

You use `ai.parsing_none` to load the data as-is from the source table.

Collaborator commented:

Suggested change
You use `ai.parsing_none` to load the data as-is from the source table.
You use `ai.parsing_none` to skip the parsing step. Only appropriate for textual data.


### ai.parsing_docling

You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).

Collaborator commented:

Suggested change
You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).
You use `ai.parsing_docling` to parse the data using [docling](https://ds4sd.github.io/docling/).

- Download this example subdirectory. You can do this quickly by generating a downloadable `.zip` file from [here](https://download-directory.github.io/?url=https%3A%2F%2Fgithub.com%2Ftimescale%2Fpgai%2Ftree%2Fmain%2Fexamples%2Fembeddings_from_documents).
- A PostgreSQL database with the pgai extension installed. Refer to [pgai install](/docs/README.md#pgai-install) for installation instructions.
- Documents to process (various formats are supported, including MD, XLSX, HTML, and PDF). We will use those available in the [documents](documents) directory.
- A running instance of the [Vectorizer Worker](/docs/vectorizer/worker.md). In order to load the documents from the [documents](documents) directory, you need to modify the `compose.yaml` file created in the previous step and add the following volume to the `vectorizer-worker` service:

Collaborator commented:

What's the "previous step" here?

-- loading_uri
create or replace function ai.loading_uri
( column_name pg_catalog.name
, retries pg_catalog.int4 default 6)

Collaborator commented:

This retry only applies to loading, right? Not the rest of the workflow?
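
For context, a minimal usage sketch against the signature above; the `file_uri` column name is hypothetical:

```sql
-- Build a loading config that reads URIs from the file_uri column and
-- allows 3 retries instead of the default 6.
SELECT ai.loading_uri(column_name => 'file_uri', retries => 3);
```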

create table %I.%I
( %s
, queued_at pg_catalog.timestamptz not null default now()
, loading_retries pg_catalog.int4 not null default 0

Collaborator commented:

Do these not apply to parsing, just loading?
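
To make the question concrete, a hedged sketch of how a loading failure might bump this counter; the queue table name, `retry_after` column, backoff, and key are all assumptions pieced together from the commit messages above:

```sql
-- Assumed retry bookkeeping: increment the counter and delay the next
-- attempt with a linear backoff. 'id = 42' stands in for the real
-- (possibly composite) primary key.
UPDATE ai._vectorizer_q_1
SET loading_retries = loading_retries + 1,
    retry_after = now() + interval '5 minutes' * (loading_retries + 1)
WHERE id = 42;
```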

return guess.extension


class RowLoading(BaseModel):

Collaborator commented:

Didn't we change this to ColumnLoading in the Postgres API? Let's keep names consistent.

ON {merge_predicates}
WHEN MATCHED
AND target.loading_retries >= %(loading_retries)s THEN
DELETE

Collaborator commented:

This makes me unhappy, as it leaves no record of what permanently failed.
I see a few options:

  • add a tombstone column
  • add a tombstone table
  • just increment loading_retries again and have the "get stuff from work queue" logic ignore items that have too high a retry count (see the sketch below)
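
A hedged sketch of that third option; the queue table name, retry threshold, and `retry_after` column are assumptions based on the snippets and commit messages in this PR:

```sql
-- Option 3 sketch: never DELETE exhausted rows; filter them out when
-- claiming work, so a record of permanent failures stays in the queue.
SELECT *
FROM ai._vectorizer_q_1
WHERE loading_retries < 6      -- assumed max-retry threshold
  AND retry_after <= now()     -- honor the backoff window
ORDER BY queued_at
FOR UPDATE SKIP LOCKED
LIMIT 50;
```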

raise e.__cause__ # noqa
raise e

except UriLoadingError as e:

Collaborator commented:

Hmm, handling this error at this level fails the rest of the rows in the batch too, no? What if 99 items succeeded and 1 failed? That seems not very good. Why don't we handle these like we do in `_generate_embeddings`? Specifically, we should handle it like `ChunkEmbeddingError` and return but not raise.
