feat: add support for generating embeddings for external documents #442

Open

smoya wants to merge 53 commits into main from s3-integration-feature-branch

Conversation

@smoya (Contributor) commented Feb 6, 2025

Summary

  • Added document loading and parsing capabilities to vectorizers, supporting various file formats (PDF, XLSX, HTML, EPUB, etc.)
  • Implemented loading_uri to fetch documents from local files or remote storage (S3)
  • Added automated parsing via parsing_auto to intelligently process different document types
  • Integrated a document retry mechanism for handling transient failures
  • Created an example of generating embeddings directly from documents (see the sketch below)
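
As an illustration, a hedged sketch of how these pieces might fit together. The table, column, and embedding settings are made up; `ai.create_vectorizer` and `ai.embedding_openai` are assumed from the existing pgai API:

```sql
-- Hypothetical usage sketch: table/column names and embedding settings
-- are examples, not taken from this PR's diff.
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    -- fetch each row's document from a local path or s3:// URI
    loading => ai.loading_uri(column_name => 'file_uri'),
    -- pick a parser per document type (PDF, XLSX, HTML, EPUB, ...)
    parsing => ai.parsing_auto(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```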

Test plan

  • Added comprehensive test suite for document loading and parsing
  • Added example with various document types to demonstrate functionality
  • Tested all document formats: PDF, XLSX, HTML, EPUB, and Markdown
  • Verified retry mechanisms work correctly with proper error handling

@smoya smoya force-pushed the s3-integration-feature-branch branch from 554c4dc to b5cf803 on February 10, 2025 10:18
* feat: tweak chunking func to support doc chunking in the future (#407)


---------

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
smoya and others added 3 commits February 10, 2025 16:00
…457)

continue the work to support s3 document loading

Co-authored-by: Sergio Moya <[email protected]>
Co-authored-by: Adol Rodriguez <[email protected]>
another iteration on implementation of documents processing.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch 2 times, most recently from b783aac to d7fdcaa on February 11, 2025 13:40
@smoya smoya force-pushed the s3-integration-feature-branch branch 2 times, most recently from 330e86c to b28f186 on February 18, 2025 10:37
@smoya smoya force-pushed the s3-integration-feature-branch branch from c96b0e8 to b2f01b4 on February 18, 2025 11:08

@jgpruitt (Collaborator) left a comment:

First pass. Only looked at the extension.

implementation of the document failure retries.
not only retry but also include info about those in the errors table.
make sure we don't retry unless retry_after is in the past.

tweak the queueing query so it ignores future records.
rename doc_retries and upgrade pyright.
Modified it to make it work with composite primary keys.
Also skip flaky test on CI.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 8eb851f to efded76 on March 5, 2025 11:20
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from efded76 to 1819549 on March 5, 2025 11:25
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 1819549 to f19d0a6 on March 5, 2025 11:32
both in the extension and in the vectorizer.
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from f19d0a6 to c5ec403 on March 5, 2025 11:40
@smoya smoya requested review from jgpruitt and JamesGuthrie March 5, 2025 12:09
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 0deac42 to f831208 on March 6, 2025 10:44
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from f831208 to 75b164d on March 6, 2025 10:59
@adolsalamanca adolsalamanca force-pushed the s3-integration-feature-branch branch from 75b164d to 3c50aed on March 6, 2025 11:03
);

-- Update the vectorizer with new config
UPDATE ai.vectorizer

Collaborator commented:

We should update the version in the config to be the version of this release.
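
A hedged sketch of what that might look like; the `config` jsonb column and the `version` key are assumptions, and the release number is a placeholder:

```sql
-- Assumed shape of the fix: bump the stored version to this release.
UPDATE ai.vectorizer
SET config = jsonb_set(config, '{version}', to_jsonb('<this release>'::text));
```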

@cevian (Collaborator) left a comment:

Overall, good work.

My biggest concern is the worker retry handling; see comments below.

The parsing functions are:

- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.

Collaborator commented:

Suggested change
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_pymupdf](#aiparsing_pymupdf): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.

- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_docling](#aiparsing_docling): More powerful alternative to PyMuPDF. See [Docling](https://ds4sd.github.io/docling/supported_formats/) for supported formats.
- [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.

Collaborator commented:

Suggested change
- [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.
- [ai.parsing_none](#aiparsing_none): For cases where no parsing is needed.
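
For illustration, a minimal hedged sketch of selecting a specific parser rather than the `ai.parsing_auto()` default; the table, column, and embedding settings are hypothetical:

```sql
-- Hypothetical: force the heavier docling parser instead of letting
-- ai.parsing_auto() choose one per document.
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    loading => ai.loading_uri(column_name => 'file_uri'),
    parsing => ai.parsing_docling(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```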


### ai.parsing_none

You use `ai.parsing_none` to load the data as-is from the source table.

Collaborator commented:

Suggested change
You use `ai.parsing_none` to load the data as-is from the source table.
You use `ai.parsing_none` to skip the parsing step. Only appropriate for textual data.


### ai.parsing_docling

You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).

Collaborator commented:

Suggested change
You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).
You use `ai.parsing_docling` to parse the data using [docling](https://ds4sd.github.io/docling/).

- Download this example subdirectory. You can do this quickly by generating a downloadable `.zip` file from [here](https://download-directory.github.io/?url=https%3A%2F%2Fgithub.com%2Ftimescale%2Fpgai%2Ftree%2Fmain%2Fexamples%2Fembeddings_from_documents).
- A PostgreSQL database with the pgai extension installed. Refer to [pgai install](/docs/README.md#pgai-install) for installation instructions.
- Documents to process (various formats are supported, including MD, XLSX, HTML, and PDF). We will use those available in the [documents](documents) directory.
- A running instance of the [Vectorizer Worker](/docs/vectorizer/worker.md). In order to load the documents from the [documents](documents) directory, you need to modify the `compose.yaml` file created in the previous step and add the following volume to the `vectorizer-worker` service:

Collaborator commented:

What's the "previous step" here?

-- loading_uri
create or replace function ai.loading_uri
( column_name pg_catalog.name
, retries pg_catalog.int4 default 6)

Collaborator commented:

This retry only applies to loading, right? Not the rest of the workflow?
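
For context, a minimal usage sketch against the signature above; the `file_uri` column name is hypothetical:

```sql
-- Build a loading config that reads URIs from the file_uri column and
-- allows 3 retries instead of the default 6.
SELECT ai.loading_uri(column_name => 'file_uri', retries => 3);
```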

create table %I.%I
( %s
, queued_at pg_catalog.timestamptz not null default now()
, loading_retries pg_catalog.int4 not null default 0

Collaborator commented:

Do these not apply to parsing, just loading?
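
To make the question concrete, a hedged sketch of how a loading failure might bump this counter; the queue table name, `retry_after` column, backoff, and key are all assumptions pieced together from the commit messages above:

```sql
-- Assumed retry bookkeeping: increment the counter and delay the next
-- attempt with a linear backoff. 'id = 42' stands in for the real
-- (possibly composite) primary key.
UPDATE ai._vectorizer_q_1
SET loading_retries = loading_retries + 1,
    retry_after = now() + interval '5 minutes' * (loading_retries + 1)
WHERE id = 42;
```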

return guess.extension


class RowLoading(BaseModel):

Collaborator commented:

Didn't we change this to ColumnLoading in the Postgres API? Let's keep names consistent.

ON {merge_predicates}
WHEN MATCHED
AND target.loading_retries >= %(loading_retries)s THEN
DELETE

Collaborator commented:

This makes me unhappy, as it leaves no record of what permanently failed.
I see a few options:

  • add a tombstone column
  • add a tombstone table
  • just increment loading_retries again and have the "get stuff from work queue" logic ignore items that have too high a retry count (see the sketch below)
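
A hedged sketch of that third option; the queue table name, retry threshold, and `retry_after` column are assumptions based on the snippets and commit messages in this PR:

```sql
-- Option 3 sketch: never DELETE exhausted rows; filter them out when
-- claiming work, so a record of permanent failures stays in the queue.
SELECT *
FROM ai._vectorizer_q_1
WHERE loading_retries < 6      -- assumed max-retry threshold
  AND retry_after <= now()     -- honor the backoff window
ORDER BY queued_at
FOR UPDATE SKIP LOCKED
LIMIT 50;
```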

raise e.__cause__ # noqa
raise e

except UriLoadingError as e:

Collaborator commented:

Hmm, handling this error at this level fails the rest of the rows in the batch too, no? What if 99 items succeeded and 1 failed? That seems not very good. Why don't we handle these like we do in `_generate_embeddings`? Specifically, we should handle it like `ChunkEmbeddingError` and return but not raise.
