feat: add support for generating embeddings for external documents #442
base: main
Conversation
* feat: tweak chunking func to support doc chunking in the future (#407)
…457) continue the work to support S3 document loading
Another iteration on the implementation of document processing.
We're adding a couple of new columns to the vectorizer queue tables so we can track whether embeddings are being generated properly or require a retry.
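As a rough sketch, the kind of columns this implies might look like the following (the queue table and column names are assumptions for illustration, not the PR's exact schema):

```sql
-- Hypothetical queue-table columns for retry tracking:
alter table ai._vectorizer_q_1
    add column loading_retries int not null default 0,   -- attempts so far
    add column loading_retry_after timestamptz;          -- earliest next attempt
```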
…ment (#480)
First pass. Only looked at the extension.
Implementation of the document failure retries: not only retry, but also record information about those failures in the errors table. Make sure we don't retry unless retry_after is in the past. Tweak the queueing query so it ignores future records. Rename doc_retries and upgrade pyright.
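A sketch of what the adjusted queueing query might look like (the queue table and column names are assumptions):

```sql
-- Skip items whose retry window hasn't opened yet; rows with
-- loading_retry_after in the future are ignored until it passes.
select *
from ai._vectorizer_q_1
where loading_retry_after is null
   or loading_retry_after <= now()
for update skip locked
limit 50;
```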
Modified it to make it work with composite primary keys. Also skip a flaky test on CI.
…both in the extension and in the vectorizer.
);

-- Update the vectorizer with new config
UPDATE ai.vectorizer
We should update the version in the config to be the version of this release.
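For instance, something along these lines (a sketch; the config layout and version key are assumptions):

```sql
-- Stamp the stored vectorizer config with this release's version
-- ('x.y.z' is a placeholder).
update ai.vectorizer
set config = jsonb_set(config, '{version}', to_jsonb('x.y.z'::text));
```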
Overall, good work.
My biggest concern is the worker retry handling. See comments.
The parsing functions are:

- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
Suggested change:
Before: - [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
After: - [ai.parsing_pymupdf](#aiparsing_pymupdf): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_auto](#aiparsing_auto): Automatically selects the appropriate parser based on file type.
- [ai.parsing_none](#aiparsing_none): Converts various formats to Markdown. See [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) for supported formats.
- [ai.parsing_docling](#aiparsing_docling): More powerful alternative to PyMuPDF. See [Docling](https://ds4sd.github.io/docling/supported_formats/) for supported formats.
- [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.
Suggested change:
Before: - [ai.parsing_pymupdf](#aiparsing_pymupdf): For cases where no parsing is needed.
After: - [ai.parsing_none](#aiparsing_none): For cases where no parsing is needed.
### ai.parsing_none

You use `ai.parsing_none` to load the data as-is from the source table.
Suggested change:
Before: You use `ai.parsing_none` to load the data as-is from the source table.
After: You use `ai.parsing_none` to skip the parsing step. Only appropriate for textual data.
### ai.parsing_docling

You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).
Suggested change:
Before: You use `ai.parsing_docling` to parse the data provided by the loader using [docling](https://ds4sd.github.io/docling/).
After: You use `ai.parsing_docling` to parse the data using [docling](https://ds4sd.github.io/docling/).
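For context, parser selection happens when the vectorizer is created. A hypothetical call might look like this (the argument set and function signatures here are assumptions, not necessarily the PR's exact API):

```sql
-- Sketch: create a vectorizer that loads documents from a URI column
-- and lets ai.parsing_auto pick a parser per file type.
select ai.create_vectorizer(
    'public.documents'::regclass,
    loading   => ai.loading_uri(column_name => 'uri'),
    parsing   => ai.parsing_auto(),
    chunking  => ai.chunking_recursive_character_text_splitter(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```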
- Download this example subdirectory. You can do this quickly by generating a downloadable `.zip` file from [here](https://download-directory.github.io/?url=https%3A%2F%2Fgithub.com%2Ftimescale%2Fpgai%2Ftree%2Fmain%2Fexamples%2Fembeddings_from_documents).
- A PostgreSQL database with the pgai extension installed. Refer to [pgai install](/docs/README.md#pgai-install) for installation instructions.
- Documents to process (various formats are supported, including MD, XLSX, HTML, and PDF). We will use those available in the [documents](documents) directory.
- A running instance of the [Vectorizer Worker](/docs/vectorizer/worker.md). In order to load the documents from the [documents](documents) directory, you need to modify the `compose.yaml` file created in the previous step and add the following volume to the `vectorizer-worker` service:
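The compose snippet itself isn't reproduced in this excerpt; a plausible sketch (the host path and mount point are assumptions):

```yaml
services:
  vectorizer-worker:
    volumes:
      # Mount the example documents so the worker can read them.
      - ./documents:/documents
```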
What's the "previous step" here?
-- loading_uri
create or replace function ai.loading_uri
( column_name pg_catalog.name
, retries pg_catalog.int4 default 6)
This retry only applies to loading, right? Not the rest of the workflow?
create table %I.%I
( %s
, queued_at pg_catalog.timestamptz not null default now()
, loading_retries pg_catalog.int4 not null default 0
Do these not apply to parsing, just loading?
return guess.extension


class RowLoading(BaseModel):
Didn't we change this to ColumnLoading in the Postgres API? Let's keep names consistent.
ON {merge_predicates}
WHEN MATCHED
  AND target.loading_retries >= %(loading_retries)s THEN
  DELETE
This makes me unhappy, as it leaves no record of what permanently failed.
I see a few options:
- add a tombstone column
- add a tombstone table
- just increment loading_retries again and have the "get stuff from work queue" logic ignore items that have too high a retry count.
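A sketch of the tombstone-table option, for illustration (all names assumed):

```sql
-- Failed items move here instead of being deleted outright,
-- preserving a record of what permanently failed.
create table ai._vectorizer_q_failed_1
( like ai._vectorizer_q_1 including defaults
, failed_at timestamptz not null default now()
);
```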
raise e.__cause__  # noqa
raise e

except UriLoadingError as e:
Hmm, handling this error at this level fails the rest of the rows in the batch too. What if 99 items succeeded and 1 failed? This seems not very good. Why don't we handle these like we do in `_generate_embeddings`? Specifically, we should handle it like `ChunkEmbeddingError` and return but not raise.
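A sketch of that shape (the exception and helper names are illustrative, not pgai's actual internals):

```python
class UriLoadingError(Exception):
    """Placeholder for the real loading exception type."""


def load_batch(rows, load_row, record_error):
    """Load each row; record per-row failures instead of raising,
    so one bad document doesn't fail the rest of the batch."""
    loaded = []
    for row in rows:
        try:
            loaded.append(load_row(row))
        except UriLoadingError as e:
            record_error(row, e)  # e.g., written to the errors table
    return loaded
```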