Add processors for generating embeddings, adding context to docs, submitting prompts #2008

hariso · 2024-12-11T13:28:53Z

Description

Depends on: https://github.com/conduitio-labs/conduit-connector-weaviate/tree/haris/different-vectors.

Quick checks

I have followed the Code Guidelines.
There is no other pull request for the same update/change.
I have written unit tests.
I have made sure that the PR is of reasonable size and can be easily reviewed.

lyuboxa · 2024-12-12T22:10:38Z

pkg/plugin/processor/builtin/impl/ai/openai/embedding.go

+type embeddingProcConfig struct {
+	APIKey     string `json:"apiKey" validate:"required"`
+	Endpoint   string `json:"endpoint" default:"https://api.openai.com/v1"`
+	Model      string `json:"model" validate:"required,inclusion=gpt-4|gpt-4-turbo|gpt-3.5-turbo|text-davinci-003|text-davinci-002|text-curie-001|text-babbage-001|text-ada-001"`


So these are not embedding models?

Good catch, I was too quick to copy-paste all the models. :)
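For illustration, the check discussed above could be sketched as a plain allow-list lookup. This is only a sketch: the three model names are OpenAI embedding models known at the time of writing, not necessarily the list the PR ended up with, and `validateEmbeddingModel` is a hypothetical helper, not part of the processor.

```go
package main

import "fmt"

// embeddingModels is an assumed allow-list of OpenAI embedding models.
// The list actually used in the PR after the fix may differ.
var embeddingModels = map[string]struct{}{
	"text-embedding-3-small": {},
	"text-embedding-3-large": {},
	"text-embedding-ada-002": {},
}

// validateEmbeddingModel (hypothetical helper) rejects any model name that is
// not in the allow-list, e.g. chat or completion models such as "gpt-4".
func validateEmbeddingModel(model string) error {
	if _, ok := embeddingModels[model]; !ok {
		return fmt.Errorf("model %q is not a known embedding model", model)
	}
	return nil
}

func main() {
	for _, m := range []string{"text-embedding-3-small", "gpt-4"} {
		if err := validateEmbeddingModel(m); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", m)
	}
}
```

A fixed allow-list only fits the official OpenAI endpoint; with a configurable `endpoint` pointing at a compatible service, a strict list might reject models that are valid there.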

lyuboxa · 2024-12-13T14:10:22Z

pkg/plugin/processor/builtin/impl/ai/openai/embedding.go

+		Msg("got embeddings")
+
+	for i, record := range records {
+		record.Metadata[EmbeddingMetadataBase64] = embeddings.Data[i].EmbeddingBase64


These can be horribly large. My suggestion here is to:

1. Add the embedding to .Payload.After.

2. Include the model name used for the embedding.

Particularly 2, since there are other OpenAI-compatible systems which may allow different models to be used.

Additionally, these can be very large (depending on the model), and I find that you can reduce their size with compression. They are all floats, too. Base64 will not be very compressible.

You're right, especially about no. 2. As for 1, that would also mean that raw data records get transformed into structured data records. That might be unexpected for some destinations.
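A rough sketch of the idea in this thread: keep the model name next to the vector, and store the floats as gzip-compressed binary instead of Base64 text. Everything here is an assumption made for illustration: the `storedEmbedding` type, its field names, and the `packEmbedding`/`unpackEmbedding` helpers are hypothetical and not taken from the PR. How much space compression actually saves depends on the data.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

// storedEmbedding is a hypothetical shape for what a record could carry:
// the model that produced the vector plus the compressed vector itself.
type storedEmbedding struct {
	Model string // model name used for the embedding
	Dim   int    // number of float32 values in the vector
	Data  []byte // gzip-compressed little-endian float32 values
}

// packEmbedding writes the floats as little-endian binary through a gzip writer.
func packEmbedding(model string, vec []float32) (storedEmbedding, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := binary.Write(zw, binary.LittleEndian, vec); err != nil {
		return storedEmbedding{}, err
	}
	if err := zw.Close(); err != nil { // Close flushes the remaining compressed data
		return storedEmbedding{}, err
	}
	return storedEmbedding{Model: model, Dim: len(vec), Data: buf.Bytes()}, nil
}

// unpackEmbedding reverses packEmbedding and returns the original floats.
func unpackEmbedding(e storedEmbedding) ([]float32, error) {
	zr, err := gzip.NewReader(bytes.NewReader(e.Data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	vec := make([]float32, e.Dim)
	if err := binary.Read(zr, binary.LittleEndian, vec); err != nil {
		return nil, err
	}
	return vec, nil
}

func main() {
	packed, err := packEmbedding("text-embedding-3-small", []float32{0.25, -1.5, 3})
	if err != nil {
		panic(err)
	}
	vec, err := unpackEmbedding(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(packed.Model, vec)
}
```

Note that a binary field like `Data` fits a structured payload more naturally than string-valued metadata, which ties back to the concern about turning raw data records into structured ones.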

hariso added 7 commits December 10, 2024 20:33

OpenAI embeddings processor

fea96e0

prompt processor

6133700

linter

6d74f54

get context processor

2bfb465

get context

cd96a21

lint

33681dc

logs

28db62d

lyuboxa reviewed Dec 12, 2024


wrong list of embedding models

956c9f6

lyuboxa reviewed Dec 13, 2024


hariso added 2 commits December 13, 2024 15:12

AI pipeline

0c8a703

hack

0d50385

hariso changed the title ~~OpenAI embeddings processor~~ Add processors for generating embeddings, adding context to docs, submitting prompts Dec 13, 2024
