Draft of suggested changes to the vectorstore #579

Open
wants to merge 2 commits into
base: add-metadata-filter-func

fantix
Member

@fantix fantix commented Feb 21, 2025

This PR includes a draft implementation of how I would design the vectorstore client APIs. The idea is to stay consistent with the rest of the client bindings, honor EdgeQL, and be Pythonic, while making the API easier to use.

I didn’t know the team had agreed on the APIs, so please just treat this PR as a reference and cherry-pick what you need.

Refs #573

Usage

Adding records:

import uuid

import gel
import numpy as np
from gel import ai

client = gel.create_client()
model = MyEmbeddingModel(...)

# generate an embedding from text, with arbitrary metadata as keyword arguments
record_id: uuid.UUID = client.query_required_single(
    model.store().add_text(
        "user text to generate embeddings",
        tag1="red",
        prop="pyro",
    )
)

# bring your own embedding
record_id: uuid.UUID = client.query_required_single(
    model.store().add_embedding(
        np.asarray([3.0, 9.0, -42.5], dtype=np.float32),
        # text="optional text",
        tag1="blue",
        prop="hydro",
    )
)

# batch insertion
record_ids = client.query(
    ai.AddRecords(
        model.store().add_text(...),
        model.store().add_embedding(...),
        model.store().add_text(...),
        model.store().add_embedding(...),
    )
)

Search by vector similarity and filter by metadata:

records = client.query(
    model
        .store()
        .search_by_vector(np.asarray([3.0, 9.0, -42.5], dtype=np.float32))
        .filter_metadata("prop", eq="hydro")  # (<str>json_get(.metadata, 'prop') = <str>$0) ?? false
        .filter_metadata("nested", "int", gt=5)  # (<int64>json_get(.metadata, 'nested', 'int') > <int64>$1) ?? false
        .filter_metadata("inc", "empty", lte=2.0, default=True)  # (<float64>json_get(.metadata, 'inc', 'empty') <= <float64>$2) ?? true
        .filter("exists(json_get(.metadata, 'custom', 'edgeql'))")
        .limit(10)
)
for record in records:
    print(record.text, record.cosine_similarity)

(Note: I find that the previous EdgeQL-query-builder-like API is neither easy to use nor does it spare the user from needing EdgeQL knowledge. Users could easily end up composing a complex filter with many and/or's without handling empty sets or missing keys correctly, and then be surprised by unexpected results. Learning EdgeQL afterwards wouldn't help them fix it either, because the query builder is incomplete. Since we cannot easily introduce a full-blown query builder into just the vectorstore API, I think we should use the simplest, most native API and hide the EdgeQL complications from AI users.)
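To make the missing-key handling concrete, here is a minimal sketch of how a `filter_metadata(...)` call could be turned into the EdgeQL shown in the comments above. It is not the PR's implementation: the helper name `metadata_filter`, the `arg_index` parameter, the operator table, and every keyword other than `eq`, `gt`, and `lte` are assumptions made for illustration. The point is the trailing `?? default`: when the key is missing, `json_get` yields an empty set, the comparison is empty too, and the coalescing operator substitutes the default.

```python
from typing import Any, Tuple

# Hypothetical keyword-to-operator table; only eq, gt and lte appear in the PR text.
_OPS = {"eq": "=", "ne": "!=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def _edgeql_type(value: Any) -> str:
    """Pick an EdgeQL scalar type for the cast from the Python value."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported metadata value type: {type(value).__name__}")


def _quote(key: str) -> str:
    """Render a metadata key as a single-quoted string literal."""
    return "'" + key.replace("\\", "\\\\").replace("'", "\\'") + "'"


def metadata_filter(
    *path: str, arg_index: int, default: bool = False, **op: Any
) -> Tuple[str, Any]:
    """Return (filter expression, query argument) for one metadata comparison."""
    if not path:
        raise ValueError("at least one metadata key is required")
    if len(op) != 1:
        raise ValueError("exactly one comparison keyword is required")
    (name, value), = op.items()
    if name not in _OPS:
        raise ValueError(f"unknown comparison keyword: {name}")
    typ = _edgeql_type(value)
    keys = ", ".join(_quote(key) for key in path)
    fallback = "true" if default else "false"
    expr = (
        f"(<{typ}>json_get(.metadata, {keys}) {_OPS[name]} "
        f"<{typ}>${arg_index}) ?? {fallback}"
    )
    return expr, value


expr, arg = metadata_filter("inc", "empty", arg_index=2, lte=2.0, default=True)
print(expr)
```

With a helper like this, each chained `filter_metadata` call would contribute one self-contained boolean term, so combining terms never silently drops rows because of an empty set.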

Update records:

# update embedding
client.execute(
    model.store().update_record(
        uuid.UUID("75c3a683-0cd2-43d6-9bef-ad78f35b6381"),
        embedding=np.asarray([3.0, 9.0, -42.5], dtype=np.float32),
    )
)

# update embedding and text
client.execute(
    model.store().update_record(
        uuid.UUID("75c3a683-0cd2-43d6-9bef-ad78f35b6381"),
        embedding=np.asarray([3.0, 9.0, -42.5], dtype=np.float32),
        text="new sentence",
    )
)

# update metadata only
record_id = uuid.UUID("75c3a683-0cd2-43d6-9bef-ad78f35b6381")
record = client.query(model.store().get_by_ids(record_id))
new_metadata = record[0].metadata.copy()
new_metadata.setdefault("nested", {})["value"] = "new"
client.execute(
    model.store().update_record(
        record_id,
        metadata=new_metadata,
    )
)

# clear metadata
client.execute(
    model.store().update_record(
        uuid.UUID("75c3a683-0cd2-43d6-9bef-ad78f35b6381"),
        metadata=None,
    )
)
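One detail of the metadata-only update above: `dict.copy()` is shallow, so `setdefault("nested", {})` returns the same nested dict the fetched record holds, and the assignment mutates it in place. If that matters, a deep copy avoids it. The helper below is a hypothetical sketch (the name `with_nested_value` is not part of the PR) and assumes the metadata is a plain nested `dict`.

```python
import copy
from typing import Any, Dict, Optional, Sequence


def with_nested_value(
    metadata: Optional[Dict[str, Any]], path: Sequence[str], value: Any
) -> Dict[str, Any]:
    """Return a deep copy of metadata with value set at the nested key path."""
    updated = copy.deepcopy(metadata) if metadata else {}
    node = updated
    # Walk (and create) intermediate dicts for every key except the last one.
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return updated


original = {"nested": {"value": "old", "keep": 1}, "prop": "hydro"}
updated = with_nested_value(original, ["nested", "value"], "new")
```

The result could then be passed as `metadata=updated` to `update_record`, leaving the fetched record untouched.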

All of the above just works with EdgeDB transactions:

for tx in client.transaction():
    with tx:
        tx.execute(model.store().update_record(...))
        tx.execute("update OtherObject filter ...")

The asynchronous interface is the same, except for an explicit await wherever an embedding is generated:

client = gel.create_async_client()
model = MyAsyncModel()
await client.query_required_single(
    await model.store().add_text(...)
)

The explicit await makes the model call visible, so the user generates embeddings outside of transactions:

add_cmd = await model.store().add_text(...)
async for tx in client.transaction():
    async with tx:
        record_id = await tx.query_required_single(add_cmd)
        await tx.execute('update MyObj ...', record_id)
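Why this matters can be shown without a database. The sketch below is a self-contained simulation, not Gel code: `FakeAsyncModel`, `AddCommand`, and `run_with_retries` are hypothetical stand-ins, with `run_with_retries` playing the role of a transaction loop that re-runs its body after a retryable error. Because the command is built before the loop, the (slow, possibly billed) embedding call happens once no matter how often the body is retried.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Tuple


@dataclass(frozen=True)
class AddCommand:
    """Hypothetical prepared insert: the text plus its precomputed embedding."""
    text: str
    embedding: Tuple[float, ...]


class FakeAsyncModel:
    """Stand-in for an embedding model; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    async def add_text(self, text: str) -> AddCommand:
        self.calls += 1
        await asyncio.sleep(0)  # pretend to call a remote embedding service
        return AddCommand(text, (float(len(text)),))


async def run_with_retries(
    body: Callable[[int], Awaitable[AddCommand]], attempts: int = 3
) -> AddCommand:
    """Stand-in for a retrying transaction loop: re-run body until it succeeds."""
    for attempt in range(attempts):
        try:
            return await body(attempt)
        except ConnectionError:
            continue
    raise RuntimeError("transaction failed after retries")


async def main() -> Tuple[int, AddCommand]:
    model = FakeAsyncModel()
    add_cmd = await model.add_text("hello")  # embedding generated once, up front

    async def body(attempt: int) -> AddCommand:
        if attempt == 0:
            raise ConnectionError("simulated retryable failure")
        return add_cmd  # the retried body only reuses the prepared command

    stored = await run_with_retries(body)
    return model.calls, stored


calls, stored = asyncio.run(main())
```

Had `model.add_text(...)` been awaited inside `body`, the model would have been called once per attempt.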

No extra await is needed when the model is not called to generate an embedding:

await client.query(
    model.store().add_embedding(...)
)

@fantix fantix requested a review from diksipav February 21, 2025 03:00
@fantix fantix changed the title Draft of suggested changes Draft of suggested changes to the vectorstore Feb 21, 2025