Skip to content
This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Fix missing unique key when searching embeddings #226

Merged
merged 5 commits into from
Dec 7, 2023
Merged

Conversation

xuebinsu
Copy link
Contributor

@xuebinsu xuebinsu commented Dec 5, 2023

Unique key is required when searching embeddings to join the embedding table and the original data table before returning the results. Previously, search on a dataframe that created from an existing table in database failed due to lacking of unique key in the dataframe.

This patch fixes the issue by recording the unique key when create_index() in pg_class so that the info can be read when search().

Fix missing unique key when searching embeddings

Unique key is required when searching embeddings to join the
embedding table and the original data table before returing the results.
Previously, search on a dataframe that created from an existing table
in database failed due to lacking of unique key in the dataframe.

This patch fixes the issue by recoding the unique key when
`create_index()` in `pg_class` so that the info can be read when
`search()`.
@xuebinsu xuebinsu requested review from ruxuez, yihong0618 and beeender and removed request for ruxuez December 5, 2023 05:25
tests/test_embedding.py Show resolved Hide resolved
tests/test_embedding.py Outdated Show resolved Hide resolved
@@ -19,8 +21,13 @@ def test_embedding_query_string(db: gp.Database):
)
.check_unique(columns={"id"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. check_unique actually does more than *check", it creates indexes which is not obvious from the function name. Shall we consider to rename the function?
  2. Need tests multi columns for check_unique() and search(). Another PR will be fine since it is not relevant to this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "check" comes from SQL https://www.postgresql.org/docs/current/ddl-constraints.html, like in

CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);

AFAIK, creating an index is the only way for database to ensure that a set of columns contains only unique values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a test case for multi-column unique key.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[5.4.1. Check Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS)
[5.4.2. Not-Null Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-NOT-NULL)
[5.4.3. Unique Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS)

5.4.1. Check Constraints

A check constraint is the most generic constraint type. It allows you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For instance, to require positive product prices, you could use:

CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);

Doesn't this mean CHECK is one kind of constrains, and UNIQUE is another kind of constrain?

Is check in check_unique a verb? Or do I miss anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "constraints" is the object of "check".

That is, "check" is used for "check constraints". In this example price > 0 is the constraint, and uniqueness is another type of constraints.

Therefore, I think it makes sense to call this function check_unique.

Comment on lines 11 to 12
for row in results:
assert row["content"] == "I like eating apples."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since only one item here no need to for loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Changed.

Comment on lines 39 to 62
@pytest.mark.requires_pgvector
def test_embedding_multi_col_unique(db: gp.Database):
content = ["I have a dog.", "I like eating apples."]
columns = {"id": range(len(content)), "id2": [1] * len(content), "content": content}
t = (
db.create_dataframe(columns=columns)
.save_as(
temp=True,
column_names=list(columns.keys()),
distribution_key={"id"},
distribution_type="hash",
drop_if_exists=True,
drop_cascade=True,
)
.check_unique(columns={"id", "id2"})
)
t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
print(
"reloptions =",
db._execute(
f"SELECT reloptions FROM pg_class WHERE oid = '{t._qualified_table_name}'::regclass"
),
)
search_embeddings(t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test always pass? seems we have no assert here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserts are in search_embeddings().

assert len(list(df)) == 1
for row in df:
assert row["content"] == "I like eating apples."
t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
Copy link
Contributor

@ruxuez ruxuez Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to assign result to t anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, let me fix it.

Copy link
Contributor

@yihong0618 yihong0618 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xuebinsu xuebinsu merged commit 2147a59 into main Dec 7, 2023
6 checks passed
@xuebinsu xuebinsu deleted the fix_unique_key branch December 7, 2023 02:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants