Fix missing unique key when searching embeddings #226

xuebinsu · 2023-12-05T05:22:53Z

A unique key is required when searching embeddings, because the embedding table must be joined with the original data table before the results are returned. Previously, searching a dataframe created from an existing database table failed because the dataframe lacked a unique key.

This patch fixes the issue by recording the unique key in `pg_class` during `create_index()`, so that `search()` can read it back.
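As a rough illustration of the idea (not the actual implementation in this PR), the sketch below uses Python's built-in `sqlite3` as a stand-in for the database. The `table_options` table is a hypothetical substitute for the options stored in `pg_class`, the `distance` column stands in for a real vector-distance computation, and the `create_index()` and `search()` functions here are simplified stand-ins that only share their names with the real methods.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE docs (id INTEGER, id2 INTEGER, content TEXT);
    CREATE TABLE docs_embedding (id INTEGER, id2 INTEGER, distance REAL);
    -- Hypothetical stand-in for per-table options kept in pg_class.
    CREATE TABLE table_options (table_name TEXT PRIMARY KEY, options TEXT);
    INSERT INTO docs VALUES (0, 1, 'I have a dog.'), (1, 1, 'I like eating apples.');
    -- Precomputed distances replace a real embedding comparison.
    INSERT INTO docs_embedding VALUES (0, 1, 0.9), (1, 1, 0.1);
    """
)


def create_index(conn, table, unique_key):
    """Record the unique key next to the table so it survives the session."""
    options = json.dumps({"unique_key": sorted(unique_key)})
    conn.execute("INSERT OR REPLACE INTO table_options VALUES (?, ?)", (table, options))


def search(conn, table, limit=1):
    """Read the recorded unique key and use it to join back to the data table."""
    row = conn.execute(
        "SELECT options FROM table_options WHERE table_name = ?", (table,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no unique key recorded for {table!r}")
    key = json.loads(row[0])["unique_key"]
    join_cond = " AND ".join(f't."{c}" = e."{c}"' for c in key)
    sql = (
        f'SELECT t.content FROM "{table}" AS t '
        f'JOIN "{table}_embedding" AS e ON {join_cond} '
        "ORDER BY e.distance LIMIT ?"
    )
    return [r[0] for r in conn.execute(sql, (limit,))]


create_index(conn, "docs", {"id", "id2"})
results = search(conn, "docs")
```

Because the key is stored with the table rather than on the in-memory dataframe object, a dataframe created later from the same existing table can still be searched.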



beeender · 2023-12-05T06:15:09Z

tests/test_embedding.py

@@ -19,8 +21,13 @@ def test_embedding_query_string(db: gp.Database):
        )
        .check_unique(columns={"id"})


`check_unique` actually does more than *check*: it also creates indexes, which is not obvious from the function name. Should we consider renaming the function?

We also need tests covering multiple columns for `check_unique()` and `search()`. A separate PR is fine, since that is not related to this one.

The word "check" comes from SQL (https://www.postgresql.org/docs/current/ddl-constraints.html), as in

CREATE TABLE products ( product_no integer, name text, price numeric CHECK (price > 0) );

AFAIK, creating an index is the only way for a database to ensure that a set of columns contains only unique values.

I will add a test case for multi-column unique key.
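To make the point about unique indexes concrete, here is a minimal sketch using Python's built-in `sqlite3` rather than Greenplum or PostgreSQL, so the behavior shown is only an analogy. The table name `t` and index name `t_id_id2_key` are made up for illustration. It mirrors the multi-column case, where `id2` repeats but the pair `(id, id2)` is still unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, id2 INTEGER, content TEXT)")
# The unique index is what enforces uniqueness of the (id, id2) pair.
conn.execute("CREATE UNIQUE INDEX t_id_id2_key ON t (id, id2)")


def try_insert(row):
    """Return True if the row was accepted, False if it violated uniqueness."""
    try:
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


first_accepted = try_insert((0, 1, "I have a dog."))
# id2 repeats here, but the pair (1, 1) is new, so the row is accepted.
second_accepted = try_insert((1, 1, "I like eating apples."))
# The pair (0, 1) already exists, so the index rejects this row.
duplicate_accepted = try_insert((0, 1, "duplicate key"))
row_count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
```

In this sketch the uniqueness guarantee comes entirely from the index, which is why a function that ensures uniqueness ends up creating one.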

[5.4.1. Check Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS)
[5.4.2. Not-Null Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-NOT-NULL)
[5.4.3. Unique Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS)

5.4.1. Check Constraints

A check constraint is the most generic constraint type. It allows you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For instance, to require positive product prices, you could use:

CREATE TABLE products ( product_no integer, name text, price numeric CHECK (price > 0) );

Doesn't this mean that CHECK is one kind of constraint and UNIQUE is another kind?

Is "check" in `check_unique` a verb? Or am I missing something?

I think "constraints" is the object of "check".

That is, "check" is used as in "check constraints". In this example, price > 0 is the constraint, and uniqueness is another type of constraint.

Therefore, I think it makes sense to call this function check_unique.

yihong0618 · 2023-12-06T13:41:24Z

tests/test_embedding.py

+    for row in results:
+        assert row["content"] == "I like eating apples."


Since there is only one item here, is the for loop needed?

Thanks! Changed.

yihong0618 · 2023-12-06T13:44:29Z

tests/test_embedding.py

+@pytest.mark.requires_pgvector
+def test_embedding_multi_col_unique(db: gp.Database):
+    content = ["I have a dog.", "I like eating apples."]
+    columns = {"id": range(len(content)), "id2": [1] * len(content), "content": content}
+    t = (
+        db.create_dataframe(columns=columns)
+        .save_as(
+            temp=True,
+            column_names=list(columns.keys()),
+            distribution_key={"id"},
+            distribution_type="hash",
+            drop_if_exists=True,
+            drop_cascade=True,
+        )
+        .check_unique(columns={"id", "id2"})
+    )
+    t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
+    print(
+        "reloptions =",
+        db._execute(
+            f"SELECT reloptions FROM pg_class WHERE oid = '{t._qualified_table_name}'::regclass"
+        ),
+    )
+    search_embeddings(t)


Does this test always pass? It seems there is no assert here.

Asserts are in search_embeddings().

ruxuez · 2023-12-06T16:45:45Z

tests/test_embedding.py

-    assert len(list(df)) == 1
-    for row in df:
-        assert row["content"] == "I like eating apples."
+    t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")


We don't need to assign the result to t anymore?

Thanks, let me fix it.

yihong0618

LGTM

vmwclabot added the cla-not-required label Dec 5, 2023

xuebinsu requested review from ruxuez, yihong0618 and beeender and removed request for ruxuez December 5, 2023 05:25

beeender reviewed Dec 5, 2023

Xuebin Su added 2 commits December 5, 2023 02:08

Add multi-column unique key case

64daab4

Rephrase

8604cb0

beeender approved these changes Dec 6, 2023

yihong0618 reviewed Dec 6, 2023

ruxuez reviewed Dec 6, 2023

Make test more rigorous

c58dd98

yihong0618 approved these changes Dec 7, 2023

Fix test again

89175b8

xuebinsu merged commit 2147a59 into main Dec 7, 2023
6 checks passed

xuebinsu deleted the fix_unique_key branch December 7, 2023 02:38
