Fix missing unique key when searching embeddings #226
A unique key is required when searching embeddings, in order to join the embedding table with the original data table before returning the results. Previously, searching on a dataframe that was created from an existing table in the database failed because the dataframe lacked a unique key. This patch fixes the issue by recording the unique key in `pg_class` when `create_index()` is called, so that the information can be read back when `search()` is called.
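As a rough illustration of why the unique key matters, the sketch below mimics the flow with plain Python containers. Everything in it is hypothetical: the `table_metadata` dict, the toy word-overlap "embedding", and the two functions, which only echo the names `create_index()` and `search()`. It is not how greenplumpython stores the key in `pg_class`.

```python
# Hypothetical in-memory stand-ins; NOT the greenplumpython implementation.
# table_metadata plays the role of per-table metadata kept by the database.
table_metadata = {}


def toy_embedding(text):
    # Toy "embedding": the set of lowercase words with punctuation stripped.
    return frozenset(word.strip(".,!?").lower() for word in text.split())


def create_index(table_name, rows, unique_key, column):
    # Record the unique key with the table so that a later search can read it.
    table_metadata[table_name] = {"unique_key": tuple(unique_key), "column": column}
    # The "embedding table" stores only the key values plus the embedding.
    return [
        {**{k: row[k] for k in unique_key}, "embedding": toy_embedding(row[column])}
        for row in rows
    ]


def search(table_name, rows, embedding_rows, query, limit=1):
    meta = table_metadata.get(table_name)
    if meta is None:
        raise LookupError(f"no unique key recorded for {table_name!r}")
    key = meta["unique_key"]
    query_words = toy_embedding(query)
    # Rank embedding rows by word overlap with the query (higher is closer).
    nearest = sorted(
        embedding_rows,
        key=lambda e: len(e["embedding"] & query_words),
        reverse=True,
    )[:limit]
    # Join back to the original rows on the recorded unique key.
    rows_by_key = {tuple(row[k] for k in key): row for row in rows}
    return [rows_by_key[tuple(e[k] for k in key)] for e in nearest]


rows = [
    {"id": 0, "id2": 1, "content": "I have a dog."},
    {"id": 1, "id2": 1, "content": "I like eating apples."},
]
embedding_rows = create_index("docs", rows, unique_key=("id", "id2"), column="content")
results = search("docs", rows, embedding_rows, query="eating apples")
```

Without the recorded key, the join in `search()` has nothing to match on, which is the failure mode this sketch raises as `LookupError`.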
@@ -19,8 +21,13 @@ def test_embedding_query_string(db: gp.Database):
        )
        .check_unique(columns={"id"})
- `check_unique()` actually does more than *check*: it creates indexes, which is not obvious from the function name. Shall we consider renaming the function?
- We need tests covering multiple columns for `check_unique()` and `search()`. Another PR will be fine, since it is not relevant to this one.
The word "check" comes from SQL (https://www.postgresql.org/docs/current/ddl-constraints.html), as in:
    CREATE TABLE products (
        product_no integer,
        name text,
        price numeric CHECK (price > 0)
    );
AFAIK, creating an index is the only way for a database to ensure that a set of columns contains only unique values.
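To make the point concrete, here is a small sketch using Python's built-in `sqlite3` module rather than PostgreSQL or Greenplum (an assumption made only so that the example runs without a server). The table name `t` and the index name `t_id_id2_key` are made up. It shows a multi-column unique index rejecting a duplicate key pair while accepting a new one.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, id2 INTEGER, content TEXT)")
# The unique index is what enforces uniqueness of the (id, id2) pair.
conn.execute("CREATE UNIQUE INDEX t_id_id2_key ON t (id, id2)")
conn.execute("INSERT INTO t VALUES (0, 1, 'I have a dog.')")
conn.execute("INSERT INTO t VALUES (1, 1, 'I like eating apples.')")


def try_insert(row):
    # Return True if the row was accepted, False if the index rejected it.
    try:
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert((1, 1, "duplicate key pair"))
new_pair_accepted = try_insert((1, 2, "same id, different id2"))
row_count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
```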
I will add a test case for a multi-column unique key.
[5.4.1. Check Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS)
[5.4.2. Not-Null Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-NOT-NULL)
[5.4.3. Unique Constraints](https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS)
5.4.1. Check Constraints
A check constraint is the most generic constraint type. It allows you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For instance, to require positive product prices, you could use:
    CREATE TABLE products (
        product_no integer,
        name text,
        price numeric CHECK (price > 0)
    );
Doesn't this mean `CHECK` is one kind of constraint, and `UNIQUE` is another kind of constraint?
Is `check` in `check_unique` a verb? Or am I missing something?
I think "constraints" is the object of "check".
That is, "check" is used as in "check constraints". In this example, `price > 0` is the constraint, and uniqueness is another type of constraint. Therefore, I think it makes sense to call this function `check_unique`.
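One way to see the two kinds of constraint side by side is the sketch below. It reuses the `products` table from the PostgreSQL docs but runs it on Python's built-in `sqlite3` (an assumption, for portability), and the `UNIQUE` on `product_no` is added here purely for illustration. In both cases the database checks the row and rejects it on violation.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE products (
        product_no integer UNIQUE,       -- uniqueness constraint (added for this sketch)
        name text,
        price numeric CHECK (price > 0)  -- check constraint from the docs example
    )
    """
)
conn.execute("INSERT INTO products VALUES (1, 'apple', 0.5)")


def violation(row):
    # Return the database's error message if the row is rejected, else None.
    try:
        conn.execute("INSERT INTO products VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


check_error = violation((2, "pear", -1))   # violates price > 0
unique_error = violation((1, "plum", 2))   # reuses product_no 1
row_count = conn.execute("SELECT count(*) FROM products").fetchone()[0]
```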
tests/test_embedding.py
Outdated
    for row in results:
        assert row["content"] == "I like eating apples."
Since there is only one item here, is there any need for a for loop?
Thanks! Changed.
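For reference, one loop-free way to assert on a single expected row is sketched below. The list of dicts is a hypothetical stand-in for the search results, not the actual dataframe object, and this is not necessarily the change that was made in the PR.

```python
# Hypothetical stand-in for the rows returned by the search.
results = [{"id": 1, "content": "I like eating apples."}]

# Tuple unpacking raises ValueError unless there is exactly one row,
# so it checks the row count and extracts the row in one step.
(row,) = results
assert row["content"] == "I like eating apples."
```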
tests/test_embedding.py
Outdated
@pytest.mark.requires_pgvector
def test_embedding_multi_col_unique(db: gp.Database):
    content = ["I have a dog.", "I like eating apples."]
    columns = {"id": range(len(content)), "id2": [1] * len(content), "content": content}
    t = (
        db.create_dataframe(columns=columns)
        .save_as(
            temp=True,
            column_names=list(columns.keys()),
            distribution_key={"id"},
            distribution_type="hash",
            drop_if_exists=True,
            drop_cascade=True,
        )
        .check_unique(columns={"id", "id2"})
    )
    t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
    print(
        "reloptions =",
        db._execute(
            f"SELECT reloptions FROM pg_class WHERE oid = '{t._qualified_table_name}'::regclass"
        ),
    )
    search_embeddings(t)
Does this test always pass? It seems we have no assert here.
Asserts are in `search_embeddings()`.
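A minimal sketch of that pattern, with made-up data and a simplified helper body (the real `search_embeddings()` is not shown in this thread): an `assert` that fails inside a helper raises `AssertionError` through the caller, so the calling test needs no assert of its own.

```python
def search_embeddings(rows):
    # Simplified, hypothetical helper: all assertions live here.
    matches = [r for r in rows if "apples" in r["content"]]
    assert len(matches) == 1
    assert matches[0]["content"] == "I like eating apples."


def check_with_match():
    # Caller with no assert of its own; relies entirely on the helper.
    search_embeddings(
        [{"content": "I have a dog."}, {"content": "I like eating apples."}]
    )


def check_without_match():
    # No row mentions apples, so the helper's first assert fails.
    search_embeddings([{"content": "I have a dog."}])


def outcome(check):
    # Tiny stand-in for a test runner: report whether the check raised.
    try:
        check()
    except AssertionError:
        return "failed"
    return "passed"


passing = outcome(check_with_match)
failing = outcome(check_without_match)
```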
tests/test_embedding.py
Outdated
    assert len(list(df)) == 1
    for row in df:
        assert row["content"] == "I like eating apples."
    t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
We don't need to assign the result to `t` anymore?
Thanks, let me fix it.
LGTM