-
Notifications
You must be signed in to change notification settings - Fork 16.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add redisearch vectorstore #1307
feat: add redisearch vectorstore #1307
Conversation
a jupyter notebook example would be very helpful! |
Thanks for opening this 💪🏼 My team at Redis is also taking a quick look. We might have some suggestions/optimizations but could always include those in a later PR! |
96b4d34
to
7169eb1
Compare
I had added jupyter example, please help review it |
I fixed the lint problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love to see this. Thanks for doing this as this was definitely on @tylerhutcherson and I's list. I have a couple high level comments.
-
I think the name
Redis
would be better since people might assumeRediSearch
is a different thing. First part of the connection could use theINFO MODULES
command to ensure that RediSearch is loaded. Another point is that we allow for vectors to be indexed in RedisJSON which would be confusing if people used the RediSearch container to use both modules without knowing they had to use the Redis-Stack container for this. -
In relation, I think it would be nice to also have the ability to store as hash or JSON. This, I can do in a follow up PR though.
-
I left some performance pieces in there, but I'm happy to also do these as a follow up.
-
Lots of points on connection arguments. It would be ideal to match the semantics of the REdis-py library so that this is a similar drop in for our community.
-
I didn't see the addition of a optional package like
pip install langchain[redis]
which I think would be really nice. pinning a version to ensure these features are present is a must though.
Let me know what you think, we are happy to contribute some of these changes if you get stumped. Please reach out for clarification.
Awesome!
langchain/vectorstores/redisearch.py
Outdated
# Prepare the Query | ||
return_fields = ["metadata", "content", "vector_score"] | ||
vector_field = "content_vector" | ||
hybrid_fields = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the user have the option to change this here. I might be missing some logic but our users love this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields are currently fixed, similar to ElasticVectorSearch
.
/langchain/vectorstores/elastic_vector_search.py#L189
request = {
"_op_type": "index",
"_index": index_name,
"vector": embeddings[i],
"text": text,
"metadata": metadata,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea which is ok, but that's fixed for when metadata
is a key. Redis allows for arbitrarily named keys to be passed in this argument.
Assuming the user created and used the index with langchain, this is totally fine, just thinking about use cases other than the typical. again, one thing we can contribute after the initial version too.
langchain/vectorstores/redisearch.py
Outdated
except ImportError: | ||
raise ValueError( | ||
"Could not import redis python package. " | ||
"Please install it with `pip install redis`." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
show pinned version? might be difficult unless it's a set constant. Slight maintenance burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referred to other code under 'vectorstores', but no specific version was given in the error prompt.
langchain/vectorstores/redisearch.py
Outdated
index_name = uuid.uuid4().hex | ||
prefix = "doc" # prefix for the document keys | ||
distance_metric = ( | ||
"COSINE" # distance metric for the vectors (ex. COSINE, IP, L2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow this to be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think default values can be used here. The purpose of this code is to create the corresponding data structure in Redis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will work for now, but for many production use cases - users will need the ability to define and specify index name, prefix, and distance metrics. But that can come later as optional arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tylerhutcherson expressed the main point behind what I meant. users will want to use HNSW for larger indices and change distance metrics based on process and normalization. but for now, this is fine and awesome to see.
langchain/vectorstores/redisearch.py
Outdated
metadata = TextField(name="metadata") | ||
content_embedding = VectorField( | ||
"content_vector", | ||
"FLAT", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow this to be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think default values can be used here. The purpose of this code is to create the corresponding data structure in Redis.
Thank you for the several comments you made.I have made fixes for the majority of the comments. Regarding the first point, I also initially thought the same way as you, not to use 'rediSearch'. However, considering that if the user directly uses a normal version of Redis and then uses this vectorstore, an error will occur, that's why it was changed to 'rediSearch'. The use of 'INFO MODULES' for checking, as you mentioned, is indeed a good point. So do you think the following approach is more appropriate? If so, I will change it to 'Redis' as the file name and class name, looking forward to your re-review, thank you very much. if "search" not in [m["name"] for m in client.info().get("modules")]:
raise ValueError("Could not use redis directly, you need to add search module"
"Please refer [RediSearch](https://redis.io/docs/stack/search/quick_start/)"
) |
@xinqiu Thanks for this! I just pulled and tested all of the code. Beyond what @Spartee shared for future improvements and the comment I made above, we think adding another integration test would be helpful here: def test_redisearch_new_vector() -> None:
"""Test adding a new document"""
texts = ["foo", "bar", "baz"]
docsearch = RediSearch.from_texts(
texts, FakeEmbeddings(), redisearch_url="redis://localhost:6379"
)
docsearch.add_texts(["foo"])
output = docsearch.similarity_search("foo", k=2)
assert output == [Document(page_content="foo"), Document(page_content="foo")] Your suggestion about using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice -- Overall this is a solid start! :) Our recommendation would still be to use Redis
as the vector store name and enforce the module check when loading. @hwchase17 Anything else you need to see here? We're happy to collaborate on other extensions, iterations, examples, and docs.
OK,I will refactor this. |
this looks pretty good to me! will defer to you on naming @tylerhutcherson. what you mean regarding enforcing the module check? isnt there a check already? |
There is a check for the installation of the Redis Python client - that looks great! But we (cc @xinqiu, @Spartee) discussed also adding an additional check on class init to confirm the presence of the |
@tylerhutcherson @Spartee Another issue with redis-py has been discovered. When trying to query based on index_name in client.ft(self.index_name).search(redis_query, params_dict), there is no restriction and it appears that Redis is currently returning full results of a match. Same problem like this |
Interesting. This is not the Also - worth noting that Redis associates keys to an index using the |
I have renamed it as Redis. |
@tylerhutcherson any updated thoughts? am happy to help get this into tmrws release if its close |
@hwchase17 All checks have passed, What else do I need to do? |
No other requirements from our end at the moment. But we can happily contribute another PR later with more examples, docs, and expanded functionality. Thanks @hwchase17 and @xinqiu !!! |
woo awesome! gonna slot this for wednesday release! thanks for the work @xinqiu and the review @tylerhutcherson ! @xinqiu do you have a twitter for a shoutout? |
Great, my twitter is @xinqiu_bot |
Description
Add
RediSearch
vectorstore for LangChainRediSearch: RediSearch quick start
How to use