Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add redisearch vectorstore #1307

Merged

Conversation

xinqiu
Copy link
Contributor

@xinqiu xinqiu commented Feb 26, 2023

Description

Add RediSearch vectorstore for LangChain

RediSearch: RediSearch quick start

How to use

from langchain.vectorstores.redisearch import RediSearch

rds = RediSearch.from_documents(docs, embeddings,redisearch_url="redis://localhost:6379")

@hwchase17
Copy link
Contributor

a jupyter notebook example would be very helpful!

@tylerhutcherson
Copy link
Contributor

Thanks for opening this 💪🏼

My team at Redis is also taking a quick look. We might have some suggestions/optimizations but could always include those in a later PR!

@xinqiu xinqiu changed the title WIP: feat: add redisearch vectorstore feat: add redisearch vectorstore Feb 27, 2023
@xinqiu xinqiu force-pushed the feat/add_vectorstore_redisearch branch from 96b4d34 to 7169eb1 Compare February 27, 2023 16:04
@xinqiu
Copy link
Contributor Author

xinqiu commented Feb 27, 2023

a jupyter notebook example would be very helpful!

I had added jupyter example, please help review it

@xinqiu
Copy link
Contributor Author

xinqiu commented Feb 28, 2023

I fixed the lint problem.

Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love to see this. Thanks for doing this as this was definitely on @tylerhutcherson and I's list. I have a couple high level comments.

  1. I think the name Redis would be better since people might assume RediSearch is a different thing. First part of the connection could use the INFO MODULES command to ensure that RediSearch is loaded. Another point is that we allow for vectors to be indexed in RedisJSON which would be confusing if people used the RediSearch container to use both modules without knowing they had to use the Redis-Stack container for this.

  2. In relation, I think it would be nice to also have the ability to store as hash or JSON. This, I can do in a follow up PR though.

  3. I left some performance pieces in there, but I'm happy to also do these as a follow up.

  4. Lots of points on connection arguments. It would be ideal to match the semantics of the REdis-py library so that this is a similar drop in for our community.

  5. I didn't see the addition of a optional package like pip install langchain[redis] which I think would be really nice. pinning a version to ensure these features are present is a must though.

Let me know what you think, we are happy to contribute some of these changes if you get stumped. Please reach out for clarification.

Awesome!

langchain/vectorstores/redisearch.py Outdated Show resolved Hide resolved
langchain/vectorstores/redisearch.py Outdated Show resolved Hide resolved
# Prepare the Query
return_fields = ["metadata", "content", "vector_score"]
vector_field = "content_vector"
hybrid_fields = "*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the user have the option to change this here. I might be missing some logic but our users love this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are currently fixed, similar to ElasticVectorSearch.

/langchain/vectorstores/elastic_vector_search.py#L189

request = {
                "_op_type": "index",
                "_index": index_name,
                "vector": embeddings[i],
                "text": text,
                "metadata": metadata,
            }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea which is ok, but that's fixed for when metadata is a key. Redis allows for arbitrarily named keys to be passed in this argument.

Assuming the user created and used the index with langchain, this is totally fine, just thinking about use cases other than the typical. again, one thing we can contribute after the initial version too.

except ImportError:
raise ValueError(
"Could not import redis python package. "
"Please install it with `pip install redis`."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show pinned version? might be difficult unless it's a set constant. Slight maintenance burden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referred to other code under 'vectorstores', but no specific version was given in the error prompt.

langchain/vectorstores/redisearch.py Outdated Show resolved Hide resolved
langchain/vectorstores/redisearch.py Outdated Show resolved Hide resolved
index_name = uuid.uuid4().hex
prefix = "doc" # prefix for the document keys
distance_metric = (
"COSINE" # distance metric for the vectors (ex. COSINE, IP, L2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow this to be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think default values can be used here. The purpose of this code is to create the corresponding data structure in Redis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will work for now, but for many production use cases - users will need the ability to define and specify index name, prefix, and distance metrics. But that can come later as optional arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylerhutcherson expressed the main point behind what I meant. users will want to use HNSW for larger indices and change distance metrics based on process and normalization. but for now, this is fine and awesome to see.

metadata = TextField(name="metadata")
content_embedding = VectorField(
"content_vector",
"FLAT",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow this to be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think default values can be used here. The purpose of this code is to create the corresponding data structure in Redis.

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 1, 2023

Love to see this. Thanks for doing this as this was definitely on @tylerhutcherson and I's list. I have a couple high level comments.

  1. I think the name Redis would be better since people might assume RediSearch is a different thing. First part of the connection could use the INFO MODULES command to ensure that RediSearch is loaded. Another point is that we allow for vectors to be indexed in RedisJSON which would be confusing if people used the RediSearch container to use both modules without knowing they had to use the Redis-Stack container for this.
  2. In relation, I think it would be nice to also have the ability to store as hash or JSON. This, I can do in a follow up PR though.
  3. I left some performance pieces in there, but I'm happy to also do these as a follow up.
  4. Lots of points on connection arguments. It would be ideal to match the semantics of the REdis-py library so that this is a similar drop in for our community.
  5. I didn't see the addition of a optional package like pip install langchain[redis] which I think would be really nice. pinning a version to ensure these features are present is a must though.

Let me know what you think, we are happy to contribute some of these changes if you get stumped. Please reach out for clarification.

Awesome!

Thank you for the several comments you made.I have made fixes for the majority of the comments.

Regarding the first point, I also initially thought the same way as you, not to use 'rediSearch'. However, considering that if the user directly uses a normal version of Redis and then uses this vectorstore, an error will occur, that's why it was changed to 'rediSearch'. The use of 'INFO MODULES' for checking, as you mentioned, is indeed a good point.

So do you think the following approach is more appropriate? If so, I will change it to 'Redis' as the file name and class name, looking forward to your re-review, thank you very much.

if "search" not in [m["name"] for m in client.info().get("modules")]:
    raise ValueError("Could not use redis directly, you need to add search module"
                     "Please refer [RediSearch](https://redis.io/docs/stack/search/quick_start/)"
                     )

@tylerhutcherson
Copy link
Contributor

@xinqiu Thanks for this! I just pulled and tested all of the code. Beyond what @Spartee shared for future improvements and the comment I made above, we think adding another integration test would be helpful here:

def test_redisearch_new_vector() -> None:
    """Test adding a new document"""
    texts = ["foo", "bar", "baz"]
    docsearch = RediSearch.from_texts(
        texts, FakeEmbeddings(), redisearch_url="redis://localhost:6379"
    )
    docsearch.add_texts(["foo"])
    output = docsearch.similarity_search("foo", k=2)
    assert output == [Document(page_content="foo"), Document(page_content="foo")]

Your suggestion about using the client.info().get("modules") cmd above is also good!

@xinqiu xinqiu requested review from tylerhutcherson and Spartee and removed request for tylerhutcherson and Spartee March 2, 2023 01:42
Copy link
Contributor

@tylerhutcherson tylerhutcherson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- Overall this is a solid start! :) Our recommendation would still be to use Redis as the vector store name and enforce the module check when loading. @hwchase17 Anything else you need to see here? We're happy to collaborate on other extensions, iterations, examples, and docs.

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 3, 2023

Nice -- Overall this is a solid start! :) Our recommendation would still be to use Redis as the vector store name and enforce the module check when loading. @hwchase17 Anything else you need to see here? We're happy to collaborate on other extensions, iterations, examples, and docs.

OK,I will refactor this.

@hwchase17
Copy link
Contributor

this looks pretty good to me! will defer to you on naming @tylerhutcherson. what you mean regarding enforcing the module check? isnt there a check already?

@tylerhutcherson
Copy link
Contributor

this looks pretty good to me! will defer to you on naming @tylerhutcherson. what you mean regarding enforcing the module check? isnt there a check already?

There is a check for the installation of the Redis Python client - that looks great! But we (cc @xinqiu, @Spartee) discussed also adding an additional check on class init to confirm the presence of the RediSearch "module" installed on the Redis instance.

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 6, 2023

@tylerhutcherson @Spartee Another issue with redis-py has been discovered. When trying to query based on index_name in client.ft(self.index_name).search(redis_query, params_dict), there is no restriction and it appears that Redis is currently returning full results of a match.

Same problem like this

@tylerhutcherson
Copy link
Contributor

tylerhutcherson commented Mar 6, 2023

@tylerhutcherson @Spartee Another issue with redis-py has been discovered. When trying to query based on index_name in client.ft(self.index_name).search(redis_query, params_dict), there is no restriction and it appears that Redis is currently returning full results of a match.

Same problem like this

Interesting. This is not the redis Python library though, it's redisearch which is older. All search functionality has been ported to the official Redis library here. Possible that has something to do with it.

Also - worth noting that Redis associates keys to an index using the prefix argument. So any Redis HASH that is prefixed by prefix: will be included in the index. It's more flexible that way. So if you have docs that need to be in different indices, you use a different prefix for those keys.

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 8, 2023

I have renamed it as Redis.

@hwchase17
Copy link
Contributor

hwchase17 commented Mar 9, 2023

@tylerhutcherson any updated thoughts? am happy to help get this into tmrws release if its close

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 12, 2023

@hwchase17 All checks have passed, What else do I need to do?

@tylerhutcherson
Copy link
Contributor

No other requirements from our end at the moment. But we can happily contribute another PR later with more examples, docs, and expanded functionality. Thanks @hwchase17 and @xinqiu !!!

@hwchase17
Copy link
Contributor

woo awesome! gonna slot this for wednesday release! thanks for the work @xinqiu and the review @tylerhutcherson !

@xinqiu do you have a twitter for a shoutout?

@xinqiu
Copy link
Contributor Author

xinqiu commented Mar 14, 2023

woo awesome! gonna slot this for wednesday release! thanks for the work @xinqiu and the review @tylerhutcherson !

@xinqiu do you have a twitter for a shoutout?

Great, my twitter is @xinqiu_bot

@hwchase17 hwchase17 merged commit 4e13cef into langchain-ai:master Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants