[Question]: Index created from a LanceDBVectorStore is not returning all nodes? #14435
Comments
Hey there, @caes27! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for a human maintainer. The issue you're encountering is likely due to the Here's the relevant code snippet:

    table = self.connection.open_table(self.table_name)
    lance_query = (
        table.search(
            query=query.query_embedding,
            vector_column_name=self.vector_column_name,
        )
        .limit(query.similarity_top_k)
        .where(where)
        .nprobes(self.nprobes)
    )
    if self.refine_factor is not None:
        lance_query.refine_factor(self.refine_factor)

In this code, if To address this, you can try adjusting the

    vector_store = LanceDBVectorStore.from_table(table, refine_factor=None)

Additionally, ensure that the

    vector_store = LanceDBVectorStore.from_table(table, nprobes=20)

By fine-tuning these parameters, you should be able to control the number of nodes retrieved and improve the accuracy of your queries. |
Speaking of this, could this be added as a feature in the future? I've had to go into the LanceDBVectorStore and manually change the query_type myself when running evaluations on my RAG system. |
You're right, the current However, it is feasible to modify the

    @classmethod
    def from_table(cls, table: Any, refine_factor: Optional[int] = None, nprobes: int = 20) -> "LanceDBVectorStore":
        """Create instance from table."""
        try:
            if not isinstance(
                table, (lancedb.db.LanceTable, lancedb.remote.table.RemoteTable)
            ):
                raise Exception("argument is not lancedb table instance")
            return cls(table=table, refine_factor=refine_factor, nprobes=nprobes)
        except Exception as e:
            print("ldb version", lancedb.__version__)
            raise

This modification would allow users to specify For now, you might need to manually adjust these parameters in the

@logan-markewich, could you step in and provide your guidance on this feature request? |
Update: I ran some more tests, and the problem is not caused by creating a vector store from a lancedb table, as I had suspected. Here is some code:
Keep in mind, "all_leaf_nodes" contains 3558 nodes, but both times I retrieve nodes using the VectorStoreIndex as a retriever, it is being limited to 1498. Any idea of what might be happening? I can see the 3500+ nodes inside of my lancedb table directory. |
The retrieval of nodes from your Here is the relevant code snippet:

    lance_query = (
        self._table.search(
            query=_query,
            vector_column_name=self.vector_column_name,
        )
        .limit(query.similarity_top_k * self.overfetch_factor)
        .where(where)
    )

The Additionally, the default value of

    vector_store = LanceDBVectorStore(overfetch_factor=2)

This should help you retrieve more nodes if needed [2]. |
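As a rough sanity check on the limit expression in the snippet above, the following sketch computes the requested row limit and the most rows a search could hand back. The function names and the `table_rows` argument are hypothetical, and the upper bound assumes nothing else (filters, index settings) trims the result further.

```python
def requested_limit(similarity_top_k: int, overfetch_factor: int = 1) -> int:
    """Row limit passed to the search: similarity_top_k * overfetch_factor."""
    return similarity_top_k * overfetch_factor


def max_rows_returned(similarity_top_k: int, overfetch_factor: int, table_rows: int) -> int:
    """Upper bound on returned rows: a search cannot return more rows than the table holds."""
    return min(requested_limit(similarity_top_k, overfetch_factor), table_rows)


# Raising overfetch_factor only raises the requested limit.
print(requested_limit(1500, 2))          # 3000
# If the table holds fewer rows than requested, the table size is the cap.
print(max_rows_returned(2500, 1, 1510))  # 1510
```

If a query keeps returning the same count no matter how high the limit is set, the number of rows actually stored in the table is worth checking before tuning search parameters.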
Hi @caes27 , thanks for reporting the issue. I tested from integration end and came to the following conclusions (I used hierarchical parser and ingested 768 nodes into the DB) :
I am not sure, but it seems to be an issue in how the final results are built by the llama index retriever API / query engine API. From the lancedb integration API end it seems to be fine. Perhaps there is some minor docstore or storage context issue; I can make the fix if needed, but I am not sure what the fix is. Adding the query function debug code snippet:

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Query index for top k most similar nodes."""
        if query.filters is not None:
            if "where" in kwargs:
                raise ValueError(
                    "Cannot specify filter via both query and kwargs. "
                    "Use kwargs only for lancedb specific items that are "
                    "not supported via the generic query interface."
                )
            where = _to_lance_filter(query.filters, self._metadata_keys)
        else:
            where = kwargs.pop("where", None)
        query_type = kwargs.pop("query_type", self.query_type)
        # use a %s placeholder so the logging call formats without error
        _logger.info("query_type: %s", query_type)
        if query_type == "vector":
            _query = query.query_embedding
        else:
            if not isinstance(self._table, lancedb.db.LanceTable):
                raise ValueError(
                    "creating FTS index is not supported for LanceDB Cloud yet. "
                    "Please use a local table for FTS/Hybrid search."
                )
            if self._fts_index is None:
                self._fts_index = self._table.create_fts_index(
                    self.text_key, replace=True
                )
            if query_type == "hybrid":
                _query = (query.query_embedding, query.query_str)
            elif query_type == "fts":
                _query = query.query_str
            else:
                raise ValueError(f"Invalid query type: {query_type}")
        lance_query = (
            self._table.search(
                query=_query,
                vector_column_name=self.vector_column_name,
            )
            .limit(query.similarity_top_k * self.overfetch_factor)
            .where(where)
        )
        if query_type != "fts":
            lance_query.nprobes(self.nprobes)
        if query_type == "hybrid" and self._reranker is not None:
            _logger.info(f"using {self._reranker} for reranking results.")
            lance_query.rerank(reranker=self._reranker)
        if self.refine_factor is not None:
            lance_query.refine_factor(self.refine_factor)
        results = lance_query.to_pandas()
        if len(results) == 0:
            raise Warning("query results are empty..")
        nodes = []
        for _, item in results.iterrows():
            try:
                node = metadata_dict_to_node(item.metadata)
                node.embedding = list(item[self.vector_column_name])
            except Exception:
                # deprecated legacy logic for backward compatibility
                _logger.debug(
                    "Failed to parse Node metadata, fallback to legacy logic."
                )
                if item.metadata:
                    metadata, node_info, _relation = legacy_metadata_dict_to_node(
                        item.metadata, text_key=self.text_key
                    )
                else:
                    metadata, node_info = {}, {}
                node = TextNode(
                    text=item[self.text_key] or "",
                    id_=item.id,
                    metadata=metadata,
                    start_char_idx=node_info.get("start", None),
                    end_char_idx=node_info.get("end", None),
                    relationships={
                        NodeRelationship.SOURCE: RelatedNodeInfo(
                            node_id=item[self.doc_id_key]
                        ),
                    },
                )
            nodes.append(node)
        # _logger.info("nodes: %d", len(nodes))
        print("nodes :", len(nodes))  # this returns the correct no. of nodes as per similarity_top_k
        return VectorStoreQueryResult(
            nodes=nodes,
            similarities=_to_llama_similarities(results),
            ids=results["id"].tolist(),
        )

|
Hello @raghavdixit99, thank you for helping me, I really appreciate it. There are a bunch of things that are weird. I rechunked a smaller set of documents and ingested 3500 nodes into a separate lancedb table. I set similarity_top_k to 1500 and by adding your debugging statement of:
It correctly showed 1500 nodes being returned, but in the final response:
This outputted 1488 nodes, so some nodes were lost in this process. It was kind of fascinating how yours went from 700 to 234. But there is also another issue. Since there are 3500 documents, I wanted to test it with a larger limit/similarity_top_k. I set it to 2500 and every time, both by using:
The top piece of code returned 1510 nodes. The limit/similarity_top_k was set to 2500, so what is going on here? I think this is a bigger issue than the nodes being lost in the final stages of the retrieval process. Tagging for visibility: @logan-markewich |
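One way to narrow down where nodes go missing between the vector store result and the final retriever output is to compare the two sets of IDs directly. The helper below is only a debugging sketch: `compare_ids` is a hypothetical name, and how the two ID lists are collected (for instance from the `ids` field of the query result and from the retrieved nodes) is an assumption that depends on your setup. It does not diagnose the cause, it only shows which IDs differ.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


def compare_ids(
    store_ids: Iterable[str], retrieved_ids: Iterable[str]
) -> Dict[str, Union[int, List[str]]]:
    """Summarise how the IDs returned by the store differ from those finally retrieved."""
    store_list = list(store_ids)
    retrieved = set(retrieved_ids)
    counts = Counter(store_list)
    return {
        "store_total": len(store_list),   # rows returned by the vector store
        "store_unique": len(counts),      # distinct IDs among those rows
        "duplicates": sorted(i for i, c in counts.items() if c > 1),
        "missing": sorted(set(store_list) - retrieved),  # in store result, not in final output
    }


report = compare_ids(["a", "b", "b", "c"], ["a", "b"])
print(report["store_total"], report["store_unique"])  # 4 3
print(report["duplicates"])                           # ['b']
print(report["missing"])                              # ['c']
```

If `store_unique` is smaller than `store_total`, repeated IDs would be one candidate explanation for a shorter final list; if `missing` is non-empty, those IDs can be looked up individually.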
@caes27 , a lancedb search : Additionally, I locally tested it via Perhaps your table has not ingested all the data or your uri needs a refresh ( As for the final retrieval results coming less than expected I have already covered that in my comment and tagged Logan, we should wait for his response as it seems like a parsing problem from the base retriever class. Thanks |
Hey @raghavdixit99, I believe you when you say the I have refreshed the uri multiple times and same issue. Maybe it's a matter of how nodes are being ingested into the lancedb table when you do this:
I can't see anywhere else where it can go wrong. If you have time, maybe you can try it on your end by populating the table with 2000+ nodes and see if you get the same issue? Thank you! |
Did more digging. As I was populating the table little by little, instead of sending it 25000+ nodes at once, I realized something. Suppose my table has 500 nodes in it currently and I want to add 300 more nodes to the table. I run the following code:
After this is done, there should be 800 nodes in the lancedb table, but after I execute the following code:
nodes3 is of length 300, which were the nodes I just added. It ignores the 500 nodes that were in the lancedb table previously. Is this not the correct way to add nodes to an existing lancedb table? |
Hi @caes27, since you are ingesting data iteratively, you should try changing the mode to “append”. By default the table overwrites its data, which could be the reason for this behavior.
|
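The difference between the two modes in the 500 + 300 scenario can be illustrated with a small in-memory stand-in. `ToyTable` is a hypothetical class written for this sketch; it is not the lancedb API and only mimics the overwrite/append semantics described in this thread.

```python
class ToyTable:
    """In-memory stand-in for a table (hypothetical, not the lancedb API)."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def add(self, new_rows, mode="append"):
        if mode == "overwrite":
            # Existing rows are discarded and replaced by the new ones.
            self.rows = list(new_rows)
        elif mode == "append":
            # New rows are added after the existing ones.
            self.rows.extend(new_rows)
        else:
            raise ValueError(f"Invalid mode: {mode}")


existing = [f"old-{i}" for i in range(500)]
new = [f"new-{i}" for i in range(300)]

overwritten = ToyTable(existing)
overwritten.add(new, mode="overwrite")
print(len(overwritten.rows))  # 300

appended = ToyTable(existing)
appended.add(new, mode="append")
print(len(appended.rows))  # 800
```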
Hello @raghavdixit99, I think I might have found the issue that was causing problems. First, I noticed some faulty logic in the LanceDBVectorStore's "add" method, and fixed it myself. At the same time, I thought about upgrading the package and this also fixed it lol. I also tried your solution yesterday, and it works if the table already exists and has some sort of data in it. However, it throws an error when the table is empty. So that is a workaround, but this fix in the "add" method seemed to solve ingesting data into a fresh table: Previous:
After:
From what I saw, the data was ingested in batches and when the second batch came around, because it was in "overwrite" mode, the second batch's data would completely wipe the first batch's data and so on. The other problem with LlamaIndex losing some nodes in the retrieval process still persists, so I'm still waiting. Thanks again Raghav for your help throughout this whole thread. |
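The batch behavior described above can be reproduced with a few lines of plain Python. This is a sketch of the described effect only, not the actual patch (the "Previous"/"After" code is not shown in this thread); `ingest` and its `first_mode`/`later_mode` parameters are hypothetical names.

```python
from typing import List, Sequence


def ingest(records: Sequence[int], batch_size: int, first_mode: str, later_mode: str) -> List[int]:
    """Write records batch by batch into a list that stands in for the table."""
    table: List[int] = []
    for start in range(0, len(records), batch_size):
        batch = list(records[start:start + batch_size])
        mode = first_mode if start == 0 else later_mode
        if mode == "overwrite":
            table = batch          # earlier batches are wiped
        elif mode == "append":
            table = table + batch  # earlier batches are kept
        else:
            raise ValueError(f"Invalid mode: {mode}")
    return table


records = range(3500)

# Every batch overwrites: only the last batch survives.
print(len(ingest(records, 1000, "overwrite", "overwrite")))  # 500

# First batch starts the table fresh, later batches append: everything survives.
print(len(ingest(records, 1000, "overwrite", "append")))     # 3500
```

Switching to append only after the first batch is one way to keep fresh-table ingestion working while still avoiding the wipe on later batches.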
Hi @caes27, that is not faulty logic. You are hardcoding the mode argument, and we added it as a user-supplied input per users' requirements. Please follow the usage from my last comment; for the rest, we are waiting on Logan's response. |
@raghavdixit99 I think the problem @caes27 is pointing out is that
The llama index code is using the |
Hi @caes27 In your example, I saw |
Hi @logan-markewich @raghavdixit99 @spearki but since data is ingested in batches, and the latest batch keeps overwriting the previous one, in the end the VectorDB will be initialized with only `input_record_size % insert_batch_size` records
Could you kindly provide a patch update to fix this issue? @caes27 's solution solved it for me |
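The record count in the comment above can be checked with a short calculation. The sketch below assumes every batch overwrites the previous one; `surviving_records` is a hypothetical helper, and the batch size of 2048 used in the example is an assumption, not a value confirmed in this thread. Note one edge case: when the input size is an exact multiple of the batch size, the last batch is a full batch, so the result is the batch size rather than 0.

```python
def surviving_records(input_record_size: int, insert_batch_size: int) -> int:
    """Records left in the table if each batch overwrites the previous one."""
    if insert_batch_size <= 0:
        raise ValueError("insert_batch_size must be positive")
    if input_record_size == 0:
        return 0
    remainder = input_record_size % insert_batch_size
    # An exact multiple means the final batch is a full batch.
    return remainder if remainder else insert_batch_size


print(surviving_records(3558, 2048))  # 1510
print(surviving_records(4096, 2048))  # 2048
```

With 3558 records and the assumed batch size, the result is 1510, the same figure reported earlier in the thread, though that may be a coincidence.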
It would be great if @caes27 or @manfredwang093 can open a PR for this, and maybe include a unit test :) Tbh there have been several updates in lancedb since this issue was opened, I'm not even sure if this is an issue still |
Encountered the same issue and looking forward to a bug fix. |
Question Validation
Question
I don't know what I am doing wrong. I chunked a few hundred documents using the HierarchicalNodeParser and stored them in a lanceDB database using this guide. It has about 24000 leaf nodes in it.
If I want to query the data, I use the code down below:
What this seems to be doing is initially grabbing the same exact 1080 nodes from the database, then ranking them based on vector similarity to query. I tried tuning the overfetch_factor and nprobes parameters of the LanceDBVectorStore, but this seems to do nothing. I am very confused on what I might be doing wrong? Any help?